* [PATCH v6 0/4] Add Mediated device support
From: Kirti Wankhede @ 2016-08-03 19:03 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, kevin.tian, qemu-devel, Kirti Wankhede, bjsdjshi

This series adds mediated device support to the Linux host kernel. Its
purpose is to provide a common interface for mediated device management
that can be used by different devices. The series introduces an mdev core
module that creates and manages mediated devices, a VFIO-based driver for
mediated PCI devices created by the mdev core module, and updates to the
VFIO type1 IOMMU module to support mediated devices.

What's new in v6?
- Removed the per-mdev_device lock for registration callbacks. Vendor drivers
  should implement locking if they need to serialize registration callbacks.
- Added mapped-region tracking logic and an invalidation function to be used
  by vendor drivers.
- Moved the vfio_pin_pages and vfio_unpin_pages APIs from the IOMMU type1
  driver to the vfio driver. Added callbacks to the vfio ops structure to
  support pin and unpin APIs in the backend iommu module.
- Used uuid_le_to_bin() to parse the UUID string and convert it to binary.
  This requires the following commits from the Linux master branch:
* commit bc9dc9d5eec908806f1b15c9ec2253d44dcf7835 :
	lib/uuid.c: use correct offset in uuid parser
* commit 2b1b0d66704a8cafe83be7114ec4c15ab3a314ad :
	lib/uuid.c: introduce a few more generic helpers
- Requires the commits below from the Linux master branch for the mmap region
  fault handler, which uses remap_pfn_range() to set up EPT properly.
* commit add6a0cd1c5ba51b201e1361b05a5df817083618
	KVM: MMU: try to fix up page faults before giving up
* commit 92176a8ede577d0ff78ab3298e06701f67ad5f51 :
	KVM: MMU: prepare to support mapping of VM_IO and VM_PFNMAP frames

Tested:
- Single vGPU VM
- Multiple vGPU VMs on the same GPU


Thanks,
Kirti


Kirti Wankhede (4):
  vfio: Mediated device Core driver
  vfio: VFIO driver for mediated PCI device
  vfio iommu: Add support for mediated devices
  docs: Add Documentation for Mediated devices

 Documentation/vfio-mediated-device.txt | 235 ++++++++++++
 drivers/vfio/Kconfig                   |   1 +
 drivers/vfio/Makefile                  |   1 +
 drivers/vfio/mdev/Kconfig              |  18 +
 drivers/vfio/mdev/Makefile             |   6 +
 drivers/vfio/mdev/mdev_core.c          | 676 +++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_driver.c        | 142 +++++++
 drivers/vfio/mdev/mdev_private.h       |  33 ++
 drivers/vfio/mdev/mdev_sysfs.c         | 269 +++++++++++++
 drivers/vfio/mdev/vfio_mpci.c          | 536 ++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h    |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c       |   1 +
 drivers/vfio/vfio.c                    |  82 ++++
 drivers/vfio/vfio_iommu_type1.c        | 499 +++++++++++++++++++++---
 include/linux/mdev.h                   | 236 ++++++++++++
 include/linux/vfio.h                   |  20 +-
 16 files changed, 2707 insertions(+), 54 deletions(-)
 create mode 100644 Documentation/vfio-mediated-device.txt
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c
 create mode 100644 include/linux/mdev.h

-- 
2.7.0

* [PATCH v6 1/4] vfio: Mediated device Core driver
From: Kirti Wankhede @ 2016-08-03 19:03 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, kevin.tian, qemu-devel, Kirti Wankhede, bjsdjshi

Design of the mediated device driver:
The main purpose of this driver is to provide a common interface for
mediated device management that can be used by drivers of different
devices.

This module provides a generic interface to create a device, add it to the
mediated bus, add the device to an IOMMU group, and then add it to a VFIO
group.

Below is a high-level block diagram, with NVIDIA, Intel, and IBM devices
as examples, since these are the devices that are going to actively use
this module as of now.

 +---------------+
 |               |
 | +-----------+ |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |           | |                         |              |
 | |  mdev     | +------------------------>+              |<-> VFIO user
 | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
 | |  driver   | |                         |              |
 | |           | |                         +--------------+
 | |           | |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |           | |                         |              |
 | |           | +------------------------>+              |<-> VFIO user
 | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
 |               |                         |              |
 |  MDEV CORE    |                         +--------------+
 |   MODULE      |
 |   mdev.ko     |
 | +-----------+ |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         |  nvidia.ko   |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--------------+
 | | interface | |<------------------------+              |
 | |           | |                         |  i915.ko     |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | |           | |
 | |           | |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | +-----------+ |
 +---------------+

The core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when a new device is created
  * @remove: called when a device is removed
  * @match: called when a new device or driver is added for this bus.
	    Returns 1 if the given device can be handled by the given
	    driver, zero otherwise.
  * @driver: device driver structure
  *
  **/
struct mdev_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         int  (*match)(struct device *dev);
         struct device_driver    driver;
};

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

A mediated device's bus driver should use this interface to register with
the core driver. With this, the mediated device bus driver is responsible
for adding the mediated device to the VFIO group.

2. Physical device driver interface
This interface provides the vendor driver a set of APIs to manage
physical-device-related work in its own driver. The APIs are:
- supported_config: provide the list of configurations supported by the
		    vendor driver
- create: allocate basic resources in the vendor driver for a mediated
	  device
- destroy: free resources in the vendor driver when the mediated device
	   is destroyed
- reset: free and reallocate resources in the vendor driver during reboot
- start: initiate the mediated device initialization process from the
	 vendor driver
- shutdown: tear down mediated device resources during shutdown
- read: read emulation callback
- write: write emulation callback
- set_irqs: send interrupt configuration information that the VMM sets
- get_region_info: provide the region size and its flags for the mediated
		   device
- validate_map_request: validate a remap pfn request

Vendor drivers should use this registration interface to register each
physical device with the mdev core driver.
Locks to serialize the above callbacks have been removed; if required, a
vendor driver can serialize these APIs with its own locks.

Added support to keep track of physical mappings for each mdev device.
APIs to be used by the mediated device bus driver to add and delete
mappings in the tracking logic:
int mdev_add_phys_mapping(struct mdev_device *mdev,
                          struct address_space *mapping,
                          unsigned long addr, unsigned long size)
void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr)

API to be used by the vendor driver to invalidate a mapping:
int mdev_device_invalidate_mapping(struct mdev_device *mdev,
                                   unsigned long addr, unsigned long size)

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I73a5084574270b14541c529461ea2f03c292d510
---
 drivers/vfio/Kconfig             |   1 +
 drivers/vfio/Makefile            |   1 +
 drivers/vfio/mdev/Kconfig        |  12 +
 drivers/vfio/mdev/Makefile       |   5 +
 drivers/vfio/mdev/mdev_core.c    | 676 +++++++++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_driver.c  | 142 ++++++++
 drivers/vfio/mdev/mdev_private.h |  33 ++
 drivers/vfio/mdev/mdev_sysfs.c   | 269 ++++++++++++++++
 include/linux/mdev.h             | 236 ++++++++++++++
 9 files changed, 1375 insertions(+)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 include/linux/mdev.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index da6e2ce77495..23eced02aaf6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
 
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
+source "drivers/vfio/mdev/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f63fea..4a23c13b6be4 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
+obj-$(CONFIG_VFIO_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
new file mode 100644
index 000000000000..a34fbc66f92f
--- /dev/null
+++ b/drivers/vfio/mdev/Kconfig
@@ -0,0 +1,12 @@
+
+config VFIO_MDEV
+    tristate "Mediated device driver framework"
+    depends on VFIO
+    default n
+    help
+        Provides a framework to virtualize devices.
+	See Documentation/vfio-mediated-device.txt for more details.
+
+        If you don't know what to do here, say N.
+
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
new file mode 100644
index 000000000000..56a75e689582
--- /dev/null
+++ b/drivers/vfio/mdev/Makefile
@@ -0,0 +1,5 @@
+
+mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
+
+obj-$(CONFIG_VFIO_MDEV) += mdev.o
+
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
new file mode 100644
index 000000000000..90ff073abfce
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -0,0 +1,676 @@
+/*
+ * Mediated device Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION		"0.1"
+#define DRIVER_AUTHOR		"NVIDIA Corporation"
+#define DRIVER_DESC		"Mediated device Core Driver"
+
+#define MDEV_CLASS_NAME		"mdev"
+
+static LIST_HEAD(parent_list);
+static DEFINE_MUTEX(parent_list_lock);
+
+static int mdev_add_attribute_group(struct device *dev,
+				    const struct attribute_group **groups)
+{
+	return sysfs_create_groups(&dev->kobj, groups);
+}
+
+static void mdev_remove_attribute_group(struct device *dev,
+					const struct attribute_group **groups)
+{
+	sysfs_remove_groups(&dev->kobj, groups);
+}
+
+/* Should be called holding parent->mdev_list_lock */
+static struct mdev_device *find_mdev_device(struct parent_device *parent,
+					    uuid_le uuid, int instance)
+{
+	struct mdev_device *mdev;
+
+	list_for_each_entry(mdev, &parent->mdev_list, next) {
+		if ((uuid_le_cmp(mdev->uuid, uuid) == 0) &&
+		    (mdev->instance == instance))
+			return mdev;
+	}
+	return NULL;
+}
+
+/* Should be called holding parent_list_lock */
+static struct parent_device *find_parent_device(struct device *dev)
+{
+	struct parent_device *parent;
+
+	list_for_each_entry(parent, &parent_list, next) {
+		if (parent->dev == dev)
+			return parent;
+	}
+	return NULL;
+}
+
+static void mdev_release_parent(struct kref *kref)
+{
+	struct parent_device *parent = container_of(kref, struct parent_device,
+						    ref);
+	kfree(parent);
+}
+
+static
+inline struct parent_device *mdev_get_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_get(&parent->ref);
+
+	return parent;
+}
+
+static inline void mdev_put_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_put(&parent->ref, mdev_release_parent);
+}
+
+static struct parent_device *mdev_get_parent_by_dev(struct device *dev)
+{
+	struct parent_device *parent = NULL, *p;
+
+	mutex_lock(&parent_list_lock);
+	list_for_each_entry(p, &parent_list, next) {
+		if (p->dev == dev) {
+			parent = mdev_get_parent(p);
+			break;
+		}
+	}
+	mutex_unlock(&parent_list_lock);
+	return parent;
+}
+
+static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret;
+
+	ret = parent->ops->create(mdev, mdev_params);
+	if (ret)
+		return ret;
+
+	ret = mdev_add_attribute_group(&mdev->dev,
+					parent->ops->mdev_attr_groups);
+	if (ret)
+		parent->ops->destroy(mdev);
+
+	return ret;
+}
+
+static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	/*
+	 * If vendor driver doesn't return success that means vendor
+	 * driver doesn't support hot-unplug
+	 */
+	ret = parent->ops->destroy(mdev);
+	if (ret && !force)
+		return -EBUSY;
+
+	mdev_remove_attribute_group(&mdev->dev,
+				    parent->ops->mdev_attr_groups);
+
+	return ret;
+}
+
+static void mdev_release_device(struct kref *kref)
+{
+	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
+	struct parent_device *parent = mdev->parent;
+
+	list_del(&mdev->next);
+	mutex_unlock(&parent->mdev_list_lock);
+
+	device_unregister(&mdev->dev);
+	wake_up(&parent->release_done);
+	mdev_put_parent(parent);
+}
+
+struct mdev_device *mdev_get_device(struct mdev_device *mdev)
+{
+	kref_get(&mdev->ref);
+	return mdev;
+}
+EXPORT_SYMBOL(mdev_get_device);
+
+void mdev_put_device(struct mdev_device *mdev)
+{
+	struct parent_device *parent = mdev->parent;
+
+	kref_put_mutex(&mdev->ref, mdev_release_device,
+		       &parent->mdev_list_lock);
+}
+EXPORT_SYMBOL(mdev_put_device);
+
+/*
+ * Find first mediated device from given uuid and increment refcount of
+ * mediated device. Caller should call mdev_put_device() when the use of
+ * mdev_device is done.
+ */
+static struct mdev_device *mdev_get_first_device_by_uuid(uuid_le uuid)
+{
+	struct mdev_device *mdev = NULL, *p;
+	struct parent_device *parent;
+
+	mutex_lock(&parent_list_lock);
+	list_for_each_entry(parent, &parent_list, next) {
+		mutex_lock(&parent->mdev_list_lock);
+		list_for_each_entry(p, &parent->mdev_list, next) {
+			if (uuid_le_cmp(p->uuid, uuid) == 0) {
+				mdev = mdev_get_device(p);
+				break;
+			}
+		}
+		mutex_unlock(&parent->mdev_list_lock);
+
+		if (mdev)
+			break;
+	}
+	mutex_unlock(&parent_list_lock);
+	return mdev;
+}
+
+/*
+ * Find mediated device from given iommu_group and increment refcount of
+ * mediated device. Caller should call mdev_put_device() when the use of
+ * mdev_device is done.
+ */
+struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
+{
+	struct mdev_device *mdev = NULL, *p;
+	struct parent_device *parent;
+
+	mutex_lock(&parent_list_lock);
+	list_for_each_entry(parent, &parent_list, next) {
+		mutex_lock(&parent->mdev_list_lock);
+		list_for_each_entry(p, &parent->mdev_list, next) {
+			if (!p->group)
+				continue;
+
+			if (iommu_group_id(p->group) == iommu_group_id(group)) {
+				mdev = mdev_get_device(p);
+				break;
+			}
+		}
+		mutex_unlock(&parent->mdev_list_lock);
+
+		if (mdev)
+			break;
+	}
+	mutex_unlock(&parent_list_lock);
+	return mdev;
+}
+EXPORT_SYMBOL(mdev_get_device_by_group);
+
+/*
+ * mdev_register_device : Register a device
+ * @dev: device structure representing parent device.
+ * @ops: Parent device operation structure to be registered.
+ *
+ * Add device to list of registered parent devices.
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_device(struct device *dev, const struct parent_ops *ops)
+{
+	int ret = 0;
+	struct parent_device *parent;
+
+	if (!dev || !ops)
+		return -EINVAL;
+
+	/* check for mandatory ops */
+	if (!ops->create || !ops->destroy)
+		return -EINVAL;
+
+	mutex_lock(&parent_list_lock);
+
+	/* Check for duplicate */
+	parent = find_parent_device(dev);
+	if (parent) {
+		ret = -EEXIST;
+		goto add_dev_err;
+	}
+
+	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
+	if (!parent) {
+		ret = -ENOMEM;
+		goto add_dev_err;
+	}
+
+	kref_init(&parent->ref);
+	list_add(&parent->next, &parent_list);
+
+	parent->dev = dev;
+	parent->ops = ops;
+	mutex_init(&parent->mdev_list_lock);
+	INIT_LIST_HEAD(&parent->mdev_list);
+	init_waitqueue_head(&parent->release_done);
+	mutex_unlock(&parent_list_lock);
+
+	ret = mdev_create_sysfs_files(dev);
+	if (ret)
+		goto add_sysfs_error;
+
+	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
+	if (ret)
+		goto add_group_error;
+
+	dev_info(dev, "MDEV: Registered\n");
+	return 0;
+
+add_group_error:
+	mdev_remove_sysfs_files(dev);
+add_sysfs_error:
+	mutex_lock(&parent_list_lock);
+	list_del(&parent->next);
+	mutex_unlock(&parent_list_lock);
+	mdev_put_parent(parent);
+	return ret;
+
+add_dev_err:
+	mutex_unlock(&parent_list_lock);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_register_device);
+
+/*
+ * mdev_unregister_device : Unregister a parent device
+ * @dev: device structure representing parent device.
+ *
+ * Remove device from list of registered parent devices. Give a chance to free
+ * existing mediated devices for given device.
+ */
+
+void mdev_unregister_device(struct device *dev)
+{
+	struct parent_device *parent;
+	struct mdev_device *mdev, *n;
+	int ret;
+
+	mutex_lock(&parent_list_lock);
+	parent = find_parent_device(dev);
+
+	if (!parent) {
+		mutex_unlock(&parent_list_lock);
+		return;
+	}
+	dev_info(dev, "MDEV: Unregistering\n");
+
+	/*
+	 * Remove parent from the list and remove create and destroy sysfs
+	 * files so that no new mediated device could be created for this parent
+	 */
+	list_del(&parent->next);
+	mdev_remove_sysfs_files(dev);
+	mutex_unlock(&parent_list_lock);
+
+	mdev_remove_attribute_group(dev,
+				    parent->ops->dev_attr_groups);
+
+	mutex_lock(&parent->mdev_list_lock);
+	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
+		mdev_device_destroy_ops(mdev, true);
+		mutex_unlock(&parent->mdev_list_lock);
+		mdev_put_device(mdev);
+		mutex_lock(&parent->mdev_list_lock);
+	}
+	mutex_unlock(&parent->mdev_list_lock);
+
+	do {
+		ret = wait_event_interruptible_timeout(parent->release_done,
+				list_empty(&parent->mdev_list), HZ * 10);
+		if (ret == -ERESTARTSYS) {
+			dev_warn(dev, "Mediated devices are in use, task"
+				      " \"%s\" (%d) "
+				      "blocked until all are released",
+				      current->comm, task_pid_nr(current));
+		}
+	} while (ret <= 0);
+
+	mdev_put_parent(parent);
+}
+EXPORT_SYMBOL(mdev_unregister_device);
+
+/*
+ * Functions required for mdev_sysfs
+ */
+static void mdev_device_release(struct device *dev)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	dev_dbg(&mdev->dev, "MDEV: destroying\n");
+	kfree(mdev);
+}
+
+int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
+		       char *mdev_params)
+{
+	int ret;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	parent = mdev_get_parent_by_dev(dev);
+	if (!parent)
+		return -EINVAL;
+
+	mutex_lock(&parent->mdev_list_lock);
+	/* Check for duplicate */
+	mdev = find_mdev_device(parent, uuid, instance);
+	if (mdev) {
+		ret = -EEXIST;
+		goto create_err;
+	}
+
+	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
+	if (!mdev) {
+		ret = -ENOMEM;
+		goto create_err;
+	}
+
+	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
+	mdev->instance = instance;
+	mdev->parent = parent;
+	kref_init(&mdev->ref);
+
+	mdev->dev.parent  = dev;
+	mdev->dev.bus     = &mdev_bus_type;
+	mdev->dev.release = mdev_device_release;
+	dev_set_name(&mdev->dev, "%pUl-%d", uuid.b, instance);
+
+	ret = device_register(&mdev->dev);
+	if (ret) {
+		put_device(&mdev->dev);
+		goto create_err;
+	}
+
+	ret = mdev_device_create_ops(mdev, mdev_params);
+	if (ret)
+		goto create_failed;
+
+	list_add(&mdev->next, &parent->mdev_list);
+	mutex_unlock(&parent->mdev_list_lock);
+
+	dev_dbg(&mdev->dev, "MDEV: created\n");
+
+	return ret;
+
+create_failed:
+	device_unregister(&mdev->dev);
+
+create_err:
+	mutex_unlock(&parent->mdev_list_lock);
+	mdev_put_parent(parent);
+	return ret;
+}
+
+int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance)
+{
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	int ret;
+
+	parent = mdev_get_parent_by_dev(dev);
+	if (!parent)
+		return -EINVAL;
+
+	mutex_lock(&parent->mdev_list_lock);
+	mdev = find_mdev_device(parent, uuid, instance);
+	if (!mdev) {
+		ret = -EINVAL;
+		goto destroy_err;
+	}
+
+	ret = mdev_device_destroy_ops(mdev, false);
+	if (ret)
+		goto destroy_err;
+
+	mutex_unlock(&parent->mdev_list_lock);
+	mdev_put_device(mdev);
+
+	mdev_put_parent(parent);
+	return ret;
+
+destroy_err:
+	mutex_unlock(&parent->mdev_list_lock);
+	mdev_put_parent(parent);
+	return ret;
+}
+
+int mdev_device_invalidate_mapping(struct mdev_device *mdev,
+				   unsigned long addr, unsigned long size)
+{
+	int ret = -EINVAL;
+	struct mdev_phys_mapping *phys_mappings;
+	struct addr_desc *addr_desc;
+
+	if (!mdev || !mdev->phys_mappings.mapping)
+		return ret;
+
+	phys_mappings = &mdev->phys_mappings;
+
+	mutex_lock(&phys_mappings->addr_desc_list_lock);
+
+	list_for_each_entry(addr_desc, &phys_mappings->addr_desc_list, next) {
+
+		if ((addr > addr_desc->start) &&
+		    (addr + size < addr_desc->start + addr_desc->size)) {
+			unmap_mapping_range(phys_mappings->mapping,
+					    addr, size, 0);
+			ret = 0;
+			goto unlock_exit;
+		}
+	}
+
+unlock_exit:
+	mutex_unlock(&phys_mappings->addr_desc_list_lock);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_device_invalidate_mapping);
+
+/* Sanity check for the physical mapping list for mediated device */
+
+int mdev_add_phys_mapping(struct mdev_device *mdev,
+			  struct address_space *mapping,
+			  unsigned long addr, unsigned long size)
+{
+	struct mdev_phys_mapping *phys_mappings;
+	struct addr_desc *addr_desc, *new_addr_desc;
+	int ret = 0;
+
+	if (!mdev)
+		return -EINVAL;
+
+	phys_mappings = &mdev->phys_mappings;
+	if (phys_mappings->mapping && (mapping != phys_mappings->mapping))
+		return -EINVAL;
+
+	if (!phys_mappings->mapping) {
+		phys_mappings->mapping = mapping;
+		mutex_init(&phys_mappings->addr_desc_list_lock);
+		INIT_LIST_HEAD(&phys_mappings->addr_desc_list);
+	}
+
+	mutex_lock(&phys_mappings->addr_desc_list_lock);
+
+	list_for_each_entry(addr_desc, &phys_mappings->addr_desc_list, next) {
+		if ((addr + size < addr_desc->start) ||
+		    (addr_desc->start + addr_desc->size) < addr)
+			continue;
+		else {
+			/* should be no overlap */
+			ret = -EINVAL;
+			goto mapping_exit;
+		}
+	}
+
+	/* add the new entry to the list */
+	new_addr_desc = kzalloc(sizeof(*new_addr_desc), GFP_KERNEL);
+
+	if (!new_addr_desc) {
+		ret = -ENOMEM;
+		goto mapping_exit;
+	}
+
+	new_addr_desc->start = addr;
+	new_addr_desc->size = size;
+	list_add(&new_addr_desc->next, &phys_mappings->addr_desc_list);
+
+mapping_exit:
+	mutex_unlock(&phys_mappings->addr_desc_list_lock);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_add_phys_mapping);
+
+void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr)
+{
+	struct mdev_phys_mapping *phys_mappings;
+	struct addr_desc *addr_desc;
+
+	if (!mdev)
+		return;
+
+	phys_mappings = &mdev->phys_mappings;
+
+	mutex_lock(&phys_mappings->addr_desc_list_lock);
+	list_for_each_entry(addr_desc, &phys_mappings->addr_desc_list, next) {
+		if (addr_desc->start == addr) {
+			list_del(&addr_desc->next);
+			kfree(addr_desc);
+			break;
+		}
+	}
+	mutex_unlock(&phys_mappings->addr_desc_list_lock);
+}
+EXPORT_SYMBOL(mdev_del_phys_mapping);
+
+void mdev_device_supported_config(struct device *dev, char *str)
+{
+	struct parent_device *parent;
+
+	parent = mdev_get_parent_by_dev(dev);
+
+	if (parent) {
+		if (parent->ops->supported_config)
+			parent->ops->supported_config(parent->dev, str);
+		mdev_put_parent(parent);
+	}
+}
+
+int mdev_device_start(uuid_le uuid)
+{
+	int ret = 0;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	mdev = mdev_get_first_device_by_uuid(uuid);
+	if (!mdev)
+		return -EINVAL;
+
+	parent = mdev->parent;
+
+	if (parent->ops->start)
+		ret = parent->ops->start(mdev->uuid);
+
+	if (ret)
+		pr_err("mdev_start failed  %d\n", ret);
+	else
+		kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
+
+	mdev_put_device(mdev);
+
+	return ret;
+}
+
+int mdev_device_stop(uuid_le uuid)
+{
+	int ret = 0;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	mdev = mdev_get_first_device_by_uuid(uuid);
+	if (!mdev)
+		return -EINVAL;
+
+	parent = mdev->parent;
+
+	if (parent->ops->stop)
+		ret = parent->ops->stop(mdev->uuid);
+
+	if (ret)
+		pr_err("mdev stop failed %d\n", ret);
+	else
+		kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
+
+	mdev_put_device(mdev);
+	return ret;
+}
+
+static struct class mdev_class = {
+	.name		= MDEV_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= mdev_class_attrs,
+};
+
+static int __init mdev_init(void)
+{
+	int ret;
+
+	ret = class_register(&mdev_class);
+	if (ret) {
+		pr_err("Failed to register mdev class\n");
+		return ret;
+	}
+
+	ret = mdev_bus_register();
+	if (ret) {
+		pr_err("Failed to register mdev bus\n");
+		class_unregister(&mdev_class);
+		return ret;
+	}
+
+	return ret;
+}
+
+static void __exit mdev_exit(void)
+{
+	mdev_bus_unregister();
+	class_unregister(&mdev_class);
+}
+
+module_init(mdev_init)
+module_exit(mdev_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
new file mode 100644
index 000000000000..00680bd06224
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_driver.c
@@ -0,0 +1,142 @@
+/*
+ * MDEV driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+static int mdev_attach_iommu(struct mdev_device *mdev)
+{
+	int ret;
+	struct iommu_group *group;
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
+		return PTR_ERR(group);
+	}
+
+	ret = iommu_group_add_device(group, &mdev->dev);
+	if (ret) {
+		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
+		goto attach_fail;
+	}
+
+	mdev->group = group;
+
+	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
+				 iommu_group_id(group));
+attach_fail:
+	iommu_group_put(group);
+	return ret;
+}
+
+static void mdev_detach_iommu(struct mdev_device *mdev)
+{
+	iommu_group_remove_device(&mdev->dev);
+	mdev->group = NULL;
+	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
+}
+
+static int mdev_probe(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	ret = mdev_attach_iommu(mdev);
+	if (ret) {
+		dev_err(dev, "Failed to attach IOMMU\n");
+		return ret;
+	}
+
+	if (drv && drv->probe)
+		ret = drv->probe(dev);
+
+	if (ret)
+		mdev_detach_iommu(mdev);
+
+	return ret;
+}
+
+static int mdev_remove(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	if (drv && drv->remove)
+		drv->remove(dev);
+
+	mdev_detach_iommu(mdev);
+
+	return 0;
+}
+
+static int mdev_match(struct device *dev, struct device_driver *driver)
+{
+	struct mdev_driver *drv = to_mdev_driver(driver);
+
+	if (drv && drv->match)
+		return drv->match(dev);
+
+	return 0;
+}
+
+struct bus_type mdev_bus_type = {
+	.name		= "mdev",
+	.match		= mdev_match,
+	.probe		= mdev_probe,
+	.remove		= mdev_remove,
+};
+EXPORT_SYMBOL_GPL(mdev_bus_type);
+
+/*
+ * mdev_register_driver - register a new MDEV driver
+ * @drv: the driver to register
+ * @owner: module owner of driver to be registered
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &mdev_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_register_driver);
+
+/*
+ * mdev_unregister_driver - unregister MDEV driver
+ * @drv: the driver to unregister
+ *
+ */
+void mdev_unregister_driver(struct mdev_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_unregister_driver);
+
+int mdev_bus_register(void)
+{
+	return bus_register(&mdev_bus_type);
+}
+
+void mdev_bus_unregister(void)
+{
+	bus_unregister(&mdev_bus_type);
+}
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
new file mode 100644
index 000000000000..ee2db61a8091
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -0,0 +1,33 @@
+/*
+ * Mediated device internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_PRIVATE_H
+#define MDEV_PRIVATE_H
+
+int  mdev_bus_register(void);
+void mdev_bus_unregister(void);
+
+/* Function prototypes for mdev_sysfs */
+
+extern struct class_attribute mdev_class_attrs[];
+
+int  mdev_create_sysfs_files(struct device *dev);
+void mdev_remove_sysfs_files(struct device *dev);
+
+int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
+			char *mdev_params);
+int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
+void mdev_device_supported_config(struct device *dev, char *str);
+int  mdev_device_start(uuid_le uuid);
+int  mdev_device_stop(uuid_le uuid);
+
+#endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
new file mode 100644
index 000000000000..e0457e68cf78
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -0,0 +1,269 @@
+/*
+ * File attributes for Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+/* Prototypes */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static DEVICE_ATTR_RO(mdev_supported_types);
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_create);
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_destroy);
+
+/* Static functions */
+
+
+#define SUPPORTED_TYPE_BUFFER_LENGTH	4096
+
+/* mdev sysfs Functions */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	char *str, *ptr;
+	ssize_t n;
+
+	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	ptr = str;
+	mdev_device_supported_config(dev, str);
+
+	n = sprintf(buf, "%s\n", str);
+	kfree(ptr);
+
+	return n;
+}
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	char *str, *pstr;
+	char *uuid_str, *instance_str, *mdev_params = NULL, *params = NULL;
+	uuid_le uuid;
+	uint32_t instance;
+	int ret;
+
+	pstr = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_create: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (!str) {
+		pr_err("mdev_create: mdev instance not present %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	instance_str = strsep(&str, ":");
+	if (!instance_str) {
+		pr_err("mdev_create: Empty instance string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	ret = kstrtouint(instance_str, 0, &instance);
+	if (ret) {
+		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
+		goto create_error;
+	}
+
+	if (str)
+		params = mdev_params = kstrdup(str, GFP_KERNEL);
+
+	ret = uuid_le_to_bin(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_create: UUID parse error %s\n", buf);
+		goto create_error;
+	}
+
+	ret = mdev_device_create(dev, uuid, instance, mdev_params);
+	if (ret)
+		pr_err("mdev_create: Failed to create mdev device\n");
+	else
+		ret = count;
+
+create_error:
+	kfree(params);
+	kfree(pstr);
+	return ret;
+}
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	char *uuid_str, *str, *pstr;
+	uuid_le uuid;
+	unsigned int instance;
+	int ret;
+
+	str = pstr = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	if (str == NULL) {
+		pr_err("mdev_destroy: instance not specified %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	ret = kstrtouint(str, 0, &instance);
+	if (ret) {
+		pr_err("mdev_destroy: instance parsing error %s\n", buf);
+		goto destroy_error;
+	}
+
+	ret = uuid_le_to_bin(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_destroy: UUID parse error %s\n", buf);
+		goto destroy_error;
+	}
+
+	ret = mdev_device_destroy(dev, uuid, instance);
+	if (ret == 0)
+		ret = count;
+
+destroy_error:
+	kfree(pstr);
+	return ret;
+}
+
+ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
+			 const char *buf, size_t count)
+{
+	char *uuid_str, *ptr;
+	uuid_le uuid;
+	int ret;
+
+	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_le_to_bin(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_start: UUID parse error %s\n", buf);
+		goto start_error;
+	}
+
+	ret = mdev_device_start(uuid);
+	if (ret == 0)
+		ret = count;
+
+start_error:
+	kfree(ptr);
+	return ret;
+}
+
+ssize_t mdev_stop_store(struct class *class, struct class_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *uuid_str, *ptr;
+	uuid_le uuid;
+	int ret;
+
+	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_le_to_bin(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_stop: UUID parse error %s\n", buf);
+		goto stop_error;
+	}
+
+	ret = mdev_device_stop(uuid);
+	if (ret == 0)
+		ret = count;
+
+stop_error:
+	kfree(ptr);
+	return ret;
+
+}
+
+struct class_attribute mdev_class_attrs[] = {
+	__ATTR_WO(mdev_start),
+	__ATTR_WO(mdev_stop),
+	__ATTR_NULL
+};
+
+int mdev_create_sysfs_files(struct device *dev)
+{
+	int ret;
+
+	ret = sysfs_create_file(&dev->kobj,
+				&dev_attr_mdev_supported_types.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_supported_types sysfs entry\n");
+		return ret;
+	}
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_create sysfs entry\n");
+		goto create_sysfs_failed;
+	}
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_destroy sysfs entry\n");
+		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+		goto create_sysfs_failed;
+	}
+
+	return 0;
+
+create_sysfs_failed:
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	return ret;
+}
+
+void mdev_remove_sysfs_files(struct device *dev)
+{
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+}
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
new file mode 100644
index 000000000000..0b41f301a9b7
--- /dev/null
+++ b/include/linux/mdev.h
@@ -0,0 +1,236 @@
+/*
+ * Mediated device definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_H
+#define MDEV_H
+
+#include <uapi/linux/vfio.h>
+
+struct parent_device;
+
+/*
+ * Mediated device
+ */
+
+struct addr_desc {
+	unsigned long start;
+	unsigned long size;
+	struct list_head next;
+};
+
+struct mdev_phys_mapping {
+	struct address_space *mapping;
+	struct list_head addr_desc_list;
+	struct mutex addr_desc_list_lock;
+};
+
+struct mdev_device {
+	struct device		dev;
+	struct parent_device	*parent;
+	struct iommu_group	*group;
+	uuid_le			uuid;
+	uint32_t		instance;
+	void			*driver_data;
+
+	/* internal only */
+	struct kref		ref;
+	struct list_head	next;
+
+	struct mdev_phys_mapping phys_mappings;
+};
+
+
+/**
+ * struct parent_ops - Structure to be registered for each parent device to
+ * register the device with the mdev module.
+ *
+ * @owner:		The module owner.
+ * @dev_attr_groups:	Default attributes of the parent device.
+ * @mdev_attr_groups:	Default attributes of the mediated device.
+ * @supported_config:	Called to get information about supported types.
+ *			@dev : device structure of parent device.
+ *			@config: should return string listing supported config
+ *			Returns integer: success (0) or error (< 0)
+ * @create:		Called to allocate basic resources in parent device's
+ *			driver for a particular mediated device. It is
+ *			mandatory to provide create ops.
+ *			@mdev: mdev_device structure of the mediated device
+ *			       that is being created
+ *			@mdev_params: extra parameters required by parent
+ *			device's driver.
+ *			Returns integer: success (0) or error (< 0)
+ * @destroy:		Called to free resources in parent device's driver for
+ *			a mediated device instance. It is mandatory to provide
+ *			destroy ops.
+ *			@mdev: mdev_device device structure which is being
+ *			       destroyed
+ *			Returns integer: success (0) or error (< 0)
+ *			If the VMM is running and destroy() is called, the
+ *			mdev is being hot-unplugged. Return an error if the
+ *			VMM is running and the driver doesn't support mediated
+ *			device hotplug.
+ * @reset:		Called to reset mediated device.
+ *			@mdev: mdev_device device structure
+ *			Returns integer: success (0) or error (< 0)
+ * @start:		Called to initiate mediated device initialization
+ *			process in parent device's driver before VMM starts.
+ *			@uuid: UUID
+ *			Returns integer: success (0) or error (< 0)
+ * @stop:		Called to teardown mediated device related resources
+ *			@uuid: UUID
+ *			Returns integer: success (0) or error (< 0)
+ * @read:		Read emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: read buffer
+ *			@count: number of bytes to read
+ *			@pos: address.
+ *			Returns number of bytes read on success, or an error.
+ * @write:		Write emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: write buffer
+ *			@count: number of bytes to be written
+ *			@pos: address.
+ *			Returns number of bytes written on success, or an error.
+ * @set_irqs:		Called to convey the interrupt configuration
+ *			information that the VMM sets.
+ *			@mdev: mediated device structure
+ *			@flags, index, start, count and *data : same as that of
+ *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
+ * @get_region_info:	Called to get VFIO region size and flags of mediated
+ *			device.
+ *			@mdev: mediated device structure
+ *			@region_index: VFIO region index
+ *			@region_info: output, returns size and flags of
+ *				      requested region.
+ *			Returns integer: success (0) or error (< 0)
+ * @validate_map_request: Validate remap pfn request
+ *			@mdev: mediated device structure
+ *			@pos: address
+ *			@virtaddr: target user address to start at. Vendor
+ *				   driver can change if required.
+ *			@pfn: parent address of kernel memory, vendor driver
+ *			      can change if required.
+ *			@size: size of map area, vendor driver can change the
+ *			       size of map area if desired.
+ *			@prot: page protection flags for this mapping, vendor
+ *			       driver can change, if required.
+ *			Returns integer: success (0) or error (< 0)
+ *
+ * A parent device that supports mediated devices should be registered with
+ * the mdev module along with its parent_ops structure.
+ */
+
+struct parent_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **mdev_attr_groups;
+
+	int	(*supported_config)(struct device *dev, char *config);
+	int     (*create)(struct mdev_device *mdev, char *mdev_params);
+	int     (*destroy)(struct mdev_device *mdev);
+	int     (*reset)(struct mdev_device *mdev);
+	int     (*start)(uuid_le uuid);
+	int     (*stop)(uuid_le uuid);
+	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
+			loff_t pos);
+	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
+			 loff_t pos);
+	int     (*set_irqs)(struct mdev_device *mdev, uint32_t flags,
+			    unsigned int index, unsigned int start,
+			    unsigned int count, void *data);
+	int	(*get_region_info)(struct mdev_device *mdev, int region_index,
+				   struct vfio_region_info *region_info);
+	int	(*validate_map_request)(struct mdev_device *mdev, loff_t pos,
+					u64 *virtaddr, unsigned long *pfn,
+					unsigned long *size, pgprot_t *prot);
+};
+
+/*
+ * Parent Device
+ */
+
+struct parent_device {
+	struct device		*dev;
+	const struct parent_ops	*ops;
+
+	/* internal */
+	struct kref		ref;
+	struct list_head	next;
+	struct list_head	mdev_list;
+	struct mutex		mdev_list_lock;
+	wait_queue_head_t	release_done;
+};
+
+/**
+ * struct mdev_driver - Mediated device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @match: called when new device or driver is added for this bus. Return 1 if
+ *	   given device can be handled by given driver and zero otherwise.
+ * @driver: device driver structure
+ *
+ **/
+struct mdev_driver {
+	const char *name;
+	int  (*probe)(struct device *dev);
+	void (*remove)(struct device *dev);
+	int  (*match)(struct device *dev);
+	struct device_driver driver;
+};
+
+static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
+}
+
+static inline struct mdev_device *to_mdev_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
+}
+
+static inline void *mdev_get_drvdata(struct mdev_device *mdev)
+{
+	return mdev->driver_data;
+}
+
+static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
+{
+	mdev->driver_data = data;
+}
+
+extern struct bus_type mdev_bus_type;
+
+#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
+
+extern int  mdev_register_device(struct device *dev,
+				 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
+extern void mdev_put_device(struct mdev_device *mdev);
+
+extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
+
+extern int mdev_device_invalidate_mapping(struct mdev_device *mdev,
+					unsigned long addr, unsigned long size);
+
+extern int mdev_add_phys_mapping(struct mdev_device *mdev,
+				 struct address_space *mapping,
+				 unsigned long addr, unsigned long size);
+
+
+extern void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr);
+#endif /* MDEV_H */
-- 
2.7.0


* [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver
@ 2016-08-03 19:03   ` Kirti Wankhede
  0 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-03 19:03 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

Design for Mediated Device Driver:
The main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different
devices.

This module provides a generic interface to create a device, add it to the
mediated bus, add it to an IOMMU group, and then add it to a VFIO group.

Below is the high-level block diagram, with NVIDIA, Intel and IBM devices
as examples, since these are the devices that are going to actively use
this module initially.

 +---------------+
 |               |
 | +-----------+ |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |           | |                         |              |
 | |  mdev     | +------------------------>+              |<-> VFIO user
 | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
 | |  driver   | |                         |              |
 | |           | |                         +--------------+
 | |           | |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |           | |                         |              |
 | |           | +------------------------>+              |<-> VFIO user
 | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
 |               |                         |              |
 |  MDEV CORE    |                         +--------------+
 |   MODULE      |
 |   mdev.ko     |
 | +-----------+ |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         |  nvidia.ko   |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--------------+
 | | interface | |<------------------------+              |
 | |           | |                         |  i915.ko     |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | |           | |
 | |           | |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | +-----------+ |
 +---------------+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when new device created
  * @remove: called when device removed
  * @match: called when new device or driver is added for this bus.
	    Return 1 if given device can be handled by given driver and
	    zero otherwise.
  * @driver:device driver structure
  *
  **/
struct mdev_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         int  (*match)(struct device *dev);
         struct device_driver    driver;
};

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

A mediated bus driver should use this interface to register with the core
driver. With this, the mediated bus driver is responsible for adding the
mediated device to the VFIO group.
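
As an illustration, a minimal mediated bus driver registration could look
like the sketch below (all names hypothetical, error handling elided):

static int sample_mdev_probe(struct device *dev)
{
	/* set up per-device state; the bus driver would also add
	 * the mediated device to a VFIO group here
	 */
	return 0;
}

static void sample_mdev_remove(struct device *dev)
{
	/* undo whatever probe() set up */
}

static struct mdev_driver sample_mdev_driver = {
	.name   = "sample_mdev",
	.probe  = sample_mdev_probe,
	.remove = sample_mdev_remove,
};

static int __init sample_init(void)
{
	return mdev_register_driver(&sample_mdev_driver, THIS_MODULE);
}

static void __exit sample_exit(void)
{
	mdev_unregister_driver(&sample_mdev_driver);
}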

2. Physical device driver interface
This interface provides vendor driver the set APIs to manage physical
device related work in their own driver. APIs are :
- supported_config: provide the list of configurations supported by the
		    vendor driver
- create: to allocate basic resources in vendor driver for a mediated
	  device.
- destroy: to free resources in vendor driver when mediated device is
	   destroyed.
- reset: to free and reallocate resources in vendor driver during reboot
- start: to initiate mediated device initialization process from vendor
	 driver
- stop: to tear down mediated device resources.
- read : read emulation callback.
- write: write emulation callback.
- set_irqs: send interrupt configuration information that VMM sets.
- get_region_info: to provide region size and its flags for the mediated
		   device.
- validate_map_request: to validate remap pfn request.

This registration interface should be used by vendor drivers to register
each physical device to mdev core driver.
Locks to serialize above callbacks are removed. If required, vendor driver
can have locks to serialize above APIs in their driver.
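
For reference, a vendor driver would register its physical device roughly
as in the sketch below (names hypothetical; only the mandatory
create/destroy callbacks are shown):

static int sample_create(struct mdev_device *mdev, char *mdev_params)
{
	/* allocate vendor-specific resources for this mediated device */
	return 0;
}

static int sample_destroy(struct mdev_device *mdev)
{
	/* free resources; return an error if a VMM is running and
	 * hot-unplug is not supported
	 */
	return 0;
}

static const struct parent_ops sample_parent_ops = {
	.owner   = THIS_MODULE,
	.create  = sample_create,	/* mandatory */
	.destroy = sample_destroy,	/* mandatory */
};

/* from the vendor driver's probe routine for the physical device: */
ret = mdev_register_device(dev, &sample_parent_ops);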

Added support to keep track of physical mappings for each mdev device.
APIs to be used by mediated device bus driver to add and delete mappings to
tracking logic:
int mdev_add_phys_mapping(struct mdev_device *mdev,
                          struct address_space *mapping,
                          unsigned long addr, unsigned long size)
void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr)

API to be used by vendor driver to invalidate mapping:
int mdev_device_invalidate_mapping(struct mdev_device *mdev,
                                   unsigned long addr, unsigned long size)
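
A possible usage flow for these APIs (hypothetical sketch):

/* mediated bus driver: record the mapping after it is established,
 * e.g. in the mmap fault handler
 */
ret = mdev_add_phys_mapping(mdev, vma->vm_file->f_mapping,
			    virtaddr, map_size);

/* vendor driver: zap the user mapping when the backing must change */
mdev_device_invalidate_mapping(mdev, virtaddr, map_size);

/* mediated bus driver: drop the tracking entry on teardown */
mdev_del_phys_mapping(mdev, virtaddr);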

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I73a5084574270b14541c529461ea2f03c292d510
---
 drivers/vfio/Kconfig             |   1 +
 drivers/vfio/Makefile            |   1 +
 drivers/vfio/mdev/Kconfig        |  12 +
 drivers/vfio/mdev/Makefile       |   5 +
 drivers/vfio/mdev/mdev_core.c    | 676 +++++++++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_driver.c  | 142 ++++++++
 drivers/vfio/mdev/mdev_private.h |  33 ++
 drivers/vfio/mdev/mdev_sysfs.c   | 269 ++++++++++++++++
 include/linux/mdev.h             | 236 ++++++++++++++
 9 files changed, 1375 insertions(+)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 include/linux/mdev.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index da6e2ce77495..23eced02aaf6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
 
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
+source "drivers/vfio/mdev/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f63fea..4a23c13b6be4 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
+obj-$(CONFIG_VFIO_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
new file mode 100644
index 000000000000..a34fbc66f92f
--- /dev/null
+++ b/drivers/vfio/mdev/Kconfig
@@ -0,0 +1,12 @@
+
+config VFIO_MDEV
+	tristate "Mediated device driver framework"
+	depends on VFIO
+	default n
+	help
+	  Provides a framework to virtualize devices.
+	  See Documentation/vfio-mediated-device.txt for more details.
+
+	  If you don't know what to do here, say N.
+
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
new file mode 100644
index 000000000000..56a75e689582
--- /dev/null
+++ b/drivers/vfio/mdev/Makefile
@@ -0,0 +1,5 @@
+
+mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
+
+obj-$(CONFIG_VFIO_MDEV) += mdev.o
+
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
new file mode 100644
index 000000000000..90ff073abfce
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -0,0 +1,676 @@
+/*
+ * Mediated device Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION		"0.1"
+#define DRIVER_AUTHOR		"NVIDIA Corporation"
+#define DRIVER_DESC		"Mediated device Core Driver"
+
+#define MDEV_CLASS_NAME		"mdev"
+
+static LIST_HEAD(parent_list);
+static DEFINE_MUTEX(parent_list_lock);
+
+static int mdev_add_attribute_group(struct device *dev,
+				    const struct attribute_group **groups)
+{
+	return sysfs_create_groups(&dev->kobj, groups);
+}
+
+static void mdev_remove_attribute_group(struct device *dev,
+					const struct attribute_group **groups)
+{
+	sysfs_remove_groups(&dev->kobj, groups);
+}
+
+/* Should be called holding parent->mdev_list_lock */
+static struct mdev_device *find_mdev_device(struct parent_device *parent,
+					    uuid_le uuid, int instance)
+{
+	struct mdev_device *mdev;
+
+	list_for_each_entry(mdev, &parent->mdev_list, next) {
+		if ((uuid_le_cmp(mdev->uuid, uuid) == 0) &&
+		    (mdev->instance == instance))
+			return mdev;
+	}
+	return NULL;
+}
+
+/* Should be called holding parent_list_lock */
+static struct parent_device *find_parent_device(struct device *dev)
+{
+	struct parent_device *parent;
+
+	list_for_each_entry(parent, &parent_list, next) {
+		if (parent->dev == dev)
+			return parent;
+	}
+	return NULL;
+}
+
+static void mdev_release_parent(struct kref *kref)
+{
+	struct parent_device *parent = container_of(kref, struct parent_device,
+						    ref);
+	kfree(parent);
+}
+
+static
+inline struct parent_device *mdev_get_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_get(&parent->ref);
+
+	return parent;
+}
+
+static inline void mdev_put_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_put(&parent->ref, mdev_release_parent);
+}
+
+static struct parent_device *mdev_get_parent_by_dev(struct device *dev)
+{
+	struct parent_device *parent = NULL, *p;
+
+	mutex_lock(&parent_list_lock);
+	list_for_each_entry(p, &parent_list, next) {
+		if (p->dev == dev) {
+			parent = mdev_get_parent(p);
+			break;
+		}
+	}
+	mutex_unlock(&parent_list_lock);
+	return parent;
+}
+
+static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret;
+
+	ret = parent->ops->create(mdev, mdev_params);
+	if (ret)
+		return ret;
+
+	ret = mdev_add_attribute_group(&mdev->dev,
+					parent->ops->mdev_attr_groups);
+	if (ret)
+		parent->ops->destroy(mdev);
+
+	return ret;
+}
+
+static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	/*
+	 * If vendor driver doesn't return success that means vendor
+	 * driver doesn't support hot-unplug
+	 */
+	ret = parent->ops->destroy(mdev);
+	if (ret && !force)
+		return -EBUSY;
+
+	mdev_remove_attribute_group(&mdev->dev,
+				    parent->ops->mdev_attr_groups);
+
+	return ret;
+}
+
+static void mdev_release_device(struct kref *kref)
+{
+	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
+	struct parent_device *parent = mdev->parent;
+
+	list_del(&mdev->next);
+	mutex_unlock(&parent->mdev_list_lock);
+
+	device_unregister(&mdev->dev);
+	wake_up(&parent->release_done);
+	mdev_put_parent(parent);
+}
+
+struct mdev_device *mdev_get_device(struct mdev_device *mdev)
+{
+	kref_get(&mdev->ref);
+	return mdev;
+}
+EXPORT_SYMBOL(mdev_get_device);
+
+void mdev_put_device(struct mdev_device *mdev)
+{
+	struct parent_device *parent = mdev->parent;
+
+	kref_put_mutex(&mdev->ref, mdev_release_device,
+		       &parent->mdev_list_lock);
+}
+EXPORT_SYMBOL(mdev_put_device);
+
+/*
+ * Find first mediated device from given uuid and increment refcount of
+ * mediated device. Caller should call mdev_put_device() when the use of
+ * mdev_device is done.
+ */
+static struct mdev_device *mdev_get_first_device_by_uuid(uuid_le uuid)
+{
+	struct mdev_device *mdev = NULL, *p;
+	struct parent_device *parent;
+
+	mutex_lock(&parent_list_lock);
+	list_for_each_entry(parent, &parent_list, next) {
+		mutex_lock(&parent->mdev_list_lock);
+		list_for_each_entry(p, &parent->mdev_list, next) {
+			if (uuid_le_cmp(p->uuid, uuid) == 0) {
+				mdev = mdev_get_device(p);
+				break;
+			}
+		}
+		mutex_unlock(&parent->mdev_list_lock);
+
+		if (mdev)
+			break;
+	}
+	mutex_unlock(&parent_list_lock);
+	return mdev;
+}
+
+/*
+ * Find mediated device from given iommu_group and increment refcount of
+ * mediated device. Caller should call mdev_put_device() when the use of
+ * mdev_device is done.
+ */
+struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
+{
+	struct mdev_device *mdev = NULL, *p;
+	struct parent_device *parent;
+
+	mutex_lock(&parent_list_lock);
+	list_for_each_entry(parent, &parent_list, next) {
+		mutex_lock(&parent->mdev_list_lock);
+		list_for_each_entry(p, &parent->mdev_list, next) {
+			if (!p->group)
+				continue;
+
+			if (iommu_group_id(p->group) == iommu_group_id(group)) {
+				mdev = mdev_get_device(p);
+				break;
+			}
+		}
+		mutex_unlock(&parent->mdev_list_lock);
+
+		if (mdev)
+			break;
+	}
+	mutex_unlock(&parent_list_lock);
+	return mdev;
+}
+EXPORT_SYMBOL(mdev_get_device_by_group);
+
+/*
+ * mdev_register_device : Register a parent device
+ * @dev: device structure representing the parent device.
+ * @ops: parent device operation structure to be registered.
+ *
+ * Add the device to the list of registered parent devices.
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_device(struct device *dev, const struct parent_ops *ops)
+{
+	int ret = 0;
+	struct parent_device *parent;
+
+	if (!dev || !ops)
+		return -EINVAL;
+
+	/* check for mandatory ops */
+	if (!ops->create || !ops->destroy)
+		return -EINVAL;
+
+	mutex_lock(&parent_list_lock);
+
+	/* Check for duplicate */
+	parent = find_parent_device(dev);
+	if (parent) {
+		ret = -EEXIST;
+		goto add_dev_err;
+	}
+
+	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
+	if (!parent) {
+		ret = -ENOMEM;
+		goto add_dev_err;
+	}
+
+	kref_init(&parent->ref);
+	list_add(&parent->next, &parent_list);
+
+	parent->dev = dev;
+	parent->ops = ops;
+	mutex_init(&parent->mdev_list_lock);
+	INIT_LIST_HEAD(&parent->mdev_list);
+	init_waitqueue_head(&parent->release_done);
+	mutex_unlock(&parent_list_lock);
+
+	ret = mdev_create_sysfs_files(dev);
+	if (ret)
+		goto add_sysfs_error;
+
+	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
+	if (ret)
+		goto add_group_error;
+
+	dev_info(dev, "MDEV: Registered\n");
+	return 0;
+
+add_group_error:
+	mdev_remove_sysfs_files(dev);
+add_sysfs_error:
+	mutex_lock(&parent_list_lock);
+	list_del(&parent->next);
+	mutex_unlock(&parent_list_lock);
+	mdev_put_parent(parent);
+	return ret;
+
+add_dev_err:
+	mutex_unlock(&parent_list_lock);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_register_device);
+
+/*
+ * mdev_unregister_device : Unregister a parent device
+ * @dev: device structure representing the parent device.
+ *
+ * Remove the device from the list of registered parent devices and destroy
+ * the mediated devices that were created for it, waiting until all of them
+ * are released.
+ */
+void mdev_unregister_device(struct device *dev)
+{
+	struct parent_device *parent;
+	struct mdev_device *mdev, *n;
+	int ret;
+
+	mutex_lock(&parent_list_lock);
+	parent = find_parent_device(dev);
+
+	if (!parent) {
+		mutex_unlock(&parent_list_lock);
+		return;
+	}
+	dev_info(dev, "MDEV: Unregistering\n");
+
+	/*
+	 * Remove parent from the list and remove create and destroy sysfs
+	 * files so that no new mediated device could be created for this parent
+	 */
+	list_del(&parent->next);
+	mdev_remove_sysfs_files(dev);
+	mutex_unlock(&parent_list_lock);
+
+	mdev_remove_attribute_group(dev,
+				    parent->ops->dev_attr_groups);
+
+	mutex_lock(&parent->mdev_list_lock);
+	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
+		mdev_device_destroy_ops(mdev, true);
+		mutex_unlock(&parent->mdev_list_lock);
+		mdev_put_device(mdev);
+		mutex_lock(&parent->mdev_list_lock);
+	}
+	mutex_unlock(&parent->mdev_list_lock);
+
+	do {
+		ret = wait_event_interruptible_timeout(parent->release_done,
+				list_empty(&parent->mdev_list), HZ * 10);
+		if (ret == -ERESTARTSYS) {
+			dev_warn(dev, "Mediated devices are in use, task \"%s\" (%d) blocked until all are released",
+				 current->comm, task_pid_nr(current));
+		}
+	} while (ret <= 0);
+
+	mdev_put_parent(parent);
+}
+EXPORT_SYMBOL(mdev_unregister_device);
+
+/*
+ * Functions required for mdev_sysfs
+ */
+static void mdev_device_release(struct device *dev)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	dev_dbg(&mdev->dev, "MDEV: destroying\n");
+	kfree(mdev);
+}
+
+int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
+		       char *mdev_params)
+{
+	int ret;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	parent = mdev_get_parent_by_dev(dev);
+	if (!parent)
+		return -EINVAL;
+
+	mutex_lock(&parent->mdev_list_lock);
+	/* Check for duplicate */
+	mdev = find_mdev_device(parent, uuid, instance);
+	if (mdev) {
+		ret = -EEXIST;
+		goto create_err;
+	}
+
+	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
+	if (!mdev) {
+		ret = -ENOMEM;
+		goto create_err;
+	}
+
+	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
+	mdev->instance = instance;
+	mdev->parent = parent;
+	kref_init(&mdev->ref);
+
+	mdev->dev.parent  = dev;
+	mdev->dev.bus     = &mdev_bus_type;
+	mdev->dev.release = mdev_device_release;
+	dev_set_name(&mdev->dev, "%pUl-%d", uuid.b, instance);
+
+	ret = device_register(&mdev->dev);
+	if (ret) {
+		put_device(&mdev->dev);
+		goto create_err;
+	}
+
+	ret = mdev_device_create_ops(mdev, mdev_params);
+	if (ret)
+		goto create_failed;
+
+	list_add(&mdev->next, &parent->mdev_list);
+	mutex_unlock(&parent->mdev_list_lock);
+
+	dev_dbg(&mdev->dev, "MDEV: created\n");
+
+	return ret;
+
+create_failed:
+	device_unregister(&mdev->dev);
+
+create_err:
+	mutex_unlock(&parent->mdev_list_lock);
+	mdev_put_parent(parent);
+	return ret;
+}
+
+int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance)
+{
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	int ret;
+
+	parent = mdev_get_parent_by_dev(dev);
+	if (!parent)
+		return -EINVAL;
+
+	mutex_lock(&parent->mdev_list_lock);
+	mdev = find_mdev_device(parent, uuid, instance);
+	if (!mdev) {
+		ret = -EINVAL;
+		goto destroy_err;
+	}
+
+	ret = mdev_device_destroy_ops(mdev, false);
+	if (ret)
+		goto destroy_err;
+
+	mutex_unlock(&parent->mdev_list_lock);
+	mdev_put_device(mdev);
+
+	mdev_put_parent(parent);
+	return ret;
+
+destroy_err:
+	mutex_unlock(&parent->mdev_list_lock);
+	mdev_put_parent(parent);
+	return ret;
+}
+
+int mdev_device_invalidate_mapping(struct mdev_device *mdev,
+				   unsigned long addr, unsigned long size)
+{
+	int ret = -EINVAL;
+	struct mdev_phys_mapping *phys_mappings;
+	struct addr_desc *addr_desc;
+
+	if (!mdev || !mdev->phys_mappings.mapping)
+		return ret;
+
+	phys_mappings = &mdev->phys_mappings;
+
+	mutex_lock(&phys_mappings->addr_desc_list_lock);
+
+	list_for_each_entry(addr_desc, &phys_mappings->addr_desc_list, next) {
+
+		if ((addr >= addr_desc->start) &&
+		    (addr + size <= addr_desc->start + addr_desc->size)) {
+			unmap_mapping_range(phys_mappings->mapping,
+					    addr, size, 0);
+			ret = 0;
+			goto unlock_exit;
+		}
+	}
+
+unlock_exit:
+	mutex_unlock(&phys_mappings->addr_desc_list_lock);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_device_invalidate_mapping);
+
+/*
+ * Add a physical mapping entry for the mediated device, rejecting ranges
+ * that overlap an existing entry.
+ */
+int mdev_add_phys_mapping(struct mdev_device *mdev,
+			  struct address_space *mapping,
+			  unsigned long addr, unsigned long size)
+{
+	struct mdev_phys_mapping *phys_mappings;
+	struct addr_desc *addr_desc, *new_addr_desc;
+	int ret = 0;
+
+	if (!mdev)
+		return -EINVAL;
+
+	phys_mappings = &mdev->phys_mappings;
+	if (phys_mappings->mapping && (mapping != phys_mappings->mapping))
+		return -EINVAL;
+
+	if (!phys_mappings->mapping) {
+		phys_mappings->mapping = mapping;
+		mutex_init(&phys_mappings->addr_desc_list_lock);
+		INIT_LIST_HEAD(&phys_mappings->addr_desc_list);
+	}
+
+	mutex_lock(&phys_mappings->addr_desc_list_lock);
+
+	list_for_each_entry(addr_desc, &phys_mappings->addr_desc_list, next) {
+		if ((addr + size <= addr_desc->start) ||
+		    (addr_desc->start + addr_desc->size <= addr))
+			continue;
+
+		/* Ranges must not overlap an existing entry */
+		ret = -EINVAL;
+		goto mapping_exit;
+	}
+
+	/* add the new entry to the list */
+	new_addr_desc = kzalloc(sizeof(*new_addr_desc), GFP_KERNEL);
+
+	if (!new_addr_desc) {
+		ret = -ENOMEM;
+		goto mapping_exit;
+	}
+
+	new_addr_desc->start = addr;
+	new_addr_desc->size = size;
+	list_add(&new_addr_desc->next, &phys_mappings->addr_desc_list);
+
+mapping_exit:
+	mutex_unlock(&phys_mappings->addr_desc_list_lock);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_add_phys_mapping);
+
+void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr)
+{
+	struct mdev_phys_mapping *phys_mappings;
+	struct addr_desc *addr_desc;
+
+	if (!mdev)
+		return;
+
+	phys_mappings = &mdev->phys_mappings;
+
+	mutex_lock(&phys_mappings->addr_desc_list_lock);
+	list_for_each_entry(addr_desc, &phys_mappings->addr_desc_list, next) {
+		if (addr_desc->start == addr) {
+			list_del(&addr_desc->next);
+			kfree(addr_desc);
+			break;
+		}
+	}
+	mutex_unlock(&phys_mappings->addr_desc_list_lock);
+}
+EXPORT_SYMBOL(mdev_del_phys_mapping);
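For context, the three exports above are intended to work together in a vendor driver's mmap path. A hedged sketch of how they might be paired — all names other than the mdev_* helpers are illustrative, not part of this patch:

```c
/* Illustrative vendor-driver fault handler; the vendor_* names and the
 * whole-VMA tracking granularity are assumptions for this sketch. */
static int vendor_mmap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct mdev_device *mdev = vma->vm_private_data;
	unsigned long addr = vma->vm_start;
	unsigned long size = vma->vm_end - vma->vm_start;

	/* Record the mapping so it can be invalidated later */
	if (mdev_add_phys_mapping(mdev, vma->vm_file->f_mapping, addr, size))
		return VM_FAULT_SIGBUS;

	/* ... remap_pfn_range() the backing pfns here ... */
	return VM_FAULT_NOPAGE;
}

/* On teardown: zap the user mapping, then drop the tracking entry */
static void vendor_teardown_mapping(struct mdev_device *mdev,
				    unsigned long addr, unsigned long size)
{
	mdev_device_invalidate_mapping(mdev, addr, size);
	mdev_del_phys_mapping(mdev, addr);
}
```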
+
+void mdev_device_supported_config(struct device *dev, char *str)
+{
+	struct parent_device *parent;
+
+	parent = mdev_get_parent_by_dev(dev);
+
+	if (parent) {
+		if (parent->ops->supported_config)
+			parent->ops->supported_config(parent->dev, str);
+		mdev_put_parent(parent);
+	}
+}
+
+int mdev_device_start(uuid_le uuid)
+{
+	int ret = 0;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	mdev = mdev_get_first_device_by_uuid(uuid);
+	if (!mdev)
+		return -EINVAL;
+
+	parent = mdev->parent;
+
+	if (parent->ops->start)
+		ret = parent->ops->start(mdev->uuid);
+
+	if (ret)
+		pr_err("mdev_start failed %d\n", ret);
+	else
+		kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
+
+	mdev_put_device(mdev);
+
+	return ret;
+}
+
+int mdev_device_stop(uuid_le uuid)
+{
+	int ret = 0;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	mdev = mdev_get_first_device_by_uuid(uuid);
+	if (!mdev)
+		return -EINVAL;
+
+	parent = mdev->parent;
+
+	if (parent->ops->stop)
+		ret = parent->ops->stop(mdev->uuid);
+
+	if (ret)
+		pr_err("mdev stop failed %d\n", ret);
+	else
+		kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
+
+	mdev_put_device(mdev);
+	return ret;
+}
+
+static struct class mdev_class = {
+	.name		= MDEV_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= mdev_class_attrs,
+};
+
+static int __init mdev_init(void)
+{
+	int ret;
+
+	ret = class_register(&mdev_class);
+	if (ret) {
+		pr_err("Failed to register mdev class\n");
+		return ret;
+	}
+
+	ret = mdev_bus_register();
+	if (ret) {
+		pr_err("Failed to register mdev bus\n");
+		class_unregister(&mdev_class);
+		return ret;
+	}
+
+	return ret;
+}
+
+static void __exit mdev_exit(void)
+{
+	mdev_bus_unregister();
+	class_unregister(&mdev_class);
+}
+
+module_init(mdev_init)
+module_exit(mdev_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
new file mode 100644
index 000000000000..00680bd06224
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_driver.c
@@ -0,0 +1,142 @@
+/*
+ * MDEV driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+static int mdev_attach_iommu(struct mdev_device *mdev)
+{
+	int ret;
+	struct iommu_group *group;
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
+		return PTR_ERR(group);
+	}
+
+	ret = iommu_group_add_device(group, &mdev->dev);
+	if (ret) {
+		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
+		goto attach_fail;
+	}
+
+	mdev->group = group;
+
+	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
+				 iommu_group_id(group));
+attach_fail:
+	iommu_group_put(group);
+	return ret;
+}
+
+static void mdev_detach_iommu(struct mdev_device *mdev)
+{
+	iommu_group_remove_device(&mdev->dev);
+	mdev->group = NULL;
+	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
+}
+
+static int mdev_probe(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	ret = mdev_attach_iommu(mdev);
+	if (ret) {
+		dev_err(dev, "Failed to attach IOMMU\n");
+		return ret;
+	}
+
+	if (drv && drv->probe)
+		ret = drv->probe(dev);
+
+	if (ret)
+		mdev_detach_iommu(mdev);
+
+	return ret;
+}
+
+static int mdev_remove(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	if (drv && drv->remove)
+		drv->remove(dev);
+
+	mdev_detach_iommu(mdev);
+
+	return 0;
+}
+
+static int mdev_match(struct device *dev, struct device_driver *driver)
+{
+	struct mdev_driver *drv = to_mdev_driver(driver);
+
+	if (drv && drv->match)
+		return drv->match(dev);
+
+	return 0;
+}
+
+struct bus_type mdev_bus_type = {
+	.name		= "mdev",
+	.match		= mdev_match,
+	.probe		= mdev_probe,
+	.remove		= mdev_remove,
+};
+EXPORT_SYMBOL_GPL(mdev_bus_type);
+
+/*
+ * mdev_register_driver - register a new MDEV driver
+ * @drv: the driver to register
+ * @owner: module owner of driver to be registered
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &mdev_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_register_driver);
+
+/*
+ * mdev_unregister_driver - unregister MDEV driver
+ * @drv: the driver to unregister
+ *
+ */
+void mdev_unregister_driver(struct mdev_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_unregister_driver);
+
+int mdev_bus_register(void)
+{
+	return bus_register(&mdev_bus_type);
+}
+
+void mdev_bus_unregister(void)
+{
+	bus_unregister(&mdev_bus_type);
+}
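A hedged sketch of a client of the bus driver above, showing how an mdev driver would hook into mdev_register_driver(); the sample_* names are hypothetical:

```c
#include <linux/module.h>
#include <linux/mdev.h>

/* Illustrative mdev driver; probe/remove bodies are placeholders. */
static int sample_probe(struct device *dev)
{
	/* Bind per-device driver state to the mediated device here */
	return 0;
}

static void sample_remove(struct device *dev)
{
	/* Tear down what probe set up */
}

static struct mdev_driver sample_mdev_driver = {
	.name	= "sample_mdev",
	.probe	= sample_probe,
	.remove	= sample_remove,
};

static int __init sample_init(void)
{
	/* mdev_register_driver() fills in driver.bus and driver.owner */
	return mdev_register_driver(&sample_mdev_driver, THIS_MODULE);
}

static void __exit sample_exit(void)
{
	mdev_unregister_driver(&sample_mdev_driver);
}

module_init(sample_init);
module_exit(sample_exit);
MODULE_LICENSE("GPL");
```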
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
new file mode 100644
index 000000000000..ee2db61a8091
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -0,0 +1,33 @@
+/*
+ * Mediated device internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_PRIVATE_H
+#define MDEV_PRIVATE_H
+
+int  mdev_bus_register(void);
+void mdev_bus_unregister(void);
+
+/* Function prototypes for mdev_sysfs */
+
+extern struct class_attribute mdev_class_attrs[];
+
+int  mdev_create_sysfs_files(struct device *dev);
+void mdev_remove_sysfs_files(struct device *dev);
+
+int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
+			char *mdev_params);
+int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
+void mdev_device_supported_config(struct device *dev, char *str);
+int  mdev_device_start(uuid_le uuid);
+int  mdev_device_stop(uuid_le uuid);
+
+#endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
new file mode 100644
index 000000000000..e0457e68cf78
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -0,0 +1,269 @@
+/*
+ * File attributes for Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+/* Prototypes */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static DEVICE_ATTR_RO(mdev_supported_types);
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_create);
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_destroy);
+
+/* Static functions */
+
+
+#define SUPPORTED_TYPE_BUFFER_LENGTH	4096
+
+/* mdev sysfs Functions */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	char *str, *ptr;
+	ssize_t n;
+
+	str = kzalloc(SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	ptr = str;
+	mdev_device_supported_config(dev, str);
+
+	n = sprintf(buf, "%s\n", str);
+	kfree(ptr);
+
+	return n;
+}
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	char *str, *pstr;
+	char *uuid_str, *instance_str, *mdev_params = NULL, *params = NULL;
+	uuid_le uuid;
+	uint32_t instance;
+	int ret;
+
+	pstr = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_create: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (!str) {
+		pr_err("mdev_create: mdev instance not present %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	instance_str = strsep(&str, ":");
+	if (!instance_str) {
+		pr_err("mdev_create: Empty instance string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	ret = kstrtouint(instance_str, 0, &instance);
+	if (ret) {
+		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
+		goto create_error;
+	}
+
+	if (str)
+		params = mdev_params = kstrdup(str, GFP_KERNEL);
+
+	ret = uuid_le_to_bin(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_create: UUID parse error %s\n", buf);
+		goto create_error;
+	}
+
+	ret = mdev_device_create(dev, uuid, instance, mdev_params);
+	if (ret)
+		pr_err("mdev_create: Failed to create mdev device\n");
+	else
+		ret = count;
+
+create_error:
+	kfree(params);
+	kfree(pstr);
+	return ret;
+}
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	char *uuid_str, *str, *pstr;
+	uuid_le uuid;
+	unsigned int instance;
+	int ret;
+
+	str = pstr = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	if (!str) {
+		pr_err("mdev_destroy: instance not specified %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	ret = kstrtouint(str, 0, &instance);
+	if (ret) {
+		pr_err("mdev_destroy: instance parsing error %s\n", buf);
+		goto destroy_error;
+	}
+
+	ret = uuid_le_to_bin(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_destroy: UUID parse error %s\n", buf);
+		goto destroy_error;
+	}
+
+	ret = mdev_device_destroy(dev, uuid, instance);
+	if (ret == 0)
+		ret = count;
+
+destroy_error:
+	kfree(pstr);
+	return ret;
+}
+
+ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
+			 const char *buf, size_t count)
+{
+	char *uuid_str, *ptr;
+	uuid_le uuid;
+	int ret;
+
+	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_le_to_bin(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_start: UUID parse error %s\n", buf);
+		goto start_error;
+	}
+
+	ret = mdev_device_start(uuid);
+	if (ret == 0)
+		ret = count;
+
+start_error:
+	kfree(ptr);
+	return ret;
+}
+
+ssize_t mdev_stop_store(struct class *class, struct class_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *uuid_str, *ptr;
+	uuid_le uuid;
+	int ret;
+
+	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_le_to_bin(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_stop: UUID parse error %s\n", buf);
+		goto stop_error;
+	}
+
+	ret = mdev_device_stop(uuid);
+	if (ret == 0)
+		ret = count;
+
+stop_error:
+	kfree(ptr);
+	return ret;
+}
+
+struct class_attribute mdev_class_attrs[] = {
+	__ATTR_WO(mdev_start),
+	__ATTR_WO(mdev_stop),
+	__ATTR_NULL
+};
+
+int mdev_create_sysfs_files(struct device *dev)
+{
+	int ret;
+
+	ret = sysfs_create_file(&dev->kobj,
+				&dev_attr_mdev_supported_types.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_supported_types sysfs entry\n");
+		return ret;
+	}
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_create sysfs entry\n");
+		goto create_sysfs_failed;
+	}
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_destroy sysfs entry\n");
+		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+		goto create_sysfs_failed;
+	}
+
+	return 0;
+
+create_sysfs_failed:
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	return ret;
+}
+
+void mdev_remove_sysfs_files(struct device *dev)
+{
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+}
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
new file mode 100644
index 000000000000..0b41f301a9b7
--- /dev/null
+++ b/include/linux/mdev.h
@@ -0,0 +1,236 @@
+/*
+ * Mediated device definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_H
+#define MDEV_H
+
+#include <uapi/linux/vfio.h>
+
+struct parent_device;
+
+/*
+ * Mediated device
+ */
+
+struct addr_desc {
+	unsigned long start;
+	unsigned long size;
+	struct list_head next;
+};
+
+struct mdev_phys_mapping {
+	struct address_space *mapping;
+	struct list_head addr_desc_list;
+	struct mutex addr_desc_list_lock;
+};
+
+struct mdev_device {
+	struct device		dev;
+	struct parent_device	*parent;
+	struct iommu_group	*group;
+	uuid_le			uuid;
+	uint32_t		instance;
+	void			*driver_data;
+
+	/* internal only */
+	struct kref		ref;
+	struct list_head	next;
+
+	struct mdev_phys_mapping phys_mappings;
+};
+
+
+/**
+ * struct parent_ops - Structure to be registered for each parent device to
+ * register the device to mdev module.
+ *
+ * @owner:		The module owner.
+ * @dev_attr_groups:	Default attributes of the parent device.
+ * @mdev_attr_groups:	Default attributes of the mediated device.
+ * @supported_config:	Called to get information about supported types.
+ *			@dev : device structure of parent device.
+ *			@config: should return string listing supported config
+ *			Returns integer: success (0) or error (< 0)
+ * @create:		Called to allocate basic resources in parent device's
+ *			driver for a particular mediated device. It is
+ *			mandatory to provide create ops.
+ *			@mdev: mdev_device structure of the mediated device
+ *			       that is being created
+ *			@mdev_params: extra parameters required by parent
+ *			device's driver.
+ *			Returns integer: success (0) or error (< 0)
+ * @destroy:		Called to free resources in parent device's driver for
+ *			a mediated device instance. It is mandatory to provide
+ *			destroy ops.
+ *			@mdev: mdev_device structure which is being
+ *			       destroyed
+ *			Returns integer: success (0) or error (< 0)
+ *			If destroy() is called while the VMM is running, the
+ *			mdev is being hot-unplugged. Return an error in that
+ *			case if the driver doesn't support mediated device
+ *			hot-unplug.
+ * @reset:		Called to reset mediated device.
+ *			@mdev: mdev_device device structure
+ *			Returns integer: success (0) or error (< 0)
+ * @start:		Called to initiate mediated device initialization
+ *			process in parent device's driver before VMM starts.
+ *			@uuid: UUID
+ *			Returns integer: success (0) or error (< 0)
+ * @stop:		Called to teardown mediated device related resources
+ *			@uuid: UUID
+ *			Returns integer: success (0) or error (< 0)
+ * @read:		Read emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: read buffer
+ *			@count: number of bytes to read
+ *			@pos: address.
+ *			Returns number of bytes read on success, or an error.
+ * @write:		Write emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: write buffer
+ *			@count: number of bytes to be written
+ *			@pos: address.
+ *			Returns number of bytes written on success, or an error.
+ * @set_irqs:		Called to send interrupt configuration information
+ *			that the VMM sets.
+ *			@mdev: mediated device structure
+ *			@flags, index, start, count and *data : same as that of
+ *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
+ * @get_region_info:	Called to get VFIO region size and flags of mediated
+ *			device.
+ *			@mdev: mediated device structure
+ *			@region_index: VFIO region index
+ *			@region_info: output, returns size and flags of
+ *				      requested region.
+ *			Returns integer: success (0) or error (< 0)
+ * @validate_map_request: Validate remap pfn request
+ *			@mdev: mediated device structure
+ *			@pos: address
+ *			@virtaddr: target user address to start at. Vendor
+ *				   driver can change if required.
+ *			@pfn: parent address of kernel memory, vendor driver
+ *			      can change if required.
+ *			@size: size of map area, vendor driver can change the
+ *			       size of map area if desired.
+ *			@prot: page protection flags for this mapping, vendor
+ *			       driver can change, if required.
+ *			Returns integer: success (0) or error (< 0)
+ *
+ * A parent device that supports mediated devices should be registered with
+ * the mdev module along with its parent_ops structure.
+ */
+
+struct parent_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **mdev_attr_groups;
+
+	int	(*supported_config)(struct device *dev, char *config);
+	int     (*create)(struct mdev_device *mdev, char *mdev_params);
+	int     (*destroy)(struct mdev_device *mdev);
+	int     (*reset)(struct mdev_device *mdev);
+	int     (*start)(uuid_le uuid);
+	int     (*stop)(uuid_le uuid);
+	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
+			loff_t pos);
+	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
+			 loff_t pos);
+	int     (*set_irqs)(struct mdev_device *mdev, uint32_t flags,
+			    unsigned int index, unsigned int start,
+			    unsigned int count, void *data);
+	int	(*get_region_info)(struct mdev_device *mdev, int region_index,
+				   struct vfio_region_info *region_info);
+	int	(*validate_map_request)(struct mdev_device *mdev, loff_t pos,
+					u64 *virtaddr, unsigned long *pfn,
+					unsigned long *size, pgprot_t *prot);
+};
+
+/*
+ * Parent Device
+ */
+
+struct parent_device {
+	struct device		*dev;
+	const struct parent_ops	*ops;
+
+	/* internal */
+	struct kref		ref;
+	struct list_head	next;
+	struct list_head	mdev_list;
+	struct mutex		mdev_list_lock;
+	wait_queue_head_t	release_done;
+};
+
+/**
+ * struct mdev_driver - Mediated device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @match: called when a new device or driver is added for this bus. Return 1
+ *	   if the given device can be handled by the given driver, zero
+ *	   otherwise.
+ * @driver: device driver structure
+ *
+ **/
+struct mdev_driver {
+	const char *name;
+	int  (*probe)(struct device *dev);
+	void (*remove)(struct device *dev);
+	int  (*match)(struct device *dev);
+	struct device_driver driver;
+};
+
+static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
+}
+
+static inline struct mdev_device *to_mdev_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
+}
+
+static inline void *mdev_get_drvdata(struct mdev_device *mdev)
+{
+	return mdev->driver_data;
+}
+
+static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
+{
+	mdev->driver_data = data;
+}
+
+extern struct bus_type mdev_bus_type;
+
+#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
+
+extern int  mdev_register_device(struct device *dev,
+				 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
+extern void mdev_put_device(struct mdev_device *mdev);
+
+extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
+
+extern int mdev_device_invalidate_mapping(struct mdev_device *mdev,
+					unsigned long addr, unsigned long size);
+
+extern int mdev_add_phys_mapping(struct mdev_device *mdev,
+				 struct address_space *mapping,
+				 unsigned long addr, unsigned long size);
+
+
+extern void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr);
+#endif /* MDEV_H */
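A hedged sketch of how a vendor driver would use the header above to register with the mdev core; the sample_* callbacks are hypothetical, and per the parent_ops contract only create and destroy are mandatory:

```c
#include <linux/module.h>
#include <linux/mdev.h>

/* Illustrative vendor callbacks; bodies are placeholders. */
static int sample_create(struct mdev_device *mdev, char *mdev_params)
{
	/* Allocate per-mdev state and stash it on the device */
	mdev_set_drvdata(mdev, NULL);
	return 0;
}

static int sample_destroy(struct mdev_device *mdev)
{
	/* Returning an error here rejects hot-unplug while in use */
	return 0;
}

static const struct parent_ops sample_parent_ops = {
	.owner	 = THIS_MODULE,
	.create	 = sample_create,
	.destroy = sample_destroy,
};

/* Typically called from the physical device's probe routine:
 *	mdev_register_device(&pdev->dev, &sample_parent_ops);
 * and from its remove routine:
 *	mdev_unregister_device(&pdev->dev);
 */
```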
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
  2016-08-03 19:03 ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-03 19:03   ` Kirti Wankhede
  -1 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-03 19:03 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, kevin.tian, qemu-devel, Kirti Wankhede, bjsdjshi

MPCI VFIO driver registers with the MDEV core driver. The MDEV core driver
creates a mediated device and calls the probe routine of the MPCI VFIO
driver, which then adds the mediated device to the VFIO core module.
The main aim of this module is to manage all VFIO APIs for each mediated
PCI device. Those are:
- Get region information from the vendor driver.
- Trap and emulate PCI config space and BAR regions.
- Send interrupt configuration information to the vendor driver.
- Reset the device.
- mmap mappable regions with invalidated mappings and fault on access to
  remap pfns. If validate_map_request() is not provided by the vendor
  driver, the fault handler maps the physical device's region.
- Add and delete a mappable region's physical mappings to/from the mdev's
  mapping tracking logic.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
---
 drivers/vfio/mdev/Kconfig           |   6 +
 drivers/vfio/mdev/Makefile          |   1 +
 drivers/vfio/mdev/vfio_mpci.c       | 536 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
 include/linux/vfio.h                |   7 +
 6 files changed, 551 insertions(+), 6 deletions(-)
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index a34fbc66f92f..431ed595c8da 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,4 +9,10 @@ config VFIO_MDEV
 
        If you don't know what to do here, say N.
 
+config VFIO_MPCI
+    tristate "VFIO support for Mediated PCI devices"
+    depends on VFIO && PCI && VFIO_MDEV
+    default n
+    help
+        VFIO based driver for mediated PCI devices.
 
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 56a75e689582..264fb03dd0e3 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,4 +2,5 @@
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
 obj-$(CONFIG_VFIO_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
 
diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
new file mode 100644
index 000000000000..9da94b76ae3e
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mpci.c
@@ -0,0 +1,536 @@
+/*
+ * VFIO based Mediated PCI device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC     "VFIO based Mediated PCI device driver"
+
+struct vfio_mdev {
+	struct iommu_group *group;
+	struct mdev_device *mdev;
+	int		    refcnt;
+	struct vfio_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
+	struct mutex	    vfio_mdev_lock;
+};
+
+static int vfio_mpci_open(void *device_data)
+{
+	int ret = 0;
+	struct vfio_mdev *vmdev = device_data;
+	struct parent_device *parent = vmdev->mdev->parent;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vmdev->vfio_mdev_lock);
+	if (!vmdev->refcnt && parent->ops->get_region_info) {
+		int index;
+
+		for (index = VFIO_PCI_BAR0_REGION_INDEX;
+		     index < VFIO_PCI_NUM_REGIONS; index++) {
+			ret = parent->ops->get_region_info(vmdev->mdev, index,
+					      &vmdev->vfio_region_info[index]);
+			if (ret)
+				goto open_error;
+		}
+	}
+
+	vmdev->refcnt++;
+
+open_error:
+	mutex_unlock(&vmdev->vfio_mdev_lock);
+	if (ret)
+		module_put(THIS_MODULE);
+
+	return ret;
+}
+
+static void vfio_mpci_close(void *device_data)
+{
+	struct vfio_mdev *vmdev = device_data;
+
+	mutex_lock(&vmdev->vfio_mdev_lock);
+	vmdev->refcnt--;
+	if (!vmdev->refcnt) {
+		memset(&vmdev->vfio_region_info, 0,
+			sizeof(vmdev->vfio_region_info));
+	}
+	mutex_unlock(&vmdev->vfio_mdev_lock);
+	module_put(THIS_MODULE);
+}
+
+static u8 mpci_find_pci_capability(struct mdev_device *mdev, u8 capability)
+{
+	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);
+	struct parent_device *parent = mdev->parent;
+	u16 status;
+	u8  cap_ptr, cap_id = 0xff;
+
+	parent->ops->read(mdev, (char *)&status, sizeof(status),
+			  pos + PCI_STATUS);
+	if (!(status & PCI_STATUS_CAP_LIST))
+		return 0;
+
+	parent->ops->read(mdev, &cap_ptr, sizeof(cap_ptr),
+			  pos + PCI_CAPABILITY_LIST);
+
+	do {
+		cap_ptr &= 0xfc;
+		parent->ops->read(mdev, &cap_id, sizeof(cap_id),
+				  pos + cap_ptr + PCI_CAP_LIST_ID);
+		if (cap_id == capability)
+			return cap_ptr;
+		parent->ops->read(mdev, &cap_ptr, sizeof(cap_ptr),
+				  pos + cap_ptr + PCI_CAP_LIST_NEXT);
+	} while (cap_ptr && cap_id != 0xff);
+
+	return 0;
+}
+
+static int mpci_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
+{
+	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+
+	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
+		u8 pin;
+
+		parent->ops->read(mdev, &pin, sizeof(pin),
+				  pos + PCI_INTERRUPT_PIN);
+		if (IS_ENABLED(CONFIG_VFIO_PCI_INTX) && pin)
+			return 1;
+
+	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
+		u8 cap_ptr;
+		u16 flags;
+
+		cap_ptr = mpci_find_pci_capability(mdev, PCI_CAP_ID_MSI);
+		if (cap_ptr) {
+			parent->ops->read(mdev, (char *)&flags, sizeof(flags),
+					pos + cap_ptr + PCI_MSI_FLAGS);
+			return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
+		}
+	} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
+		u8 cap_ptr;
+		u16 flags;
+
+		cap_ptr = mpci_find_pci_capability(mdev, PCI_CAP_ID_MSIX);
+		if (cap_ptr) {
+			parent->ops->read(mdev, (char *)&flags, sizeof(flags),
+					pos + cap_ptr + PCI_MSIX_FLAGS);
+
+			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
+		}
+	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
+		u8 cap_ptr;
+
+		cap_ptr = mpci_find_pci_capability(mdev, PCI_CAP_ID_EXP);
+		if (cap_ptr)
+			return 1;
+	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
+		return 1;
+	}
+
+	return 0;
+}
+
+static long vfio_mpci_unlocked_ioctl(void *device_data,
+				     unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_mdev *vmdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+		struct parent_device *parent = vmdev->mdev->parent;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+
+		if (parent->ops->reset)
+			info.flags |= VFIO_DEVICE_FLAGS_RESET;
+
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+	}
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vmdev->vfio_region_info[info.index].size;
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = vmdev->vfio_region_info[info.index].flags;
+			break;
+		case VFIO_PCI_VGA_REGION_INDEX:
+		case VFIO_PCI_ROM_REGION_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+	}
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	{
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
+		case VFIO_PCI_REQ_IRQ_INDEX:
+			break;
+		case VFIO_PCI_MSIX_IRQ_INDEX:
+			/* fall through to return error */
+		default:
+			return -EINVAL;
+		}
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+		info.count = mpci_get_irq_count(vmdev, info.index);
+
+		if (info.count == -1)
+			return -EINVAL;
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+					VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+	}
+	case VFIO_DEVICE_SET_IRQS:
+	{
+		struct vfio_irq_set hdr;
+		struct mdev_device *mdev = vmdev->mdev;
+		struct parent_device *parent = mdev->parent;
+		u8 *data = NULL, *ptr = NULL;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+				  VFIO_IRQ_SET_ACTION_TYPE_MASK))
+			return -EINVAL;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			size_t size;
+			int max = mpci_get_irq_count(vmdev, hdr.index);
+
+			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+				size = sizeof(uint8_t);
+			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+				size = sizeof(int32_t);
+			else
+				return -EINVAL;
+
+			if (hdr.argsz - minsz < hdr.count * size ||
+			    hdr.start >= max || hdr.start + hdr.count > max)
+				return -EINVAL;
+
+			ptr = data = memdup_user((void __user *)(arg + minsz),
+						 hdr.count * size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+		}
+
+		if (parent->ops->set_irqs)
+			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
+						    hdr.start, hdr.count, data);
+
+		kfree(ptr);
+		return ret;
+	}
+	case VFIO_DEVICE_RESET:
+	{
+		struct parent_device *parent = vmdev->mdev->parent;
+
+		if (parent->ops->reset)
+			return parent->ops->reset(vmdev->mdev);
+
+		return -EINVAL;
+	}
+	}
+	return -ENOTTY;
+}
+
+static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	if (!count)
+		return 0;
+
+	if (parent->ops->read) {
+		char *ret_data, *ptr;
+
+		ptr = ret_data = kzalloc(count, GFP_KERNEL);
+
+		if (!ret_data)
+			return -ENOMEM;
+
+		ret = parent->ops->read(mdev, ret_data, count, *ppos);
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret))
+				ret = -EFAULT;
+			else
+				*ppos += ret;
+		}
+		kfree(ptr);
+	}
+
+	return ret;
+}
+
+static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	if (!count)
+		return 0;
+
+	if (parent->ops->write) {
+		char *usr_data, *ptr;
+
+		ptr = usr_data = memdup_user(buf, count);
+		if (IS_ERR(usr_data))
+			return PTR_ERR(usr_data);
+
+		ret = parent->ops->write(mdev, usr_data, count, *ppos);
+
+		if (ret > 0)
+			*ppos += ret;
+
+		kfree(ptr);
+	}
+
+	return ret;
+}
+
+static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	int ret;
+	struct vfio_mdev *vmdev = vma->vm_private_data;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	u64 virtaddr = (u64)vmf->virtual_address;
+	unsigned long req_size, pgoff = 0;
+	pgprot_t pg_prot;
+	unsigned int index;
+
+	if (!vmdev || !vmdev->mdev)
+		return -EINVAL;
+
+	mdev = vmdev->mdev;
+	parent  = mdev->parent;
+
+	pg_prot  = vma->vm_page_prot;
+
+	if (parent->ops->validate_map_request) {
+		u64 offset;
+		loff_t pos;
+
+		offset   = virtaddr - vma->vm_start;
+		req_size = vma->vm_end - virtaddr;
+		pos = (vma->vm_pgoff << PAGE_SHIFT) + offset;
+
+		ret = parent->ops->validate_map_request(mdev, pos, &virtaddr,
+						&pgoff, &req_size, &pg_prot);
+		if (ret)
+			return ret;
+
+		/*
+		 * Verify pgoff and req_size are valid and virtaddr is within
+		 * vma range
+		 */
+		if (!pgoff || !req_size || (virtaddr < vma->vm_start) ||
+		    ((virtaddr + req_size) > vma->vm_end))
+			return -EINVAL;
+	} else {
+		struct pci_dev *pdev;
+
+		virtaddr = vma->vm_start;
+		req_size = vma->vm_end - vma->vm_start;
+
+		pdev = to_pci_dev(parent->dev);
+		index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);
+		pgoff = pci_resource_start(pdev, index) >> PAGE_SHIFT;
+	}
+
+	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
+
+	return ret ? VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
+}
+
+static void mdev_dev_mmio_close(struct vm_area_struct *vma)
+{
+	struct vfio_mdev *vmdev = vma->vm_private_data;
+	struct mdev_device *mdev = vmdev->mdev;
+
+	mdev_del_phys_mapping(mdev, vma->vm_pgoff << PAGE_SHIFT);
+}
+
+static const struct vm_operations_struct mdev_dev_mmio_ops = {
+	.fault = mdev_dev_mmio_fault,
+	.close = mdev_dev_mmio_close,
+};
+
+static int vfio_mpci_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	unsigned int index;
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+
+	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+
+	vma->vm_private_data = vmdev;
+	vma->vm_ops = &mdev_dev_mmio_ops;
+
+	return mdev_add_phys_mapping(mdev, vma->vm_file->f_mapping,
+				     vma->vm_pgoff << PAGE_SHIFT,
+				     vma->vm_end - vma->vm_start);
+}
+
+static const struct vfio_device_ops vfio_mpci_dev_ops = {
+	.name		= "vfio-mpci",
+	.open		= vfio_mpci_open,
+	.release	= vfio_mpci_close,
+	.ioctl		= vfio_mpci_unlocked_ioctl,
+	.read		= vfio_mpci_read,
+	.write		= vfio_mpci_write,
+	.mmap		= vfio_mpci_mmap,
+};
+
+int vfio_mpci_probe(struct device *dev)
+{
+	struct vfio_mdev *vmdev;
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
+	if (!vmdev)
+		return -ENOMEM;
+
+	vmdev->mdev = mdev_get_device(mdev);
+	vmdev->group = mdev->group;
+	mutex_init(&vmdev->vfio_mdev_lock);
+
+	ret = vfio_add_group_dev(dev, &vfio_mpci_dev_ops, vmdev);
+	if (ret)
+		kfree(vmdev);
+
+	mdev_put_device(mdev);
+	return ret;
+}
+
+void vfio_mpci_remove(struct device *dev)
+{
+	struct vfio_mdev *vmdev;
+
+	vmdev = vfio_del_group_dev(dev);
+	kfree(vmdev);
+}
+
+int vfio_mpci_match(struct device *dev)
+{
+	if (dev_is_pci(dev->parent))
+		return 1;
+
+	return 0;
+}
+
+struct mdev_driver vfio_mpci_driver = {
+	.name	= "vfio_mpci",
+	.probe	= vfio_mpci_probe,
+	.remove	= vfio_mpci_remove,
+	.match	= vfio_mpci_match,
+};
+
+static int __init vfio_mpci_init(void)
+{
+	return mdev_register_driver(&vfio_mpci_driver, THIS_MODULE);
+}
+
+static void __exit vfio_mpci_exit(void)
+{
+	mdev_unregister_driver(&vfio_mpci_driver);
+}
+
+module_init(vfio_mpci_init)
+module_exit(vfio_mpci_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 8a7d546d18a0..04a450908ffb 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -19,12 +19,6 @@
 #ifndef VFIO_PCI_PRIVATE_H
 #define VFIO_PCI_PRIVATE_H
 
-#define VFIO_PCI_OFFSET_SHIFT   40
-
-#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
-
 /* Special capability IDs predefined access */
 #define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
 #define PCI_CAP_ID_INVALID_VIRT		0xFE	/* default virt access */
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 5ffd1d9ad4bd..5b912be9d9c3 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>
 #include <linux/io.h>
 #include <linux/vgaarb.h>
+#include <linux/vfio.h>
 
 #include "vfio_pci_private.h"
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b1cd34..431b824b0d3e 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -18,6 +18,13 @@
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
 
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
  *
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [Qemu-devel] [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
@ 2016-08-03 19:03   ` Kirti Wankhede
  0 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-03 19:03 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

MPCI VFIO driver registers with MDEV core driver. MDEV core driver creates
mediated device and calls probe routine of MPCI VFIO driver. This driver
adds mediated device to VFIO core module.
Main aim of this module is to manage all VFIO APIs for each mediated PCI
device. Those are:
- get region information from vendor driver.
- trap and emulate PCI config space and BAR region.
- Send interrupt configuration information to vendor driver.
- Device reset
- mmap mappable region with invalidate mapping and fault on access to
  remap pfns. If validate_map_request() is not provided by vendor driver,
  fault handler maps physical devices region.
- Add and delete mappable region's physical mappings to mdev's mapping
  tracking logic.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
---
 drivers/vfio/mdev/Kconfig           |   6 +
 drivers/vfio/mdev/Makefile          |   1 +
 drivers/vfio/mdev/vfio_mpci.c       | 536 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
 include/linux/vfio.h                |   7 +
 6 files changed, 551 insertions(+), 6 deletions(-)
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index a34fbc66f92f..431ed595c8da 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,4 +9,10 @@ config VFIO_MDEV
 
         If you don't know what do here, say N.
 
+config VFIO_MPCI
+    tristate "VFIO support for Mediated PCI devices"
+    depends on VFIO && PCI && VFIO_MDEV
+    default n
+    help
+        VFIO based driver for mediated PCI devices.
 
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 56a75e689582..264fb03dd0e3 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,4 +2,5 @@
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
 obj-$(CONFIG_VFIO_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
 
diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
new file mode 100644
index 000000000000..9da94b76ae3e
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mpci.c
@@ -0,0 +1,536 @@
+/*
+ * VFIO based Mediated PCI device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC     "VFIO based Mediated PCI device driver"
+
+struct vfio_mdev {
+	struct iommu_group *group;
+	struct mdev_device *mdev;
+	int		    refcnt;
+	struct vfio_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
+	struct mutex	    vfio_mdev_lock;
+};
+
+static int vfio_mpci_open(void *device_data)
+{
+	int ret = 0;
+	struct vfio_mdev *vmdev = device_data;
+	struct parent_device *parent = vmdev->mdev->parent;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vmdev->vfio_mdev_lock);
+	if (!vmdev->refcnt && parent->ops->get_region_info) {
+		int index;
+
+		for (index = VFIO_PCI_BAR0_REGION_INDEX;
+		     index < VFIO_PCI_NUM_REGIONS; index++) {
+			ret = parent->ops->get_region_info(vmdev->mdev, index,
+					      &vmdev->vfio_region_info[index]);
+			if (ret)
+				goto open_error;
+		}
+	}
+
+	vmdev->refcnt++;
+
+open_error:
+	mutex_unlock(&vmdev->vfio_mdev_lock);
+	if (ret)
+		module_put(THIS_MODULE);
+
+	return ret;
+}
+
+static void vfio_mpci_close(void *device_data)
+{
+	struct vfio_mdev *vmdev = device_data;
+
+	mutex_lock(&vmdev->vfio_mdev_lock);
+	vmdev->refcnt--;
+	if (!vmdev->refcnt) {
+		memset(&vmdev->vfio_region_info, 0,
+			sizeof(vmdev->vfio_region_info));
+	}
+	mutex_unlock(&vmdev->vfio_mdev_lock);
+	module_put(THIS_MODULE);
+}
+
+static u8 mpci_find_pci_capability(struct mdev_device *mdev, u8 capability)
+{
+	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);
+	struct parent_device *parent = mdev->parent;
+	u16 status;
+	u8  cap_ptr, cap_id = 0xff;
+
+	parent->ops->read(mdev, (char *)&status, sizeof(status),
+			  pos + PCI_STATUS);
+	if (!(status & PCI_STATUS_CAP_LIST))
+		return 0;
+
+	parent->ops->read(mdev, &cap_ptr, sizeof(cap_ptr),
+			  pos + PCI_CAPABILITY_LIST);
+
+	do {
+		cap_ptr &= 0xfc;
+		parent->ops->read(mdev, &cap_id, sizeof(cap_id),
+				  pos + cap_ptr + PCI_CAP_LIST_ID);
+		if (cap_id == capability)
+			return cap_ptr;
+		parent->ops->read(mdev, &cap_ptr, sizeof(cap_ptr),
+				  pos + cap_ptr + PCI_CAP_LIST_NEXT);
+	} while (cap_ptr && cap_id != 0xff);
+
+	return 0;
+}
+
+static int mpci_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
+{
+	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+
+	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
+		u8 pin;
+
+		parent->ops->read(mdev, &pin, sizeof(pin),
+				  pos + PCI_INTERRUPT_PIN);
+		if (IS_ENABLED(CONFIG_VFIO_PCI_INTX) && pin)
+			return 1;
+
+	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
+		u8 cap_ptr;
+		u16 flags;
+
+		cap_ptr = mpci_find_pci_capability(mdev, PCI_CAP_ID_MSI);
+		if (cap_ptr) {
+			parent->ops->read(mdev, (char *)&flags, sizeof(flags),
+					pos + cap_ptr + PCI_MSI_FLAGS);
+			return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
+		}
+	} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
+		u8 cap_ptr;
+		u16 flags;
+
+		cap_ptr = mpci_find_pci_capability(mdev, PCI_CAP_ID_MSIX);
+		if (cap_ptr) {
+			parent->ops->read(mdev, (char *)&flags, sizeof(flags),
+					pos + cap_ptr + PCI_MSIX_FLAGS);
+
+			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
+		}
+	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
+		u8 cap_ptr;
+
+		cap_ptr = mpci_find_pci_capability(mdev, PCI_CAP_ID_EXP);
+		if (cap_ptr)
+			return 1;
+	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
+		return 1;
+	}
+
+	return 0;
+}
+
+static long vfio_mpci_unlocked_ioctl(void *device_data,
+				     unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_mdev *vmdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+		struct parent_device *parent = vmdev->mdev->parent;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+
+		if (parent->ops->reset)
+			info.flags |= VFIO_DEVICE_FLAGS_RESET;
+
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vmdev->vfio_region_info[info.index].size;
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = vmdev->vfio_region_info[info.index].flags;
+			break;
+		case VFIO_PCI_VGA_REGION_INDEX:
+		case VFIO_PCI_ROM_REGION_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	{
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
+		case VFIO_PCI_REQ_IRQ_INDEX:
+			break;
+			/* pass thru to return error */
+		case VFIO_PCI_MSIX_IRQ_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+		info.count = mpci_get_irq_count(vmdev, info.index);
+
+		if (info.count == -1)
+			return -EINVAL;
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+					VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_DEVICE_SET_IRQS:
+	{
+		struct vfio_irq_set hdr;
+		struct mdev_device *mdev = vmdev->mdev;
+		struct parent_device *parent = mdev->parent;
+		u8 *data = NULL, *ptr = NULL;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+				  VFIO_IRQ_SET_ACTION_TYPE_MASK))
+			return -EINVAL;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			size_t size;
+			int max = mpci_get_irq_count(vmdev, hdr.index);
+
+			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+				size = sizeof(uint8_t);
+			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+				size = sizeof(int32_t);
+			else
+				return -EINVAL;
+
+			if (hdr.argsz - minsz < hdr.count * size ||
+			    hdr.start >= max || hdr.start + hdr.count > max)
+				return -EINVAL;
+
+			ptr = data = memdup_user((void __user *)(arg + minsz),
+						 hdr.count * size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+		}
+
+		if (parent->ops->set_irqs)
+			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
+						    hdr.start, hdr.count, data);
+
+		kfree(ptr);
+		return ret;
+	}
+	case VFIO_DEVICE_RESET:
+	{
+		struct parent_device *parent = vmdev->mdev->parent;
+
+		if (parent->ops->reset)
+			return parent->ops->reset(vmdev->mdev);
+
+		return -EINVAL;
+	}
+	}
+	return -ENOTTY;
+}
+
+static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	if (!count)
+		return 0;
+
+	if (parent->ops->read) {
+		char *ret_data, *ptr;
+
+		ptr = ret_data = kzalloc(count, GFP_KERNEL);
+
+		if (!ret_data)
+			return  -ENOMEM;
+
+		ret = parent->ops->read(mdev, ret_data, count, *ppos);
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret))
+				ret = -EFAULT;
+			else
+				*ppos += ret;
+		}
+		kfree(ptr);
+	}
+
+	return ret;
+}
+
+static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	if (!count)
+		return 0;
+
+	if (parent->ops->write) {
+		char *usr_data, *ptr;
+
+		ptr = usr_data = memdup_user(buf, count);
+		if (IS_ERR(usr_data))
+			return PTR_ERR(usr_data);
+
+		ret = parent->ops->write(mdev, usr_data, count, *ppos);
+
+		if (ret > 0)
+			*ppos += ret;
+
+		kfree(ptr);
+	}
+
+	return ret;
+}
+
+static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	int ret;
+	struct vfio_mdev *vmdev = vma->vm_private_data;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	u64 virtaddr = (u64)vmf->virtual_address;
+	unsigned long req_size, pgoff = 0;
+	pgprot_t pg_prot;
+	unsigned int index;
+
+	if (!vmdev && !vmdev->mdev)
+		return -EINVAL;
+
+	mdev = vmdev->mdev;
+	parent  = mdev->parent;
+
+	pg_prot  = vma->vm_page_prot;
+
+	if (parent->ops->validate_map_request) {
+		u64 offset;
+		loff_t pos;
+
+		offset   = virtaddr - vma->vm_start;
+		req_size = vma->vm_end - virtaddr;
+		pos = (vma->vm_pgoff << PAGE_SHIFT) + offset;
+
+		ret = parent->ops->validate_map_request(mdev, pos, &virtaddr,
+						&pgoff, &req_size, &pg_prot);
+		if (ret)
+			return ret;
+
+		/*
+		 * Verify pgoff and req_size are valid and virtaddr is within
+		 * vma range
+		 */
+		if (!pgoff || !req_size || (virtaddr < vma->vm_start) ||
+		    ((virtaddr + req_size) >= vma->vm_end))
+			return -EINVAL;
+	} else {
+		struct pci_dev *pdev;
+
+		virtaddr = vma->vm_start;
+		req_size = vma->vm_end - vma->vm_start;
+
+		pdev = to_pci_dev(parent->dev);
+		index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);
+		pgoff = pci_resource_start(pdev, index) >> PAGE_SHIFT;
+	}
+
+	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
+
+	return ret | VM_FAULT_NOPAGE;
+}
+
+void mdev_dev_mmio_close(struct vm_area_struct *vma)
+{
+	struct vfio_mdev *vmdev = vma->vm_private_data;
+	struct mdev_device *mdev = vmdev->mdev;
+
+	mdev_del_phys_mapping(mdev, vma->vm_pgoff << PAGE_SHIFT);
+}
+
+static const struct vm_operations_struct mdev_dev_mmio_ops = {
+	.fault = mdev_dev_mmio_fault,
+	.close = mdev_dev_mmio_close,
+};
+
+static int vfio_mpci_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	unsigned int index;
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+
+	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+
+	vma->vm_private_data = vmdev;
+	vma->vm_ops = &mdev_dev_mmio_ops;
+
+	return mdev_add_phys_mapping(mdev, vma->vm_file->f_mapping,
+				     vma->vm_pgoff << PAGE_SHIFT,
+				     vma->vm_end - vma->vm_start);
+}
+
+static const struct vfio_device_ops vfio_mpci_dev_ops = {
+	.name		= "vfio-mpci",
+	.open		= vfio_mpci_open,
+	.release	= vfio_mpci_close,
+	.ioctl		= vfio_mpci_unlocked_ioctl,
+	.read		= vfio_mpci_read,
+	.write		= vfio_mpci_write,
+	.mmap		= vfio_mpci_mmap,
+};
+
+int vfio_mpci_probe(struct device *dev)
+{
+	struct vfio_mdev *vmdev;
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
+	if (IS_ERR(vmdev))
+		return PTR_ERR(vmdev);
+
+	vmdev->mdev = mdev_get_device(mdev);
+	vmdev->group = mdev->group;
+	mutex_init(&vmdev->vfio_mdev_lock);
+
+	ret = vfio_add_group_dev(dev, &vfio_mpci_dev_ops, vmdev);
+	if (ret)
+		kfree(vmdev);
+
+	mdev_put_device(mdev);
+	return ret;
+}
+
+void vfio_mpci_remove(struct device *dev)
+{
+	struct vfio_mdev *vmdev;
+
+	vmdev = vfio_del_group_dev(dev);
+	kfree(vmdev);
+}
+
+int vfio_mpci_match(struct device *dev)
+{
+	if (dev_is_pci(dev->parent))
+		return 1;
+
+	return 0;
+}
+
+struct mdev_driver vfio_mpci_driver = {
+	.name	= "vfio_mpci",
+	.probe	= vfio_mpci_probe,
+	.remove	= vfio_mpci_remove,
+	.match	= vfio_mpci_match,
+};
+
+static int __init vfio_mpci_init(void)
+{
+	return mdev_register_driver(&vfio_mpci_driver, THIS_MODULE);
+}
+
+static void __exit vfio_mpci_exit(void)
+{
+	mdev_unregister_driver(&vfio_mpci_driver);
+}
+
+module_init(vfio_mpci_init)
+module_exit(vfio_mpci_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 8a7d546d18a0..04a450908ffb 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -19,12 +19,6 @@
 #ifndef VFIO_PCI_PRIVATE_H
 #define VFIO_PCI_PRIVATE_H
 
-#define VFIO_PCI_OFFSET_SHIFT   40
-
-#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
-
 /* Special capability IDs predefined access */
 #define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
 #define PCI_CAP_ID_INVALID_VIRT		0xFE	/* default virt access */
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 5ffd1d9ad4bd..5b912be9d9c3 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>
 #include <linux/io.h>
 #include <linux/vgaarb.h>
+#include <linux/vfio.h>
 
 #include "vfio_pci_private.h"
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b1cd34..431b824b0d3e 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -18,6 +18,13 @@
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
 
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
  *
-- 
2.7.0


* [PATCH v6 3/4] vfio iommu: Add support for mediated devices
  2016-08-03 19:03 ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-03 19:03   ` Kirti Wankhede
  -1 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-03 19:03 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

VFIO IOMMU drivers are designed for devices that are IOMMU capable. A
mediated device only uses the IOMMU APIs; the underlying hardware can be
managed by an IOMMU domain.

Aim of this change is:
- To use most of the code of TYPE1 IOMMU driver for mediated devices
- To support direct assigned device and mediated device in single module

Added two new callback functions to struct vfio_iommu_driver_ops. A backend
IOMMU module that supports pinning and unpinning pages for mdev devices
should provide these functions.
Added APIs for pinning and unpinning pages to the VFIO module. These call
back into the backend IOMMU module to actually pin and unpin pages.

This change adds pin and unpin support for mediated devices to the TYPE1
IOMMU backend module. More details:
- When the iommu_group of a mediated device is attached, the task structure
  is cached; it is used later for pinning pages and for page accounting.
- Pinned pages are tracked for the mediated domain. This data is used to
  verify unpinning requests and to unpin any remaining pages on detach.
- The existing mechanism is used for page accounting. If an IOMMU capable
  domain exists in the container, all pages are already pinned and
  accounted. Accounting for an mdev device is done only if there is no
  IOMMU capable domain in the container.

Tested by assigning the below combinations of devices to a single VM:
- GPU pass-through only
- vGPU device only
- One GPU pass-through and one vGPU device
- Two GPU pass-throughs

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
---
 drivers/vfio/vfio.c             |  82 +++++++
 drivers/vfio/vfio_iommu_type1.c | 499 ++++++++++++++++++++++++++++++++++++----
 include/linux/vfio.h            |  13 +-
 3 files changed, 546 insertions(+), 48 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6fd6fa5469de..1f87e3a30d24 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1782,6 +1782,88 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
 }
 EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
 
+/*
+ * Pin a set of guest PFNs and return their associated host PFNs for mediated
+ * domain only.
+ * @user_pfn [in]: array of user/guest PFNs
+ * @npage [in]: count of array elements
+ * @prot [in] : protection flags
+ * @phys_pfn[out] : array of host PFNs
+ */
+long vfio_pin_pages(struct mdev_device *mdev, unsigned long *user_pfn,
+		    long npage, int prot, unsigned long *phys_pfn)
+{
+	struct vfio_device *device;
+	struct vfio_container *container;
+	struct vfio_iommu_driver *driver;
+	ssize_t ret = -EINVAL;
+
+	if (!mdev || !user_pfn || !phys_pfn)
+		return -EINVAL;
+
+	device = dev_get_drvdata(&mdev->dev);
+
+	if (!device || !device->group)
+		return -EINVAL;
+
+	container = device->group->container;
+
+	if (!container)
+		return -EINVAL;
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->pin_pages))
+		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
+					     npage, prot, phys_pfn);
+
+	up_read(&container->group_lock);
+
+	return ret;
+
+}
+EXPORT_SYMBOL(vfio_pin_pages);
+
+/*
+ * Unpin a set of host PFNs for mediated domain only.
+ * @pfn [in] : array of host PFNs to be unpinned.
+ * @npage [in] : count of elements in the array, that is, the number of pages.
+ */
+long vfio_unpin_pages(struct mdev_device *mdev, unsigned long *pfn, long npage)
+{
+	struct vfio_device *device;
+	struct vfio_container *container;
+	struct vfio_iommu_driver *driver;
+	ssize_t ret = -EINVAL;
+
+	if (!mdev || !pfn)
+		return -EINVAL;
+
+	device = dev_get_drvdata(&mdev->dev);
+
+	if (!device || !device->group)
+		return -EINVAL;
+
+	container = device->group->container;
+
+	if (!container)
+		return -EINVAL;
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->unpin_pages))
+		ret = driver->ops->unpin_pages(container->iommu_data, pfn,
+					       npage);
+
+	up_read(&container->group_lock);
+
+	return ret;
+
+}
+EXPORT_SYMBOL(vfio_unpin_pages);
+
 /**
  * Module/class support
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e93cedb..1f4e24e0debd 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -55,18 +55,26 @@ MODULE_PARM_DESC(disable_hugepages,
 
 struct vfio_iommu {
 	struct list_head	domain_list;
+	struct vfio_domain	*mediated_domain;
 	struct mutex		lock;
 	struct rb_root		dma_list;
 	bool			v2;
 	bool			nesting;
 };
 
+struct mdev_addr_space {
+	struct task_struct	*task;
+	struct rb_root		pfn_list;	/* pinned Host pfn list */
+	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
+};
+
 struct vfio_domain {
 	struct iommu_domain	*domain;
 	struct list_head	next;
 	struct list_head	group_list;
 	int			prot;		/* IOMMU_CACHE */
 	bool			fgsp;		/* Fine-grained super pages */
+	struct mdev_addr_space	*mdev_addr_space;
 };
 
 struct vfio_dma {
@@ -83,6 +91,22 @@ struct vfio_group {
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_pfn {
+	struct rb_node		node;
+	unsigned long		vaddr;		/* virtual addr */
+	dma_addr_t		iova;		/* IOVA */
+	unsigned long		pfn;		/* Host pfn */
+	size_t			prot;
+	atomic_t		ref_count;
+};
+
+
+#define IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)	\
+			 (list_empty(&iommu->domain_list) ? false : true)
+
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +154,84 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
 	rb_erase(&old->node, &iommu->dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
+				      unsigned long pfn)
+{
+	struct rb_node *node;
+	struct vfio_pfn *vpfn, *ret = NULL;
+
+	node = domain->mdev_addr_space->pfn_list.rb_node;
+
+	while (node) {
+		vpfn = rb_entry(node, struct vfio_pfn, node);
+
+		if (pfn < vpfn->pfn)
+			node = node->rb_left;
+		else if (pfn > vpfn->pfn)
+			node = node->rb_right;
+		else {
+			ret = vpfn;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
+{
+	struct rb_node **link, *parent = NULL;
+	struct vfio_pfn *vpfn;
+
+	link = &domain->mdev_addr_space->pfn_list.rb_node;
+	while (*link) {
+		parent = *link;
+		vpfn = rb_entry(parent, struct vfio_pfn, node);
+
+		if (new->pfn < vpfn->pfn)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &domain->mdev_addr_space->pfn_list);
+}
+
+static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
+{
+	rb_erase(&old->node, &domain->mdev_addr_space->pfn_list);
+}
+
+static int vfio_add_to_pfn_list(struct vfio_domain *domain, unsigned long vaddr,
+				dma_addr_t iova, unsigned long pfn, size_t prot)
+{
+	struct vfio_pfn *vpfn;
+
+	vpfn = kzalloc(sizeof(*vpfn), GFP_KERNEL);
+	if (!vpfn)
+		return -ENOMEM;
+
+	vpfn->vaddr = vaddr;
+	vpfn->iova = iova;
+	vpfn->pfn = pfn;
+	vpfn->prot = prot;
+	atomic_set(&vpfn->ref_count, 1);
+	vfio_link_pfn(domain, vpfn);
+	return 0;
+}
+
+static void vfio_remove_from_pfn_list(struct vfio_domain *domain,
+				      struct vfio_pfn *vpfn)
+{
+	vfio_unlink_pfn(domain, vpfn);
+	kfree(vpfn);
+}
+
 struct vwork {
 	struct mm_struct	*mm;
 	long			npage;
@@ -150,17 +252,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
 	kfree(vwork);
 }
 
-static void vfio_lock_acct(long npage)
+static void vfio_lock_acct(struct task_struct *task, long npage)
 {
 	struct vwork *vwork;
 	struct mm_struct *mm;
 
-	if (!current->mm || !npage)
+	if (!task->mm || !npage)
 		return; /* process exited or nothing to do */
 
-	if (down_write_trylock(&current->mm->mmap_sem)) {
-		current->mm->locked_vm += npage;
-		up_write(&current->mm->mmap_sem);
+	if (down_write_trylock(&task->mm->mmap_sem)) {
+		task->mm->locked_vm += npage;
+		up_write(&task->mm->mmap_sem);
 		return;
 	}
 
@@ -172,7 +274,7 @@ static void vfio_lock_acct(long npage)
 	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
 	if (!vwork)
 		return;
-	mm = get_task_mm(current);
+	mm = get_task_mm(task);
 	if (!mm) {
 		kfree(vwork);
 		return;
@@ -228,20 +330,31 @@ static int put_pfn(unsigned long pfn, int prot)
 	return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+			 int prot, unsigned long *pfn)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct mm_struct *local_mm = mm ? mm : current->mm;
 	int ret = -EFAULT;
 
-	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+	if (mm) {
+		down_read(&local_mm->mmap_sem);
+		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
+					!!(prot & IOMMU_WRITE), 0, page, NULL);
+		up_read(&local_mm->mmap_sem);
+	} else
+		ret = get_user_pages_fast(vaddr, 1,
+					  !!(prot & IOMMU_WRITE), page);
+
+	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
 		return 0;
 	}
 
-	down_read(&current->mm->mmap_sem);
+	down_read(&local_mm->mmap_sem);
 
-	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
 		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
@@ -249,7 +362,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
 			ret = 0;
 	}
 
-	up_read(&current->mm->mmap_sem);
+	up_read(&local_mm->mmap_sem);
 
 	return ret;
 }
@@ -259,8 +372,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages(unsigned long vaddr, long npage,
-			   int prot, unsigned long *pfn_base)
+static long __vfio_pin_pages(unsigned long vaddr, long npage,
+			     int prot, unsigned long *pfn_base)
 {
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	bool lock_cap = capable(CAP_IPC_LOCK);
@@ -270,7 +383,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	if (!current->mm)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
+	ret = vaddr_get_pfn(NULL, vaddr, prot, pfn_base);
 	if (ret)
 		return ret;
 
@@ -285,7 +398,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 
 	if (unlikely(disable_hugepages)) {
 		if (!rsvd)
-			vfio_lock_acct(1);
+			vfio_lock_acct(current, 1);
 		return 1;
 	}
 
@@ -293,7 +406,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
 		unsigned long pfn = 0;
 
-		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		ret = vaddr_get_pfn(NULL, vaddr, prot, &pfn);
 		if (ret)
 			break;
 
@@ -313,13 +426,13 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	}
 
 	if (!rsvd)
-		vfio_lock_acct(i);
+		vfio_lock_acct(current, i);
 
 	return i;
 }
 
-static long vfio_unpin_pages(unsigned long pfn, long npage,
-			     int prot, bool do_accounting)
+static long __vfio_unpin_pages(unsigned long pfn, long npage, int prot,
+			       bool do_accounting)
 {
 	unsigned long unlocked = 0;
 	long i;
@@ -328,7 +441,188 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
 		unlocked += put_pfn(pfn++, prot);
 
 	if (do_accounting)
-		vfio_lock_acct(-unlocked);
+		vfio_lock_acct(current, -unlocked);
+	return unlocked;
+}
+
+static long __vfio_pin_pages_for_mdev(struct vfio_domain *domain,
+				      unsigned long vaddr, int prot,
+				      unsigned long *pfn_base,
+				      bool do_accounting)
+{
+	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	bool lock_cap = capable(CAP_IPC_LOCK);
+	long ret;
+	bool rsvd;
+	struct task_struct *task = domain->mdev_addr_space->task;
+
+	if (!task->mm)
+		return -ENODEV;
+
+	ret = vaddr_get_pfn(task->mm, vaddr, prot, pfn_base);
+	if (ret)
+		return ret;
+
+	rsvd = is_invalid_reserved_pfn(*pfn_base);
+
+	if (!rsvd && !lock_cap && task->mm->locked_vm + 1 > limit) {
+		put_pfn(*pfn_base, prot);
+		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
+			limit << PAGE_SHIFT);
+		return -ENOMEM;
+	}
+
+	if (!rsvd && do_accounting)
+		vfio_lock_acct(task, 1);
+
+	return 1;
+}
+
+static void __vfio_unpin_pages_for_mdev(struct vfio_domain *domain,
+					unsigned long pfn, int prot,
+					bool do_accounting)
+{
+	put_pfn(pfn, prot);
+
+	if (do_accounting)
+		vfio_lock_acct(domain->mdev_addr_space->task, -1);
+}
+
+static int vfio_unpin_pfn(struct vfio_domain *domain,
+			  struct vfio_pfn *vpfn, bool do_accounting)
+{
+	__vfio_unpin_pages_for_mdev(domain, vpfn->pfn, vpfn->prot,
+				    do_accounting);
+
+	if (atomic_dec_and_test(&vpfn->ref_count))
+		vfio_remove_from_pfn_list(domain, vpfn);
+
+	return 1;
+}
+
+static long vfio_iommu_type1_pin_pages(void *iommu_data,
+				       unsigned long *user_pfn,
+				       long npage, int prot,
+				       unsigned long *phys_pfn)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain;
+	int i, j, ret;
+	long retpage;
+	unsigned long remote_vaddr;
+	unsigned long *pfn = phys_pfn;
+	struct vfio_dma *dma;
+	bool do_accounting = false;
+
+	if (!iommu || !user_pfn || !phys_pfn)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (!iommu->mediated_domain) {
+		ret = -EINVAL;
+		goto pin_done;
+	}
+
+	domain = iommu->mediated_domain;
+
+	/*
+	 * If an iommu capable domain exists in the container then all pages are
+	 * already pinned and accounted. Accounting should be done if there is no
+	 * iommu capable domain in the container.
+	 */
+	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+		dma_addr_t iova;
+
+		iova = user_pfn[i] << PAGE_SHIFT;
+
+		dma = vfio_find_dma(iommu, iova, 0);
+		if (!dma) {
+			ret = -EINVAL;
+			goto pin_unwind;
+		}
+
+		remote_vaddr = dma->vaddr + iova - dma->iova;
+
+		retpage = __vfio_pin_pages_for_mdev(domain, remote_vaddr, prot,
+						    &pfn[i], do_accounting);
+		if (retpage <= 0) {
+			WARN_ON(!retpage);
+			ret = (int)retpage;
+			goto pin_unwind;
+		}
+
+		mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
+
+		/* search if pfn exist */
+		p = vfio_find_pfn(domain, pfn[i]);
+		if (p) {
+			atomic_inc(&p->ref_count);
+			mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
+			continue;
+		}
+
+		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
+					   pfn[i], prot);
+		mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
+
+		if (ret) {
+			__vfio_unpin_pages_for_mdev(domain, pfn[i], prot,
+						    do_accounting);
+			goto pin_unwind;
+		}
+	}
+
+	ret = i;
+	goto pin_done;
+
+pin_unwind:
+	pfn[i] = 0;
+	mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
+	for (j = 0; j < i; j++) {
+		struct vfio_pfn *p;
+
+		p = vfio_find_pfn(domain, pfn[j]);
+		if (p)
+			vfio_unpin_pfn(domain, p, do_accounting);
+
+		pfn[j] = 0;
+	}
+	mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
+
+pin_done:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
+					 long npage)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	long unlocked = 0;
+	int i;
+
+	if (!iommu || !pfn)
+		return -EINVAL;
+
+	domain = iommu->mediated_domain;
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+
+		mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
+
+		/* verify if pfn exist in pfn_list */
+		p = vfio_find_pfn(domain, pfn[i]);
+		if (p)
+			unlocked += vfio_unpin_pfn(domain, p, true);
+
+		mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
+	}
 
 	return unlocked;
 }
@@ -341,6 +635,9 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 
 	if (!dma->size)
 		return;
+
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		return;
 	/*
 	 * We use the IOMMU to track the physical addresses, otherwise we'd
 	 * need a much more complicated tracking system.  Unfortunately that
@@ -382,15 +679,15 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		if (WARN_ON(!unmapped))
 			break;
 
-		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
-					     unmapped >> PAGE_SHIFT,
-					     dma->prot, false);
+		unlocked += __vfio_unpin_pages(phys >> PAGE_SHIFT,
+					       unmapped >> PAGE_SHIFT,
+					       dma->prot, false);
 		iova += unmapped;
 
 		cond_resched();
 	}
 
-	vfio_lock_acct(-unlocked);
+	vfio_lock_acct(current, -unlocked);
 }
 
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
@@ -611,10 +908,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	/* Insert zero-sized and grow as we map chunks of it */
 	vfio_link_dma(iommu, dma);
 
+	/* Don't pin and map if container doesn't contain IOMMU capable domain*/
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)) {
+		dma->size = size;
+		goto map_done;
+	}
+
 	while (size) {
 		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages(vaddr + dma->size,
-				       size >> PAGE_SHIFT, prot, &pfn);
+		npage = __vfio_pin_pages(vaddr + dma->size,
+					 size >> PAGE_SHIFT, prot, &pfn);
 		if (npage <= 0) {
 			WARN_ON(!npage);
 			ret = (int)npage;
@@ -624,7 +927,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		/* Map it! */
 		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
 		if (ret) {
-			vfio_unpin_pages(pfn, npage, prot, true);
+			__vfio_unpin_pages(pfn, npage, prot, true);
 			break;
 		}
 
@@ -635,6 +938,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (ret)
 		vfio_remove_dma(iommu, dma);
 
+map_done:
 	mutex_unlock(&iommu->lock);
 	return ret;
 }
@@ -734,11 +1038,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	__free_pages(pages, order);
 }
 
+static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
+				   struct iommu_group *iommu_group)
+{
+	struct vfio_group *g;
+
+	list_for_each_entry(g, &domain->group_list, next) {
+		if (g->iommu_group == iommu_group)
+			return g;
+	}
+
+	return NULL;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
-	struct vfio_group *group, *g;
+	struct vfio_group *group;
 	struct vfio_domain *domain, *d;
 	struct bus_type *bus = NULL;
 	int ret;
@@ -746,10 +1063,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	mutex_lock(&iommu->lock);
 
 	list_for_each_entry(d, &iommu->domain_list, next) {
-		list_for_each_entry(g, &d->group_list, next) {
-			if (g->iommu_group != iommu_group)
-				continue;
+		if (find_iommu_group(d, iommu_group)) {
+			mutex_unlock(&iommu->lock);
+			return -EINVAL;
+		}
+	}
 
+	if (iommu->mediated_domain) {
+		if (find_iommu_group(iommu->mediated_domain, iommu_group)) {
 			mutex_unlock(&iommu->lock);
 			return -EINVAL;
 		}
@@ -769,6 +1090,34 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_free;
 
+#if defined(CONFIG_VFIO_MDEV) || defined(CONFIG_VFIO_MDEV_MODULE)
+	if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
+		if (iommu->mediated_domain) {
+			list_add(&group->next,
+				 &iommu->mediated_domain->group_list);
+			kfree(domain);
+			mutex_unlock(&iommu->lock);
+			return 0;
+		}
+
+		domain->mdev_addr_space = kzalloc(sizeof(*domain->mdev_addr_space),
+						  GFP_KERNEL);
+		if (!domain->mdev_addr_space) {
+			ret = -ENOMEM;
+			goto out_free;
+		}
+
+		domain->mdev_addr_space->task = current;
+		INIT_LIST_HEAD(&domain->group_list);
+		list_add(&group->next, &domain->group_list);
+		domain->mdev_addr_space->pfn_list = RB_ROOT;
+		mutex_init(&domain->mdev_addr_space->pfn_list_lock);
+		iommu->mediated_domain = domain;
+		mutex_unlock(&iommu->lock);
+		return 0;
+	}
+#endif
+
 	domain->domain = iommu_domain_alloc(bus);
 	if (!domain->domain) {
 		ret = -EIO;
@@ -859,6 +1208,18 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
 		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
 
+static void vfio_mdev_unpin_all(struct vfio_domain *domain)
+{
+	struct rb_node *node;
+
+	mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
+	while ((node = rb_first(&domain->mdev_addr_space->pfn_list))) {
+		vfio_unpin_pfn(domain,
+				rb_entry(node, struct vfio_pfn, node), false);
+	}
+	mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -868,31 +1229,52 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(domain, &iommu->domain_list, next) {
-		list_for_each_entry(group, &domain->group_list, next) {
-			if (group->iommu_group != iommu_group)
-				continue;
+	if (iommu->mediated_domain) {
+		domain = iommu->mediated_domain;
+		group = find_iommu_group(domain, iommu_group);
+		if (group) {
+			list_del(&group->next);
+			kfree(group);
 
+			if (list_empty(&domain->group_list)) {
+				vfio_mdev_unpin_all(domain);
+				if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+					vfio_iommu_unmap_unpin_all(iommu);
+				kfree(domain);
+				iommu->mediated_domain = NULL;
+			}
+			goto detach_group_done;
+		}
+	}
+
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		goto detach_group_done;
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		group = find_iommu_group(domain, iommu_group);
+		if (group) {
 			iommu_detach_group(domain->domain, iommu_group);
 			list_del(&group->next);
 			kfree(group);
 			/*
 			 * Group ownership provides privilege, if the group
 			 * list is empty, the domain goes away.  If it's the
-			 * last domain, then all the mappings go away too.
+			 * last domain with iommu and mediated domain doesn't
+			 * exist, then all the mappings go away too.
 			 */
 			if (list_empty(&domain->group_list)) {
-				if (list_is_singular(&iommu->domain_list))
+				if (list_is_singular(&iommu->domain_list) &&
+				    !iommu->mediated_domain)
 					vfio_iommu_unmap_unpin_all(iommu);
 				iommu_domain_free(domain->domain);
 				list_del(&domain->next);
 				kfree(domain);
 			}
-			goto done;
+			break;
 		}
 	}
 
-done:
+detach_group_done:
 	mutex_unlock(&iommu->lock);
 }
 
@@ -924,27 +1306,48 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 	return iommu;
 }
 
+static void vfio_release_domain(struct vfio_domain *domain)
+{
+	struct vfio_group *group, *group_tmp;
+
+	list_for_each_entry_safe(group, group_tmp,
+				 &domain->group_list, next) {
+		if (!domain->mdev_addr_space)
+			iommu_detach_group(domain->domain, group->iommu_group);
+		list_del(&group->next);
+		kfree(group);
+	}
+
+	if (domain->mdev_addr_space)
+		vfio_mdev_unpin_all(domain);
+	else
+		iommu_domain_free(domain->domain);
+}
+
 static void vfio_iommu_type1_release(void *iommu_data)
 {
 	struct vfio_iommu *iommu = iommu_data;
 	struct vfio_domain *domain, *domain_tmp;
-	struct vfio_group *group, *group_tmp;
+
+	if (iommu->mediated_domain) {
+		vfio_release_domain(iommu->mediated_domain);
+		kfree(iommu->mediated_domain);
+		iommu->mediated_domain = NULL;
+	}
 
 	vfio_iommu_unmap_unpin_all(iommu);
 
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		goto release_exit;
+
 	list_for_each_entry_safe(domain, domain_tmp,
 				 &iommu->domain_list, next) {
-		list_for_each_entry_safe(group, group_tmp,
-					 &domain->group_list, next) {
-			iommu_detach_group(domain->domain, group->iommu_group);
-			list_del(&group->next);
-			kfree(group);
-		}
-		iommu_domain_free(domain->domain);
+		vfio_release_domain(domain);
 		list_del(&domain->next);
 		kfree(domain);
 	}
 
+release_exit:
 	kfree(iommu);
 }
 
@@ -1048,6 +1451,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.ioctl		= vfio_iommu_type1_ioctl,
 	.attach_group	= vfio_iommu_type1_attach_group,
 	.detach_group	= vfio_iommu_type1_detach_group,
+	.pin_pages	= vfio_iommu_type1_pin_pages,
+	.unpin_pages	= vfio_iommu_type1_unpin_pages,
 };
 
 static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 431b824b0d3e..abae882122aa 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -17,6 +17,7 @@
 #include <linux/workqueue.h>
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
+#include <linux/mdev.h>
 
 #define VFIO_PCI_OFFSET_SHIFT   40
 
@@ -82,7 +83,11 @@ struct vfio_iommu_driver_ops {
 					struct iommu_group *group);
 	void		(*detach_group)(void *iommu_data,
 					struct iommu_group *group);
-
+	long		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
+				     long npage, int prot,
+				     unsigned long *phys_pfn);
+	long		(*unpin_pages)(void *iommu_data, unsigned long *pfn,
+				       long npage);
 };
 
 extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
@@ -134,6 +139,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
 }
 #endif /* CONFIG_EEH */
 
+extern long vfio_pin_pages(struct mdev_device *mdev, unsigned long *user_pfn,
+			   long npage, int prot, unsigned long *phys_pfn);
+
+extern long vfio_unpin_pages(struct mdev_device *mdev, unsigned long *pfn,
+			     long npage);
+
 /*
  * IRQfd - generic
  */
-- 
2.7.0


+	vpfn->iova = iova;
+	vpfn->pfn = pfn;
+	vpfn->prot = prot;
+	atomic_set(&vpfn->ref_count, 1);
+	vfio_link_pfn(domain, vpfn);
+	return 0;
+}
+
+static void vfio_remove_from_pfn_list(struct vfio_domain *domain,
+				      struct vfio_pfn *vpfn)
+{
+	vfio_unlink_pfn(domain, vpfn);
+	kfree(vpfn);
+}
+
 struct vwork {
 	struct mm_struct	*mm;
 	long			npage;
@@ -150,17 +252,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
 	kfree(vwork);
 }
 
-static void vfio_lock_acct(long npage)
+static void vfio_lock_acct(struct task_struct *task, long npage)
 {
 	struct vwork *vwork;
 	struct mm_struct *mm;
 
-	if (!current->mm || !npage)
+	if (!task->mm || !npage)
 		return; /* process exited or nothing to do */
 
-	if (down_write_trylock(&current->mm->mmap_sem)) {
-		current->mm->locked_vm += npage;
-		up_write(&current->mm->mmap_sem);
+	if (down_write_trylock(&task->mm->mmap_sem)) {
+		task->mm->locked_vm += npage;
+		up_write(&task->mm->mmap_sem);
 		return;
 	}
 
@@ -172,7 +274,7 @@ static void vfio_lock_acct(long npage)
 	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
 	if (!vwork)
 		return;
-	mm = get_task_mm(current);
+	mm = get_task_mm(task);
 	if (!mm) {
 		kfree(vwork);
 		return;
@@ -228,20 +330,31 @@ static int put_pfn(unsigned long pfn, int prot)
 	return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+			 int prot, unsigned long *pfn)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct mm_struct *local_mm = mm ? mm : current->mm;
 	int ret = -EFAULT;
 
-	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+	if (mm) {
+		down_read(&local_mm->mmap_sem);
+		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
+					!!(prot & IOMMU_WRITE), 0, page, NULL);
+		up_read(&local_mm->mmap_sem);
+	} else
+		ret = get_user_pages_fast(vaddr, 1,
+					  !!(prot & IOMMU_WRITE), page);
+
+	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
 		return 0;
 	}
 
-	down_read(&current->mm->mmap_sem);
+	down_read(&local_mm->mmap_sem);
 
-	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
 		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
@@ -249,7 +362,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
 			ret = 0;
 	}
 
-	up_read(&current->mm->mmap_sem);
+	up_read(&local_mm->mmap_sem);
 
 	return ret;
 }
@@ -259,8 +372,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages(unsigned long vaddr, long npage,
-			   int prot, unsigned long *pfn_base)
+static long __vfio_pin_pages(unsigned long vaddr, long npage,
+			     int prot, unsigned long *pfn_base)
 {
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	bool lock_cap = capable(CAP_IPC_LOCK);
@@ -270,7 +383,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	if (!current->mm)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
+	ret = vaddr_get_pfn(NULL, vaddr, prot, pfn_base);
 	if (ret)
 		return ret;
 
@@ -285,7 +398,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 
 	if (unlikely(disable_hugepages)) {
 		if (!rsvd)
-			vfio_lock_acct(1);
+			vfio_lock_acct(current, 1);
 		return 1;
 	}
 
@@ -293,7 +406,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
 		unsigned long pfn = 0;
 
-		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		ret = vaddr_get_pfn(NULL, vaddr, prot, &pfn);
 		if (ret)
 			break;
 
@@ -313,13 +426,13 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	}
 
 	if (!rsvd)
-		vfio_lock_acct(i);
+		vfio_lock_acct(current, i);
 
 	return i;
 }
 
-static long vfio_unpin_pages(unsigned long pfn, long npage,
-			     int prot, bool do_accounting)
+static long __vfio_unpin_pages(unsigned long pfn, long npage, int prot,
+			       bool do_accounting)
 {
 	unsigned long unlocked = 0;
 	long i;
@@ -328,7 +441,188 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
 		unlocked += put_pfn(pfn++, prot);
 
 	if (do_accounting)
-		vfio_lock_acct(-unlocked);
+		vfio_lock_acct(current, -unlocked);
+	return unlocked;
+}
+
+static long __vfio_pin_pages_for_mdev(struct vfio_domain *domain,
+				      unsigned long vaddr, int prot,
+				      unsigned long *pfn_base,
+				      bool do_accounting)
+{
+	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	bool lock_cap = capable(CAP_IPC_LOCK);
+	long ret;
+	bool rsvd;
+	struct task_struct *task = domain->mdev_addr_space->task;
+
+	if (!task->mm)
+		return -ENODEV;
+
+	ret = vaddr_get_pfn(task->mm, vaddr, prot, pfn_base);
+	if (ret)
+		return ret;
+
+	rsvd = is_invalid_reserved_pfn(*pfn_base);
+
+	if (!rsvd && !lock_cap && task->mm->locked_vm + 1 > limit) {
+		put_pfn(*pfn_base, prot);
+		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
+			limit << PAGE_SHIFT);
+		return -ENOMEM;
+	}
+
+	if (!rsvd && do_accounting)
+		vfio_lock_acct(task, 1);
+
+	return 1;
+}
+
+static void __vfio_unpin_pages_for_mdev(struct vfio_domain *domain,
+					unsigned long pfn, int prot,
+					bool do_accounting)
+{
+	put_pfn(pfn, prot);
+
+	if (do_accounting)
+		vfio_lock_acct(domain->mdev_addr_space->task, -1);
+}
+
+static int vfio_unpin_pfn(struct vfio_domain *domain,
+			  struct vfio_pfn *vpfn, bool do_accounting)
+{
+	__vfio_unpin_pages_for_mdev(domain, vpfn->pfn, vpfn->prot,
+				    do_accounting);
+
+	if (atomic_dec_and_test(&vpfn->ref_count))
+		vfio_remove_from_pfn_list(domain, vpfn);
+
+	return 1;
+}
+
+static long vfio_iommu_type1_pin_pages(void *iommu_data,
+				       unsigned long *user_pfn,
+				       long npage, int prot,
+				       unsigned long *phys_pfn)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain;
+	int i, j, ret;
+	long retpage;
+	unsigned long remote_vaddr;
+	unsigned long *pfn = phys_pfn;
+	struct vfio_dma *dma;
+	bool do_accounting = false;
+
+	if (!iommu || !user_pfn || !phys_pfn)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (!iommu->mediated_domain) {
+		ret = -EINVAL;
+		goto pin_done;
+	}
+
+	domain = iommu->mediated_domain;
+
+	/*
+	 * If an iommu capable domain exists in the container, then all pages
+	 * are already pinned and accounted. Accounting should be done only if
+	 * there is no iommu capable domain in the container.
+	 */
+	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+		dma_addr_t iova;
+
+		iova = user_pfn[i] << PAGE_SHIFT;
+
+		dma = vfio_find_dma(iommu, iova, 0);
+		if (!dma) {
+			ret = -EINVAL;
+			goto pin_unwind;
+		}
+
+		remote_vaddr = dma->vaddr + iova - dma->iova;
+
+		retpage = __vfio_pin_pages_for_mdev(domain, remote_vaddr, prot,
+						    &pfn[i], do_accounting);
+		if (retpage <= 0) {
+			WARN_ON(!retpage);
+			ret = (int)retpage;
+			goto pin_unwind;
+		}
+
+		mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
+
+		/* search if pfn exists */
+		p = vfio_find_pfn(domain, pfn[i]);
+		if (p) {
+			atomic_inc(&p->ref_count);
+			mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
+			continue;
+		}
+
+		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
+					   pfn[i], prot);
+		mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
+
+		if (ret) {
+			__vfio_unpin_pages_for_mdev(domain, pfn[i], prot,
+						    do_accounting);
+			goto pin_unwind;
+		}
+	}
+
+	ret = i;
+	goto pin_done;
+
+pin_unwind:
+	pfn[i] = 0;
+	mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
+	for (j = 0; j < i; j++) {
+		struct vfio_pfn *p;
+
+		p = vfio_find_pfn(domain, pfn[j]);
+		if (p)
+			vfio_unpin_pfn(domain, p, do_accounting);
+
+		pfn[j] = 0;
+	}
+	mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
+
+pin_done:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
+					 long npage)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	long unlocked = 0;
+	int i;
+
+	if (!iommu || !pfn)
+		return -EINVAL;
+
+	domain = iommu->mediated_domain;
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+
+		mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
+
+		/* verify if pfn exists in pfn_list */
+		p = vfio_find_pfn(domain, pfn[i]);
+		if (p)
+			unlocked += vfio_unpin_pfn(domain, p, true);
+
+		mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
+	}
 
 	return unlocked;
 }
@@ -341,6 +635,9 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 
 	if (!dma->size)
 		return;
+
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		return;
 	/*
 	 * We use the IOMMU to track the physical addresses, otherwise we'd
 	 * need a much more complicated tracking system.  Unfortunately that
@@ -382,15 +679,15 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		if (WARN_ON(!unmapped))
 			break;
 
-		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
-					     unmapped >> PAGE_SHIFT,
-					     dma->prot, false);
+		unlocked += __vfio_unpin_pages(phys >> PAGE_SHIFT,
+					       unmapped >> PAGE_SHIFT,
+					       dma->prot, false);
 		iova += unmapped;
 
 		cond_resched();
 	}
 
-	vfio_lock_acct(-unlocked);
+	vfio_lock_acct(current, -unlocked);
 }
 
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
@@ -611,10 +908,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	/* Insert zero-sized and grow as we map chunks of it */
 	vfio_link_dma(iommu, dma);
 
+	/* Don't pin and map if container doesn't have IOMMU capable domain */
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)) {
+		dma->size = size;
+		goto map_done;
+	}
+
 	while (size) {
 		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages(vaddr + dma->size,
-				       size >> PAGE_SHIFT, prot, &pfn);
+		npage = __vfio_pin_pages(vaddr + dma->size,
+					 size >> PAGE_SHIFT, prot, &pfn);
 		if (npage <= 0) {
 			WARN_ON(!npage);
 			ret = (int)npage;
@@ -624,7 +927,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		/* Map it! */
 		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
 		if (ret) {
-			vfio_unpin_pages(pfn, npage, prot, true);
+			__vfio_unpin_pages(pfn, npage, prot, true);
 			break;
 		}
 
@@ -635,6 +938,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (ret)
 		vfio_remove_dma(iommu, dma);
 
+map_done:
 	mutex_unlock(&iommu->lock);
 	return ret;
 }
@@ -734,11 +1038,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	__free_pages(pages, order);
 }
 
+static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
+				   struct iommu_group *iommu_group)
+{
+	struct vfio_group *g;
+
+	list_for_each_entry(g, &domain->group_list, next) {
+		if (g->iommu_group == iommu_group)
+			return g;
+	}
+
+	return NULL;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
-	struct vfio_group *group, *g;
+	struct vfio_group *group;
 	struct vfio_domain *domain, *d;
 	struct bus_type *bus = NULL;
 	int ret;
@@ -746,10 +1063,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	mutex_lock(&iommu->lock);
 
 	list_for_each_entry(d, &iommu->domain_list, next) {
-		list_for_each_entry(g, &d->group_list, next) {
-			if (g->iommu_group != iommu_group)
-				continue;
+		if (find_iommu_group(d, iommu_group)) {
+			mutex_unlock(&iommu->lock);
+			return -EINVAL;
+		}
+	}
 
+	if (iommu->mediated_domain) {
+		if (find_iommu_group(iommu->mediated_domain, iommu_group)) {
 			mutex_unlock(&iommu->lock);
 			return -EINVAL;
 		}
@@ -769,6 +1090,34 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_free;
 
+#if defined(CONFIG_VFIO_MDEV) || defined(CONFIG_VFIO_MDEV_MODULE)
+	if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
+		if (iommu->mediated_domain) {
+			list_add(&group->next,
+				 &iommu->mediated_domain->group_list);
+			kfree(domain);
+			mutex_unlock(&iommu->lock);
+			return 0;
+		}
+
+		domain->mdev_addr_space = kzalloc(sizeof(*domain->mdev_addr_space),
+						  GFP_KERNEL);
+		if (!domain->mdev_addr_space) {
+			ret = -ENOMEM;
+			goto out_free;
+		}
+
+		domain->mdev_addr_space->task = current;
+		INIT_LIST_HEAD(&domain->group_list);
+		list_add(&group->next, &domain->group_list);
+		domain->mdev_addr_space->pfn_list = RB_ROOT;
+		mutex_init(&domain->mdev_addr_space->pfn_list_lock);
+		iommu->mediated_domain = domain;
+		mutex_unlock(&iommu->lock);
+		return 0;
+	}
+#endif
+
 	domain->domain = iommu_domain_alloc(bus);
 	if (!domain->domain) {
 		ret = -EIO;
@@ -859,6 +1208,18 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
 		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
 
+static void vfio_mdev_unpin_all(struct vfio_domain *domain)
+{
+	struct rb_node *node;
+
+	mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
+	while ((node = rb_first(&domain->mdev_addr_space->pfn_list))) {
+		vfio_unpin_pfn(domain,
+				rb_entry(node, struct vfio_pfn, node), false);
+	}
+	mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -868,31 +1229,52 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(domain, &iommu->domain_list, next) {
-		list_for_each_entry(group, &domain->group_list, next) {
-			if (group->iommu_group != iommu_group)
-				continue;
+	if (iommu->mediated_domain) {
+		domain = iommu->mediated_domain;
+		group = find_iommu_group(domain, iommu_group);
+		if (group) {
+			list_del(&group->next);
+			kfree(group);
 
+			if (list_empty(&domain->group_list)) {
+				vfio_mdev_unpin_all(domain);
+				if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+					vfio_iommu_unmap_unpin_all(iommu);
+				kfree(domain);
+				iommu->mediated_domain = NULL;
+			}
+			goto detach_group_done;
+		}
+	}
+
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		goto detach_group_done;
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		group = find_iommu_group(domain, iommu_group);
+		if (group) {
 			iommu_detach_group(domain->domain, iommu_group);
 			list_del(&group->next);
 			kfree(group);
 			/*
 			 * Group ownership provides privilege, if the group
 			 * list is empty, the domain goes away.  If it's the
-			 * last domain, then all the mappings go away too.
+			 * last domain with iommu and a mediated domain doesn't
+			 * exist, then all the mappings go away too.
 			 */
 			if (list_empty(&domain->group_list)) {
-				if (list_is_singular(&iommu->domain_list))
+				if (list_is_singular(&iommu->domain_list) &&
+				    !iommu->mediated_domain)
 					vfio_iommu_unmap_unpin_all(iommu);
 				iommu_domain_free(domain->domain);
 				list_del(&domain->next);
 				kfree(domain);
 			}
-			goto done;
+			break;
 		}
 	}
 
-done:
+detach_group_done:
 	mutex_unlock(&iommu->lock);
 }
 
@@ -924,27 +1306,48 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 	return iommu;
 }
 
+static void vfio_release_domain(struct vfio_domain *domain)
+{
+	struct vfio_group *group, *group_tmp;
+
+	list_for_each_entry_safe(group, group_tmp,
+				 &domain->group_list, next) {
+		if (!domain->mdev_addr_space)
+			iommu_detach_group(domain->domain, group->iommu_group);
+		list_del(&group->next);
+		kfree(group);
+	}
+
+	if (domain->mdev_addr_space)
+		vfio_mdev_unpin_all(domain);
+	else
+		iommu_domain_free(domain->domain);
+}
+
 static void vfio_iommu_type1_release(void *iommu_data)
 {
 	struct vfio_iommu *iommu = iommu_data;
 	struct vfio_domain *domain, *domain_tmp;
-	struct vfio_group *group, *group_tmp;
+
+	if (iommu->mediated_domain) {
+		vfio_release_domain(iommu->mediated_domain);
+		kfree(iommu->mediated_domain);
+		iommu->mediated_domain = NULL;
+	}
 
 	vfio_iommu_unmap_unpin_all(iommu);
 
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		goto release_exit;
+
 	list_for_each_entry_safe(domain, domain_tmp,
 				 &iommu->domain_list, next) {
-		list_for_each_entry_safe(group, group_tmp,
-					 &domain->group_list, next) {
-			iommu_detach_group(domain->domain, group->iommu_group);
-			list_del(&group->next);
-			kfree(group);
-		}
-		iommu_domain_free(domain->domain);
+		vfio_release_domain(domain);
 		list_del(&domain->next);
 		kfree(domain);
 	}
 
+release_exit:
 	kfree(iommu);
 }
 
@@ -1048,6 +1451,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.ioctl		= vfio_iommu_type1_ioctl,
 	.attach_group	= vfio_iommu_type1_attach_group,
 	.detach_group	= vfio_iommu_type1_detach_group,
+	.pin_pages	= vfio_iommu_type1_pin_pages,
+	.unpin_pages	= vfio_iommu_type1_unpin_pages,
 };
 
 static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 431b824b0d3e..abae882122aa 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -17,6 +17,7 @@
 #include <linux/workqueue.h>
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
+#include <linux/mdev.h>
 
 #define VFIO_PCI_OFFSET_SHIFT   40
 
@@ -82,7 +83,11 @@ struct vfio_iommu_driver_ops {
 					struct iommu_group *group);
 	void		(*detach_group)(void *iommu_data,
 					struct iommu_group *group);
-
+	long		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
+				     long npage, int prot,
+				     unsigned long *phys_pfn);
+	long		(*unpin_pages)(void *iommu_data, unsigned long *pfn,
+				       long npage);
 };
 
 extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
@@ -134,6 +139,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
 }
 #endif /* CONFIG_EEH */
 
+extern long vfio_pin_pages(struct mdev_device *mdev, unsigned long *user_pfn,
+			   long npage, int prot, unsigned long *phys_pfn);
+
+extern long vfio_unpin_pages(struct mdev_device *mdev, unsigned long *pfn,
+			     long npage);
+
 /*
  * IRQfd - generic
  */
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [PATCH v6 4/4] docs: Add Documentation for Mediated devices
  2016-08-03 19:03 ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-03 19:03   ` Kirti Wankhede
  -1 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-03 19:03 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

Add file Documentation/vfio-mediated-device.txt that includes details of
the mediated device framework.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
---
 Documentation/vfio-mediated-device.txt | 235 +++++++++++++++++++++++++++++++++
 1 file changed, 235 insertions(+)
 create mode 100644 Documentation/vfio-mediated-device.txt

diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt
new file mode 100644
index 000000000000..029152670141
--- /dev/null
+++ b/Documentation/vfio-mediated-device.txt
@@ -0,0 +1,235 @@
+VFIO Mediated devices [1]
+-------------------------------------------------------------------------------
+
+There are more and more use cases/demands to virtualize DMA devices which
+don't have SR-IOV capability built-in. To do this, drivers of different
+devices had to develop their own management interface and set of APIs, and
+then integrate them into user space software. We've identified common
+requirements and a unified management interface for such devices to make user
+space software integration easier.
+
+The VFIO driver framework provides unified APIs for direct device access. It is
+an IOMMU/device-agnostic framework for exposing direct device access to user
+space in a secure, IOMMU-protected environment. This framework is used for
+multiple devices, such as GPUs, network adapters and compute accelerators. With
+direct device access, virtual machines or user space applications have direct
+access to the physical device. This framework is reused for mediated devices.
+
+The mediated core driver provides a common interface for mediated device
+management that can be used by drivers of different devices. This module
+provides a generic interface to create/destroy mediated devices, add/remove
+them to/from a mediated bus driver, and add/remove devices to/from an IOMMU
+group. It also provides an interface to register different types of bus
+drivers; for example, the mediated VFIO PCI driver is designed for mediated
+PCI devices and supports the VFIO APIs. Similarly, a driver can be designed to
+support any type of mediated device and added to this framework. The mediated
+bus driver adds/deletes mediated devices to/from the VFIO group.
+
+Below is a high-level block diagram, with NVIDIA, Intel and IBM devices as
+examples, since these are the devices which are going to actively use this
+module as of now. NVIDIA and Intel use the vfio_mpci.ko module for their GPUs,
+which are PCI devices. Channel I/O devices need a different bus driver,
+vfio_mccw.ko.
+
+
+     +---------------+
+     |               |
+     | +-----------+ |  mdev_register_driver() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         |              |
+     | |  mdev     | +------------------------>+ vfio_mpci.ko |<-> VFIO user
+     | |  bus      | |     probe()/remove()    |              |    APIs
+     | |  driver   | |                         |              |
+     | |           | |                         +--------------+
+     | |           | |  mdev_register_driver() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         |              |
+     | |           | +------------------------>+ vfio_mccw.ko |<-> VFIO user
+     | +-----------+ |     probe()/remove()    |              |    APIs
+     |               |                         |              |
+     |  MDEV CORE    |                         +--------------+
+     |   MODULE      |
+     |   mdev.ko     |
+     | +-----------+ |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         |  nvidia.ko   |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | | Physical  | |
+     | |  device   | |  mdev_register_device() +--------------+
+     | | interface | |<------------------------+              |
+     | |           | |                         |  i915.ko     |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | |           | |
+     | |           | |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         | ccw_device.ko|<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | +-----------+ |
+     +---------------+
+
+
+Registration Interfaces
+-------------------------------------------------------------------------------
+
+The mediated core driver provides two types of registration interfaces:
+
+1. Registration interface for mediated bus driver:
+-------------------------------------------------
+     /*
+      * struct mdev_driver [2] - Mediated device's driver
+      * @name: driver name
+      * @probe: called when new device created
+      * @remove: called when device removed
+      * @match: called when a new device or driver is added for this bus.
+      * Returns 1 if the given device can be handled by the given driver
+      * and zero otherwise.
+      * @driver: device driver structure
+      */
+     struct mdev_driver {
+	     const char *name;
+	     int  (*probe)  (struct device *dev);
+	     void (*remove) (struct device *dev);
+	     int  (*match)(struct device *dev);
+	     struct device_driver    driver;
+     };
+
+A mediated bus driver for mdev should use these interfaces to register with
+and unregister from the core driver, respectively:
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+The mediated bus driver is responsible for adding/deleting mediated devices
+to/from the VFIO group when devices are bound to and unbound from the driver.
+
+2. Physical device driver interface:
+-----------------------------------
+This interface [3] provides a set of APIs to manage physical-device-related
+work in its driver. The APIs are:
+
+* dev_attr_groups: attributes of the parent device.
+* mdev_attr_groups: attributes of the mediated device.
+* supported_config: to provide the driver's supported configuration list.
+* create: to allocate basic resources in driver for a mediated device.
+* destroy: to free resources in driver when mediated device is destroyed.
+* reset: to free and reallocate resources in driver on mediated device reset.
+* start: to initiate mediated device initialization process from driver.
+* stop: to tear down the mediated device during the teardown process.
+* read: read emulation callback.
+* write: write emulation callback.
+* set_irqs: gives interrupt configuration information that VMM sets.
+* get_region_info: to provide region size and its flags for the mediated device.
+* validate_map_request: to validate remap pfn request.
+
+Drivers should use these interfaces to register a device with and unregister
+it from the mdev core driver, respectively:
+
+extern int  mdev_register_device(struct device *dev,
+                                 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+Physical Mapping tracking APIs:
+-------------------------------
+The core module keeps track of physical mappings for each mdev device. The
+following APIs are used by the mediated bus driver to add and delete mappings
+in the tracking logic:
+    int mdev_add_phys_mapping(struct mdev_device *mdev,
+                              struct address_space *mapping,
+                              unsigned long addr, unsigned long size)
+    void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr)
+
+API to be used by the vendor driver to invalidate a mapping:
+    int mdev_device_invalidate_mapping(struct mdev_device *mdev,
+                                       unsigned long addr, unsigned long size)
+
+Mediated device management interface via sysfs
+-------------------------------------------------------------------------------
+This is the interface that allows user space software, such as libvirt, to
+query and configure mediated devices in a HW-agnostic fashion. This management
+interface provides flexibility to the underlying physical device's driver to
+support mediated device hotplug, multiple mediated devices per virtual machine,
+multiple mediated devices from different physical devices, etc.
+
+Under per-physical device sysfs:
+--------------------------------
+
+* mdev_supported_types: (read only)
+    Lists the currently supported mediated device types and their details.
+
+* mdev_create: (write only)
+	Create a mediated device on the target physical device.
+	Input syntax: <UUID:idx:params>
+	where,
+		UUID: mediated device's UUID
+		idx: mediated device index inside a VM
+		params: extra parameters required by driver
+	Example:
+	# echo "12345678-1234-1234-1234-123456789abc:0:0" >
+				 /sys/bus/pci/devices/0000\:05\:00.0/mdev_create
+
+* mdev_destroy: (write only)
+	Destroy a mediated device on a target physical device.
+	Input syntax: <UUID:idx>
+	where,
+		UUID: mediated device's UUID
+		idx: mediated device index inside a VM
+	Example:
+	# echo "12345678-1234-1234-1234-123456789abc:0" >
+			       /sys/bus/pci/devices/0000\:05\:00.0/mdev_destroy
+
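A management tool would compose this create string and write it to the per-device sysfs node. A minimal userspace sketch of the formatting step follows; the helper name and buffer handling are illustrative only, and nothing here is part of the kernel ABI beyond the <UUID:idx:params> syntax documented above:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Compose the <UUID:idx:params> string expected by the mdev_create
 * sysfs attribute.  A real tool would then write the result to
 * /sys/bus/pci/devices/<BDF>/mdev_create. */
static int format_mdev_create(char *buf, size_t len, const char *uuid,
                              unsigned int idx, const char *params)
{
	return snprintf(buf, len, "%s:%u:%s", uuid, idx, params);
}
```

Echoing the resulting string into the sysfs attribute, as in the example above, is equivalent.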
+Under mdev class sysfs /sys/class/mdev/:
+----------------------------------------
+
+* mdev_start: (write only)
+	This triggers the registration interface to notify the driver to
+	commit mediated device resources for the target VM.
+	mdev_start is a synchronous call; a successful return indicates that
+	all the requested mdev resources have been fully committed, and the
+	VMM should continue.
+	Input syntax: <UUID>
+	Example:
+	# echo "12345678-1234-1234-1234-123456789abc" >
+						/sys/class/mdev/mdev_start
+
+* mdev_stop: (write only)
+	This triggers the registration interface to notify the driver to
+	release the resources of the target VM's mediated device.
+	Input syntax: <UUID>
+	Example:
+	# echo "12345678-1234-1234-1234-123456789abc" >
+						 /sys/class/mdev/mdev_stop
+
+Mediated device Hotplug:
+-----------------------
+
+To support mediated device hotplug, <mdev_create> and <mdev_destroy> can be
+accessed during VM runtime, and the corresponding registration callback is
+invoked to allow the driver to support hotplug.
+
+Translation APIs for Mediated device
+------------------------------------------------------------------------------
+
+The VFIO driver provides these APIs for user pfn to host pfn translation:
+
+extern long vfio_pin_pages(struct mdev_device *mdev, unsigned long *user_pfn,
+                           long npage, int prot, unsigned long *phys_pfn);
+
+extern long vfio_unpin_pages(struct mdev_device *mdev, unsigned long *pfn,
+                             long npage);
+
+These functions call back into the backend IOMMU module using two callbacks of
+struct vfio_iommu_driver_ops, pin_pages and unpin_pages [4]. Currently these
+are supported by the TYPE1 IOMMU module. To enable the same for other IOMMU
+backend modules, such as the PPC64 sPAPR module, those modules need to
+provide these two callback functions.
+
+References
+-------------------------------------------------------------------------------
+
+[1] See Documentation/vfio.txt for more information on VFIO.
+[2] struct mdev_driver in include/linux/mdev.h
+[3] struct parent_ops in include/linux/mdev.h
+[4] struct vfio_iommu_driver_ops in include/linux/vfio.h
+
-- 
2.7.0


^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [Qemu-devel] [PATCH v6 4/4] docs: Add Documentation for Mediated devices
@ 2016-08-03 19:03   ` Kirti Wankhede
  0 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-03 19:03 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

Add file Documentation/vfio-mediated-device.txt that include details of
mediated device framework.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
---
 Documentation/vfio-mediated-device.txt | 235 +++++++++++++++++++++++++++++++++
 1 file changed, 235 insertions(+)
 create mode 100644 Documentation/vfio-mediated-device.txt

diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt
new file mode 100644
index 000000000000..029152670141
--- /dev/null
+++ b/Documentation/vfio-mediated-device.txt
@@ -0,0 +1,235 @@
+VFIO Mediated devices [1]
+-------------------------------------------------------------------------------
+
+More and more use cases demand virtualization of DMA devices that do not
+have SR-IOV capability built in. To do this, drivers of different devices
+had to develop their own management interface and set of APIs, and then
+integrate them into user space software. We have identified common
+requirements and a unified management interface for such devices to make
+user space software integration easier.
+
+The VFIO driver framework provides unified APIs for direct device access. It is
+an IOMMU/device agnostic framework for exposing direct device access to
+user space, in a secure, IOMMU protected environment. This framework is
+used for multiple devices like GPUs, network adapters and compute accelerators.
+With direct device access, virtual machines or user space applications access
+the physical device directly. This framework is reused for mediated devices.
+
+The mediated core driver provides a common interface for mediated device
+management that can be used by drivers of different devices. This module
+provides a generic interface to create/destroy mediated devices, add/remove
+them to/from a mediated bus driver, and add/remove devices to/from an IOMMU
+group. It also provides an interface to register different types of bus
+drivers; for example, the mediated VFIO PCI driver is designed for mediated
+PCI devices and supports VFIO APIs. Similarly, a driver can be designed to
+support any type of mediated device and added to this framework. The
+mediated bus driver adds/deletes mediated devices to/from a VFIO group.
+
+Below is the high-level block diagram, with NVIDIA, Intel and IBM devices
+as examples, since these are the devices that are going to actively use
+this module as of now. NVIDIA and Intel use the vfio_mpci.ko module for
+their GPUs, which are PCI devices. Channel I/O devices need a different
+bus driver, vfio_mccw.ko.
+
+
+     +---------------+
+     |               |
+     | +-----------+ |  mdev_register_driver() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         |              |
+     | |  mdev     | +------------------------>+ vfio_mpci.ko |<-> VFIO user
+     | |  bus      | |     probe()/remove()    |              |    APIs
+     | |  driver   | |                         |              |
+     | |           | |                         +--------------+
+     | |           | |  mdev_register_driver() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         |              |
+     | |           | +------------------------>+ vfio_mccw.ko |<-> VFIO user
+     | +-----------+ |     probe()/remove()    |              |    APIs
+     |               |                         |              |
+     |  MDEV CORE    |                         +--------------+
+     |   MODULE      |
+     |   mdev.ko     |
+     | +-----------+ |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         |  nvidia.ko   |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | | Physical  | |
+     | |  device   | |  mdev_register_device() +--------------+
+     | | interface | |<------------------------+              |
+     | |           | |                         |  i915.ko     |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | |           | |
+     | |           | |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         | ccw_device.ko|<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | +-----------+ |
+     +---------------+
+
+
+Registration Interfaces
+-------------------------------------------------------------------------------
+
+Mediated core driver provides two types of registration interfaces:
+
+1. Registration interface for mediated bus driver:
+-------------------------------------------------
+     /*
+      * struct mdev_driver [2] - Mediated device's driver
+      * @name: driver name
+      * @probe: called when new device created
+      * @remove: called when device removed
+      * @match: called when a new device or driver is added for this bus.
+      *         Returns 1 if the given device can be handled by the given
+      *         driver, and zero otherwise.
+      * @driver: device driver structure
+      */
+     struct mdev_driver {
+	     const char *name;
+	     int  (*probe)  (struct device *dev);
+	     void (*remove) (struct device *dev);
+	     int  (*match)(struct device *dev);
+	     struct device_driver    driver;
+     };
+
+A mediated bus driver for mdev should use this interface to register with
+and unregister from the core driver, respectively:
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+The mediated bus driver is responsible for adding/deleting mediated devices
+to/from the VFIO group when devices are bound to and unbound from the driver.
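The registration and probe flow above can be sketched as a toy userspace model. Everything with a `_model` suffix and the `demo_*` callbacks are hypothetical illustrations, not kernel API; the real structures live in include/linux/mdev.h:

```c
#include <assert.h>
#include <stddef.h>

/* Toy userspace model of the mdev bus-driver registration flow -- NOT
 * kernel code.  When a driver registers and a device appears, the bus
 * core calls match(), and on a match, probe(). */

struct device {
    const char *name;
    int probed;          /* set by the demo probe() callback */
};

struct mdev_driver_model {
    const char *name;
    int  (*probe)(struct device *dev);
    void (*remove)(struct device *dev);
    int  (*match)(struct device *dev);
};

static struct mdev_driver_model *registered_drv;

/* Stand-in for mdev_register_driver(): remember the bus driver. */
static int mdev_register_driver_model(struct mdev_driver_model *drv)
{
    if (!drv || !drv->probe || !drv->match)
        return -1;
    registered_drv = drv;
    return 0;
}

/* Simulates the bus core reacting to a newly created mediated device:
 * call match(), and on a match, call probe(). */
static int mdev_device_add_model(struct device *dev)
{
    if (registered_drv && registered_drv->match(dev))
        return registered_drv->probe(dev);
    return -1;
}

static int  demo_probe(struct device *dev)  { dev->probed = 1; return 0; }
static void demo_remove(struct device *dev) { dev->probed = 0; }
static int  demo_match(struct device *dev)  { (void)dev; return 1; }
```

Registration only wires up the callbacks; probe() fires later, when a device is added to the bus, which is why probe/remove are the natural points for the VFIO group add/delete mentioned above.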
+
+2. Physical device driver interface:
+-----------------------------------
+This interface [3] provides a set of APIs to manage physical-device-related
+work in the vendor driver. The APIs are:
+
+* dev_attr_groups: attributes of the parent device.
+* mdev_attr_groups: attributes of the mediated device.
+* supported_config: to provide the configuration list supported by the driver.
+* create: to allocate basic resources in the driver for a mediated device.
+* destroy: to free resources in the driver when a mediated device is destroyed.
+* reset: to free and reallocate resources in the driver on mediated device reset.
+* start: to initiate the mediated device initialization process in the driver.
+* stop: to tear down mediated device resources during teardown.
+* read: read emulation callback.
+* write: write emulation callback.
+* set_irqs: gives the interrupt configuration information that the VMM sets.
+* get_region_info: to provide region size and flags for the mediated device.
+* validate_map_request: to validate a remap pfn request.
+
+Drivers should use this interface to register a device with and unregister
+it from the mdev core driver, respectively:
+
+extern int  mdev_register_device(struct device *dev,
+                                 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+Physical Mapping tracking APIs:
+-------------------------------
+The core module keeps track of physical mappings for each mdev device. The
+following APIs are used by the mediated device bus driver to add and delete
+mappings in the tracking logic:
+    int mdev_add_phys_mapping(struct mdev_device *mdev,
+                              struct address_space *mapping,
+                              unsigned long addr, unsigned long size)
+    void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr)
+
+API to be used by the vendor driver to invalidate a mapping:
+    int mdev_device_invalidate_mapping(struct mdev_device *mdev,
+                                       unsigned long addr, unsigned long size)
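A minimal sketch of the tracking semantics these APIs imply — assumed behaviour only, with hypothetical helper names; the real logic lives in the mdev core module and operates on struct mdev_device and struct address_space:

```c
#include <assert.h>

/* Toy model of the physical-mapping tracking logic: each tracked mapping
 * is an [addr, addr + size) range; invalidation drops every tracked range
 * that overlaps the requested range and reports how many were hit. */

#define MAX_MAPPINGS 16

struct phys_mapping { unsigned long addr, size; int in_use; };
static struct phys_mapping mappings[MAX_MAPPINGS];

/* Stand-in for mdev_add_phys_mapping(): record a tracked range. */
static int add_phys_mapping_model(unsigned long addr, unsigned long size)
{
    for (int i = 0; i < MAX_MAPPINGS; i++) {
        if (!mappings[i].in_use) {
            mappings[i] = (struct phys_mapping){ addr, size, 1 };
            return 0;
        }
    }
    return -1;   /* tracking table full */
}

/* Stand-in for mdev_del_phys_mapping(): forget the range at addr. */
static void del_phys_mapping_model(unsigned long addr)
{
    for (int i = 0; i < MAX_MAPPINGS; i++)
        if (mappings[i].in_use && mappings[i].addr == addr)
            mappings[i].in_use = 0;
}

/* Stand-in for mdev_device_invalidate_mapping(): drop every tracked
 * range overlapping [addr, addr + size); returns the number dropped. */
static int invalidate_mapping_model(unsigned long addr, unsigned long size)
{
    int hits = 0;
    for (int i = 0; i < MAX_MAPPINGS; i++) {
        struct phys_mapping *m = &mappings[i];
        if (m->in_use && m->addr < addr + size && addr < m->addr + m->size) {
            m->in_use = 0;   /* real code would unmap the range */
            hits++;
        }
    }
    return hits;
}
```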
+
+Mediated device management interface via sysfs
+-------------------------------------------------------------------------------
+This is the interface that allows user space software, such as libvirt, to
+query and configure mediated devices in a hardware-agnostic fashion. This
+management interface gives the underlying physical device's driver the
+flexibility to support mediated device hotplug, multiple mediated devices
+per virtual machine, multiple mediated devices from different physical
+devices, etc.
+
+Under per-physical device sysfs:
+--------------------------------
+
+* mdev_supported_types: (read only)
+    Lists the currently supported mediated device types and their details.
+
+* mdev_create: (write only)
+	Create a mediated device on target physical device.
+	Input syntax: <UUID:idx:params>
+	where,
+		UUID: mediated device's UUID
+		idx: mediated device index inside a VM
+		params: extra parameters required by driver
+	Example:
+	# echo "12345678-1234-1234-1234-123456789abc:0:0" >
+				 /sys/bus/pci/devices/0000\:05\:00.0/mdev_create
+
+* mdev_destroy: (write only)
+	Destroy a mediated device on a target physical device.
+	Input syntax: <UUID:idx>
+	where,
+		UUID: mediated device's UUID
+		idx: mediated device index inside a VM
+	Example:
+	# echo "12345678-1234-1234-1234-123456789abc:0" >
+			       /sys/bus/pci/devices/0000\:05\:00.0/mdev_destroy
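The <UUID:idx:params> string written to mdev_create can be parsed as sketched below (`parse_mdev_create_arg` is a hypothetical userspace helper for illustration only, not the kernel's parser):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Split a "UUID:idx:params" string into its three fields.
 * uuid must hold at least 37 bytes, params at least 64.
 * Returns 0 on success, -1 on malformed input.
 * Assumes params itself contains no ':' (true for the examples above). */
static int parse_mdev_create_arg(const char *buf, char uuid[37],
                                 unsigned int *idx, char params[64])
{
    /* %36[^:] reads the 36-character UUID up to the first colon. */
    if (sscanf(buf, "%36[^:]:%u:%63[^\n]", uuid, idx, params) != 3)
        return -1;
    if (strlen(uuid) != 36)        /* a canonical UUID is 36 chars */
        return -1;
    return 0;
}
```

The mdev_destroy <UUID:idx> syntax is the same string minus the trailing params field.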
+
+Under mdev class sysfs /sys/class/mdev/:
+----------------------------------------
+
+* mdev_start: (write only)
+	This triggers the registration interface to notify the driver to
+	commit mediated device resources for the target VM.
+	mdev_start is a synchronous call; a successful return indicates
+	that all the requested mdev resources have been fully committed,
+	and the VMM should continue.
+	Input syntax: <UUID>
+	Example:
+	# echo "12345678-1234-1234-1234-123456789abc" >
+						/sys/class/mdev/mdev_start
+
+* mdev_stop: (write only)
+	This triggers the registration interface to notify the driver to
+	release the resources of the target VM's mediated device.
+	Input syntax: <UUID>
+	Example:
+	# echo "12345678-1234-1234-1234-123456789abc" >
+						 /sys/class/mdev/mdev_stop
+
+Mediated device Hotplug:
+-----------------------
+
+To support mediated device hotplug, <mdev_create> and <mdev_destroy> can be
+accessed during VM runtime, and the corresponding registration callbacks are
+invoked to allow the driver to support hotplug.
+
+Translation APIs for Mediated device
+------------------------------------------------------------------------------
+
+The following APIs are provided for user pfn to host pfn translation in the
+VFIO driver:
+
+extern long vfio_pin_pages(struct mdev_device *mdev, unsigned long *user_pfn,
+                           long npage, int prot, unsigned long *phys_pfn);
+
+extern long vfio_unpin_pages(struct mdev_device *mdev, unsigned long *pfn,
+                             long npage);
+
+These functions call back into the backend IOMMU module through two callbacks
+of struct vfio_iommu_driver_ops, pin_pages and unpin_pages [4]. Currently
+these are supported in the TYPE1 IOMMU module. To enable the same for other
+IOMMU backend modules, such as the PPC64 sPAPR module, they need to provide
+these two callback functions.
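The pin/unpin contract can be illustrated with a toy userspace model. Assumed behaviour only: `pin_pages_model`/`unpin_pages_model` are hypothetical stand-ins — the real functions take a struct mdev_device and go through the IOMMU backend's pin_pages/unpin_pages callbacks:

```c
#include <assert.h>

/* Toy model: translate an array of user pfns to "host" pfns and report
 * how many entries were pinned.  Here the translation is a fixed offset;
 * a real IOMMU backend walks its own tables and takes page references. */

#define HOST_PFN_OFFSET 0x100000UL   /* arbitrary fake translation offset */

static long pin_pages_model(const unsigned long *user_pfn, long npage,
                            unsigned long *phys_pfn)
{
    if (npage <= 0)
        return -1;
    for (long i = 0; i < npage; i++)
        phys_pfn[i] = user_pfn[i] + HOST_PFN_OFFSET;
    return npage;   /* number of pages actually pinned */
}

static long unpin_pages_model(const unsigned long *pfn, long npage)
{
    (void)pfn;
    return npage < 0 ? -1 : npage;   /* nothing to release in the toy model */
}
```

As in the real API, callers should treat the return value as the count of pages successfully pinned, not assume all-or-nothing behaviour.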
+
+References
+-------------------------------------------------------------------------------
+
+[1] See Documentation/vfio.txt for more information on VFIO.
+[2] struct mdev_driver in include/linux/mdev.h
+[3] struct parent_ops in include/linux/mdev.h
+[4] struct vfio_iommu_driver_ops in include/linux/vfio.h
+
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
  2016-08-03 19:03   ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-03 21:03     ` kbuild test robot
  -1 siblings, 0 replies; 100+ messages in thread
From: kbuild test robot @ 2016-08-03 21:03 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: kevin.tian, cjia, kvm, qemu-devel, Kirti Wankhede, jike.song,
	alex.williamson, kbuild-all, pbonzini, bjsdjshi, kraxel

[-- Attachment #1: Type: text/plain, Size: 3143 bytes --]

Hi Kirti,

[auto build test WARNING on vfio/next]
[also build test WARNING on v4.7 next-20160803]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kirti-Wankhede/Add-Mediated-device-support/20160804-032209
base:   https://github.com/awilliam/linux-vfio.git next
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   drivers/vfio/mdev/vfio_mpci.c: In function 'mdev_dev_mmio_fault':
>> drivers/vfio/mdev/vfio_mpci.c:384:17: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
     u64 virtaddr = (u64)vmf->virtual_address;
                    ^
   In file included from drivers/vfio/mdev/vfio_mpci.c:19:0:
>> include/linux/vfio.h:23:46: warning: right shift count >= width of type [-Wshift-count-overflow]
    #define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
                                                 ^
>> drivers/vfio/mdev/vfio_mpci.c:424:11: note: in expansion of macro 'VFIO_PCI_OFFSET_TO_INDEX'
      index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);
              ^~~~~~~~~~~~~~~~~~~~~~~~

vim +384 drivers/vfio/mdev/vfio_mpci.c

   378	static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
   379	{
   380		int ret;
   381		struct vfio_mdev *vmdev = vma->vm_private_data;
   382		struct mdev_device *mdev;
   383		struct parent_device *parent;
 > 384		u64 virtaddr = (u64)vmf->virtual_address;
   385		unsigned long req_size, pgoff = 0;
   386		pgprot_t pg_prot;
   387		unsigned int index;
   388	
   389		if (!vmdev && !vmdev->mdev)
   390			return -EINVAL;
   391	
   392		mdev = vmdev->mdev;
   393		parent  = mdev->parent;
   394	
   395		pg_prot  = vma->vm_page_prot;
   396	
   397		if (parent->ops->validate_map_request) {
   398			u64 offset;
   399			loff_t pos;
   400	
   401			offset   = virtaddr - vma->vm_start;
   402			req_size = vma->vm_end - virtaddr;
   403			pos = (vma->vm_pgoff << PAGE_SHIFT) + offset;
   404	
   405			ret = parent->ops->validate_map_request(mdev, pos, &virtaddr,
   406							&pgoff, &req_size, &pg_prot);
   407			if (ret)
   408				return ret;
   409	
   410			/*
   411			 * Verify pgoff and req_size are valid and virtaddr is within
   412			 * vma range
   413			 */
   414			if (!pgoff || !req_size || (virtaddr < vma->vm_start) ||
   415			    ((virtaddr + req_size) >= vma->vm_end))
   416				return -EINVAL;
   417		} else {
   418			struct pci_dev *pdev;
   419	
   420			virtaddr = vma->vm_start;
   421			req_size = vma->vm_end - vma->vm_start;
   422	
   423			pdev = to_pci_dev(parent->dev);
 > 424			index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);
   425			pgoff = pci_resource_start(pdev, index) >> PAGE_SHIFT;
   426		}
   427	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 55108 bytes --]

^ permalink raw reply	[flat|nested] 100+ messages in thread


* Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
  2016-08-03 19:03   ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-04  0:19     ` kbuild test robot
  -1 siblings, 0 replies; 100+ messages in thread
From: kbuild test robot @ 2016-08-04  0:19 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: kevin.tian, cjia, kvm, qemu-devel, Kirti Wankhede, jike.song,
	alex.williamson, kbuild-all, pbonzini, bjsdjshi, kraxel

[-- Attachment #1: Type: text/plain, Size: 2869 bytes --]

Hi Kirti,

[auto build test WARNING on vfio/next]
[also build test WARNING on v4.7 next-20160803]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kirti-Wankhede/Add-Mediated-device-support/20160804-032209
base:   https://github.com/awilliam/linux-vfio.git next
config: i386-allyesconfig (attached as .config)
compiler: gcc-6 (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All warnings (new ones prefixed by >>):

   drivers/vfio/mdev/vfio_mpci.c: In function 'mdev_dev_mmio_fault':
   drivers/vfio/mdev/vfio_mpci.c:384:17: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
     u64 virtaddr = (u64)vmf->virtual_address;
                    ^
>> drivers/vfio/mdev/vfio_mpci.c:424:32: warning: right shift count >= width of type [-Wshift-count-overflow]
      index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);
                                   ^~

vim +424 drivers/vfio/mdev/vfio_mpci.c

   378	static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
   379	{
   380		int ret;
   381		struct vfio_mdev *vmdev = vma->vm_private_data;
   382		struct mdev_device *mdev;
   383		struct parent_device *parent;
 > 384		u64 virtaddr = (u64)vmf->virtual_address;
   385		unsigned long req_size, pgoff = 0;
   386		pgprot_t pg_prot;
   387		unsigned int index;
   388	
   389		if (!vmdev && !vmdev->mdev)
   390			return -EINVAL;
   391	
   392		mdev = vmdev->mdev;
   393		parent  = mdev->parent;
   394	
   395		pg_prot  = vma->vm_page_prot;
   396	
   397		if (parent->ops->validate_map_request) {
   398			u64 offset;
   399			loff_t pos;
   400	
   401			offset   = virtaddr - vma->vm_start;
   402			req_size = vma->vm_end - virtaddr;
   403			pos = (vma->vm_pgoff << PAGE_SHIFT) + offset;
   404	
   405			ret = parent->ops->validate_map_request(mdev, pos, &virtaddr,
   406							&pgoff, &req_size, &pg_prot);
   407			if (ret)
   408				return ret;
   409	
   410			/*
   411			 * Verify pgoff and req_size are valid and virtaddr is within
   412			 * vma range
   413			 */
   414			if (!pgoff || !req_size || (virtaddr < vma->vm_start) ||
   415			    ((virtaddr + req_size) >= vma->vm_end))
   416				return -EINVAL;
   417		} else {
   418			struct pci_dev *pdev;
   419	
   420			virtaddr = vma->vm_start;
   421			req_size = vma->vm_end - vma->vm_start;
   422	
   423			pdev = to_pci_dev(parent->dev);
 > 424			index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);
   425			pgoff = pci_resource_start(pdev, index) >> PAGE_SHIFT;
   426		}
   427	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 54451 bytes --]

^ permalink raw reply	[flat|nested] 100+ messages in thread


* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-03 19:03   ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-04  7:21     ` Tian, Kevin
  -1 siblings, 0 replies; 100+ messages in thread
From: Tian, Kevin @ 2016-08-04  7:21 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: bjsdjshi, Song, Jike, qemu-devel, kvm

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Thursday, August 04, 2016 3:04 AM
> 
> 
> 2. Physical device driver interface
> This interface provides vendor driver the set APIs to manage physical
> device related work in their own driver. APIs are :
> - supported_config: provide supported configuration list by the vendor
> 		    driver
> - create: to allocate basic resources in vendor driver for a mediated
> 	  device.
> - destroy: to free resources in vendor driver when mediated device is
> 	   destroyed.
> - reset: to free and reallocate resources in vendor driver during reboot

Currently I saw 'reset' callback only invoked from VFIO ioctl path. Do 
you think whether it makes sense to expose a sysfs 'reset' node too,
similar to what people see under a PCI device node?

> - start: to initiate mediated device initialization process from vendor
> 	 driver
> - shutdown: to teardown mediated device resources during teardown.

I think 'shutdown' should be 'stop' based on actual code.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 100+ messages in thread


* RE: [PATCH v6 4/4] docs: Add Documentation for Mediated devices
  2016-08-03 19:03   ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-04  7:31     ` Tian, Kevin
  -1 siblings, 0 replies; 100+ messages in thread
From: Tian, Kevin @ 2016-08-04  7:31 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Song, Jike, bjsdjshi

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Thursday, August 04, 2016 3:04 AM
> 
> +
> +* mdev_supported_types: (read only)
> +    List the current supported mediated device types and its details.
> +
> +* mdev_create: (write only)
> +	Create a mediated device on target physical device.
> +	Input syntax: <UUID:idx:params>
> +	where,
> +		UUID: mediated device's UUID
> +		idx: mediated device index inside a VM

Is above description too specific to VM usage? mediated device can
be used by other user components too, e.g. an user space driver.
Better to make the description general (you can list above as one
example).

Also I think calling it idx a bit limited, which means only numbers
possible. Is it more flexible to call it 'handle' and then any string
can be used here?

> +		params: extra parameters required by driver
> +	Example:
> +	# echo "12345678-1234-1234-1234-123456789abc:0:0" >
> +				 /sys/bus/pci/devices/0000\:05\:00.0/mdev_create
> +
> +* mdev_destroy: (write only)
> +	Destroy a mediated device on a target physical device.
> +	Input syntax: <UUID:idx>
> +	where,
> +		UUID: mediated device's UUID
> +		idx: mediated device index inside a VM
> +	Example:
> +	# echo "12345678-1234-1234-1234-123456789abc:0" >
> +			       /sys/bus/pci/devices/0000\:05\:00.0/mdev_destroy
> +
> +Under mdev class sysfs /sys/class/mdev/:
> +----------------------------------------
> +
> +* mdev_start: (write only)
> +	This trigger the registration interface to notify the driver to
> +	commit mediated device resource for target VM.
> +	The mdev_start function is a synchronized call, successful return of
> +	this call will indicate all the requested mdev resource has been fully
> +	committed, the VMM should continue.
> +	Input syntax: <UUID>
> +	Example:
> +	# echo "12345678-1234-1234-1234-123456789abc" >
> +						/sys/class/mdev/mdev_start
> +
> +* mdev_stop: (write only)
> +	This trigger the registration interface to notify the driver to
> +	release resources of mediated device of target VM.
> +	Input syntax: <UUID>
> +	Example:
> +	# echo "12345678-1234-1234-1234-123456789abc" >
> +						 /sys/class/mdev/mdev_stop

I think it's clearer to create a node per mdev under /sys/class/mdev,
and then move start/stop as attributes under each mdev node, e.g:

echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start

Doing it this way is more extensible for adding more capabilities under
each mdev node, and different capability sets may be implemented
for them.

> +
> +Mediated device Hotplug:
> +-----------------------
> +
> +To support mediated device hotplug, <mdev_create> and <mdev_destroy> can be
> +accessed during VM runtime, and the corresponding registration callback is
> +invoked to allow driver to support hotplug.

'hotplug' is an action on the mdev user (e.g. the VM), not on the mdev itself.
You can always create an mdev as long as the physical device has enough
available resources to support the requested config. Destroying an mdev
may fail if there is still a user on the target mdev.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-04  7:21     ` [Qemu-devel] " Tian, Kevin
@ 2016-08-05  6:13       ` Kirti Wankhede
  -1 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-05  6:13 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Song, Jike, bjsdjshi



On 8/4/2016 12:51 PM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>> Sent: Thursday, August 04, 2016 3:04 AM
>>
>>
>> 2. Physical device driver interface
>> This interface provides vendor driver the set APIs to manage physical
>> device related work in their own driver. APIs are :
>> - supported_config: provide supported configuration list by the vendor
>> 		    driver
>> - create: to allocate basic resources in vendor driver for a mediated
>> 	  device.
>> - destroy: to free resources in vendor driver when mediated device is
>> 	   destroyed.
>> - reset: to free and reallocate resources in vendor driver during reboot
> 
> Currently I saw 'reset' callback only invoked from VFIO ioctl path. Do 
> you think whether it makes sense to expose a sysfs 'reset' node too,
> similar to what people see under a PCI device node?
> 

Not all vendor drivers might support resetting an mdev from sysfs. But
those that want to support it can expose a 'reset' node using the
'mdev_attr_groups' of 'struct parent_ops'.


>> - start: to initiate mediated device initialization process from vendor
>> 	 driver
>> - shutdown: to teardown mediated device resources during teardown.
> 
> I think 'shutdown' should be 'stop' based on actual code.
>

Thanks for catching that, yes I missed updating it here.

Thanks,
Kirti

> Thanks
> Kevin
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 4/4] docs: Add Documentation for Mediated devices
  2016-08-04  7:31     ` [Qemu-devel] " Tian, Kevin
@ 2016-08-05  7:45       ` Kirti Wankhede
  -1 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-05  7:45 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, pbonzini, kraxel, cjia
  Cc: bjsdjshi, Song, Jike, qemu-devel, kvm



On 8/4/2016 1:01 PM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>> Sent: Thursday, August 04, 2016 3:04 AM
>>
>> +
>> +* mdev_supported_types: (read only)
>> +    List the current supported mediated device types and its details.
>> +
>> +* mdev_create: (write only)
>> +	Create a mediated device on target physical device.
>> +	Input syntax: <UUID:idx:params>
>> +	where,
>> +		UUID: mediated device's UUID
>> +		idx: mediated device index inside a VM
> 
> Is above description too specific to VM usage? mediated device can
> be used by other user components too, e.g. an user space driver.
> Better to make the description general (you can list above as one
> example).
>
Ok. I'll change it to VM or user space component.

> Also I think calling it idx a bit limited, which means only numbers
> possible. Is it more flexible to call it 'handle' and then any string
> can be used here?
> 

The index is an integer; it is used to keep track of the mediated device
instance number created for a user space component or VM.

>> +		params: extra parameters required by driver
>> +	Example:
>> +	# echo "12345678-1234-1234-1234-123456789abc:0:0" >
>> +				 /sys/bus/pci/devices/0000\:05\:00.0/mdev_create
>> +
>> +* mdev_destroy: (write only)
>> +	Destroy a mediated device on a target physical device.
>> +	Input syntax: <UUID:idx>
>> +	where,
>> +		UUID: mediated device's UUID
>> +		idx: mediated device index inside a VM
>> +	Example:
>> +	# echo "12345678-1234-1234-1234-123456789abc:0" >
>> +			       /sys/bus/pci/devices/0000\:05\:00.0/mdev_destroy
>> +
>> +Under mdev class sysfs /sys/class/mdev/:
>> +----------------------------------------
>> +
>> +* mdev_start: (write only)
>> +	This trigger the registration interface to notify the driver to
>> +	commit mediated device resource for target VM.
>> +	The mdev_start function is a synchronized call, successful return of
>> +	this call will indicate all the requested mdev resource has been fully
>> +	committed, the VMM should continue.
>> +	Input syntax: <UUID>
>> +	Example:
>> +	# echo "12345678-1234-1234-1234-123456789abc" >
>> +						/sys/class/mdev/mdev_start
>> +
>> +* mdev_stop: (write only)
>> +	This trigger the registration interface to notify the driver to
>> +	release resources of mediated device of target VM.
>> +	Input syntax: <UUID>
>> +	Example:
>> +	# echo "12345678-1234-1234-1234-123456789abc" >
>> +						 /sys/class/mdev/mdev_stop
> 
> I think it's clearer to create a node per mdev under /sys/class/mdev,
> and then move start/stop as attributes under each mdev node, e.g:
> 
> echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> 

To support multiple mdev devices in one VM or user space driver, the
process is to create or configure all mdev devices for that VM or user
space driver and then issue a single 'start', which means all requested
mdev resources are committed.

> Doing this way is more extensible to add more capabilities under
> each mdev node, and different capability set may be implemented
> for them.
> 

You can add extra capabilities for each mdev device node using the
'mdev_attr_groups' of 'struct parent_ops' from the vendor driver.


>> +
>> +Mediated device Hotplug:
>> +-----------------------
>> +
>> +To support mediated device hotplug, <mdev_create> and <mdev_destroy> can be
>> +accessed during VM runtime, and the corresponding registration callback is
>> +invoked to allow driver to support hotplug.
> 
> 'hotplug' is an action on the mdev user (e.g. the VM), not on mdev itself.
> You can always create a mdev as long as physical device has enough
> available resource to support requested config. Destroying a mdev 
> may fail if there is still user on target mdev.
>

The point here is: the user needs to pass a UUID to mdev_create, and the
device will be created even while the VM or user space driver is running.

Thanks,
Kirti

> Thanks
> Kevin
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-03 19:03   ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-09 19:00     ` Alex Williamson
  -1 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-09 19:00 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Thu, 4 Aug 2016 00:33:51 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Design for Mediated Device Driver:
> Main purpose of this driver is to provide a common interface for mediated
> device management that can be used by different drivers of different
> devices.
> 
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
> 
> Below is the high Level block diagram, with Nvidia, Intel and IBM devices
> as example, since these are the devices which are going to actively use
> this module as of now.
> 
>  +---------------+
>  |               |
>  | +-----------+ |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |  mdev     | +------------------------>+              |<-> VFIO user
>  | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
>  | |  driver   | |                         |              |
>  | |           | |                         +--------------+
>  | |           | |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |           | +------------------------>+              |<-> VFIO user
>  | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
>  |               |                         |              |
>  |  MDEV CORE    |                         +--------------+
>  |   MODULE      |
>  |   mdev.ko     |
>  | +-----------+ |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         |  nvidia.ko   |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | | Physical  | |
>  | |  device   | |  mdev_register_device() +--------------+
>  | | interface | |<------------------------+              |
>  | |           | |                         |  i915.ko     |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | |           | |
>  | |           | |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         | ccw_device.ko|<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | +-----------+ |
>  +---------------+
> 
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
> 
> /**
>   * struct mdev_driver - Mediated device's driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove:called when device removed
>   * @match: called when new device or driver is added for this bus.
> 	    Return 1 if given device can be handled by given driver and
> 	    zero otherwise.
>   * @driver:device driver structure
>   *
>   **/
> struct mdev_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
>          int  (*match)(struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
> 
> Mediated device's driver for mdev should use this interface to register
> with Core driver. With this, mediated devices driver for such devices is
> responsible to add mediated device to VFIO group.
> 
> 2. Physical device driver interface
> This interface provides vendor driver the set APIs to manage physical
> device related work in their own driver. APIs are :
> - supported_config: provide supported configuration list by the vendor
> 		    driver
> - create: to allocate basic resources in vendor driver for a mediated
> 	  device.
> - destroy: to free resources in vendor driver when mediated device is
> 	   destroyed.
> - reset: to free and reallocate resources in vendor driver during reboot
> - start: to initiate mediated device initialization process from vendor
> 	 driver
> - shutdown: to teardown mediated device resources during teardown.
> - read : read emulation callback.
> - write: write emulation callback.
> - set_irqs: send interrupt configuration information that VMM sets.
> - get_region_info: to provide region size and its flags for the mediated
> 		   device.
> - validate_map_request: to validate remap pfn request.
> 
> This registration interface should be used by vendor drivers to register
> each physical device to mdev core driver.
> Locks to serialize above callbacks are removed. If required, vendor driver
> can have locks to serialize above APIs in their driver.
> 
> Added support to keep track of physical mappings for each mdev device.
> APIs to be used by mediated device bus driver to add and delete mappings to
> tracking logic:
> int mdev_add_phys_mapping(struct mdev_device *mdev,
>                           struct address_space *mapping,
>                           unsigned long addr, unsigned long size)
> void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr)
> 
> API to be used by vendor driver to invalidate mapping:
> int mdev_device_invalidate_mapping(struct mdev_device *mdev,
>                                    unsigned long addr, unsigned long size)
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I73a5084574270b14541c529461ea2f03c292d510
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/Makefile            |   1 +
>  drivers/vfio/mdev/Kconfig        |  12 +
>  drivers/vfio/mdev/Makefile       |   5 +
>  drivers/vfio/mdev/mdev_core.c    | 676 +++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev_driver.c  | 142 ++++++++
>  drivers/vfio/mdev/mdev_private.h |  33 ++
>  drivers/vfio/mdev/mdev_sysfs.c   | 269 ++++++++++++++++
>  include/linux/mdev.h             | 236 ++++++++++++++
>  9 files changed, 1375 insertions(+)
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev_core.c
>  create mode 100644 drivers/vfio/mdev/mdev_driver.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>  create mode 100644 include/linux/mdev.h
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
>  
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>  source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..4a23c13b6be4 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_VFIO_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..a34fbc66f92f
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,12 @@
> +
> +config VFIO_MDEV
> +    tristate "Mediated device driver framework"
> +    depends on VFIO
> +    default n
> +    help
> +        Provides a framework to virtualize device.
> +	See Documentation/vfio-mediated-device.txt for more details.
> +
> +        If you don't know what do here, say N.
> +
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..56a75e689582
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,5 @@
> +
> +mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
> +
> +obj-$(CONFIG_VFIO_MDEV) += mdev.o
> +
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> new file mode 100644
> index 000000000000..90ff073abfce
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -0,0 +1,676 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/sched.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +#define MDEV_CLASS_NAME		"mdev"
> +
> +static LIST_HEAD(parent_list);
> +static DEFINE_MUTEX(parent_list_lock);
> +
> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}
> +
> +/* Should be called holding parent->mdev_list_lock */

I often like to prepend "__" onto the name of functions like this to
signal a special calling convention.

> +static struct mdev_device *find_mdev_device(struct parent_device *parent,
> +					    uuid_le uuid, int instance)
> +{
> +	struct mdev_device *mdev;
> +
> +	list_for_each_entry(mdev, &parent->mdev_list, next) {
> +		if ((uuid_le_cmp(mdev->uuid, uuid) == 0) &&
> +		    (mdev->instance == instance))
> +			return mdev;
> +	}
> +	return NULL;
> +}
> +
> +/* Should be called holding parent_list_lock */
> +static struct parent_device *find_parent_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +
> +	list_for_each_entry(parent, &parent_list, next) {
> +		if (parent->dev == dev)
> +			return parent;
> +	}
> +	return NULL;
> +}
> +
> +static void mdev_release_parent(struct kref *kref)
> +{
> +	struct parent_device *parent = container_of(kref, struct parent_device,
> +						    ref);
> +	kfree(parent);
> +}
> +
> +static
> +inline struct parent_device *mdev_get_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_get(&parent->ref);
> +
> +	return parent;
> +}
> +
> +static inline void mdev_put_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_put(&parent->ref, mdev_release_parent);
> +}
> +
> +static struct parent_device *mdev_get_parent_by_dev(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	mutex_lock(&parent_list_lock);
> +	list_for_each_entry(p, &parent_list, next) {
> +		if (p->dev == dev) {
> +			parent = mdev_get_parent(p);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&parent_list_lock);
> +	return parent;

Use what you've created:

{
	struct parent_device *parent;

	mutex_lock(&parent_list_lock);
	parent = mdev_get_parent(find_parent_device(dev));
	mutex_unlock(&parent_list_lock);

	return parent;
}

> +}
> +
> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	ret = parent->ops->create(mdev, mdev_params);
> +	if (ret)
> +		return ret;
> +
> +	ret = mdev_add_attribute_group(&mdev->dev,
> +					parent->ops->mdev_attr_groups);
> +	if (ret)
> +		parent->ops->destroy(mdev);
> +
> +	return ret;
> +}
> +
> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	/*
> +	 * If vendor driver doesn't return success that means vendor
> +	 * driver doesn't support hot-unplug
> +	 */
> +	ret = parent->ops->destroy(mdev);
> +	if (ret && !force)
> +		return -EBUSY;

This still seems troublesome; I'm not sure why we don't just require
hot-unplug support.  Without it, we seem to have a limbo state where a
device exists, but not fully.

> +
> +	mdev_remove_attribute_group(&mdev->dev,
> +				    parent->ops->mdev_attr_groups);
> +
> +	return ret;
> +}
> +
> +static void mdev_release_device(struct kref *kref)
> +{
> +	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
> +	struct parent_device *parent = mdev->parent;
> +
> +	list_del(&mdev->next);
> +	mutex_unlock(&parent->mdev_list_lock);

Maybe worthy of a short comment to more obviously match this unlock to
the kref_put_mutex() below.

> +
> +	device_unregister(&mdev->dev);
> +	wake_up(&parent->release_done);
> +	mdev_put_parent(parent);
> +}
> +
> +struct mdev_device *mdev_get_device(struct mdev_device *mdev)
> +{
> +	kref_get(&mdev->ref);
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device);

Is the intention here that the caller already has a reference to mdev
and wants to get another?  Otherwise I only see cases where locking is
used to get this reference.  There's potential to misuse this, if not
outright abuse it, which worries me.  A reference cannot be
spontaneously generated, it needs to be sourced from somewhere.

> +
> +void mdev_put_device(struct mdev_device *mdev)
> +{
> +	struct parent_device *parent = mdev->parent;
> +
> +	kref_put_mutex(&mdev->ref, mdev_release_device,
> +		       &parent->mdev_list_lock);
> +}
> +EXPORT_SYMBOL(mdev_put_device);
> +
> +/*
> + * Find first mediated device from given uuid and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +static struct mdev_device *mdev_get_first_device_by_uuid(uuid_le uuid)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_list_lock);
> +	list_for_each_entry(parent, &parent_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (uuid_le_cmp(p->uuid, uuid) == 0) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_list_lock);
> +	return mdev;
> +}

This is used later by mdev_device_start() and mdev_device_stop() to get
the parent_device so it can call the start and stop ops callbacks
respectively.  That seems to imply that all of the instances for a given
uuid come from the same parent_device.  Where is that enforced?  I'm
still having a hard time buying into the uuid+instance plan when it
seems like each mdev_device should have an actual unique uuid.
Userspace tools can figure out which uuids to start for a given user, I
don't see much value in collecting them to instances within a uuid.

> +
> +/*
> + * Find mediated device from given iommu_group and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_list_lock);
> +	list_for_each_entry(parent, &parent_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (!p->group)
> +				continue;
> +
> +			if (iommu_group_id(p->group) == iommu_group_id(group)) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_list_lock);
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device_by_group);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	/* check for mandatory ops */
> +	if (!ops->create || !ops->destroy)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_list_lock);
> +
> +	/* Check for duplicate */
> +	parent = find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +	list_add(&parent->next, &parent_list);
> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +	mutex_init(&parent->mdev_list_lock);
> +	INIT_LIST_HEAD(&parent->mdev_list);
> +	init_waitqueue_head(&parent->release_done);
> +	mutex_unlock(&parent_list_lock);
> +
> +	ret = mdev_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_lock(&parent_list_lock);
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a parent device
> + * @dev: device structure representing parent device.
> + *
> + * Remove device from list of registered parent devices. Give a chance to free
> + * existing mediated devices for given device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +	struct mdev_device *mdev, *n;

Above, *p was used as the temp pointer; using it here too would be more
consistent.

> +	int ret;
> +
> +	mutex_lock(&parent_list_lock);
> +	parent = find_parent_device(dev);
> +
> +	if (!parent) {
> +		mutex_unlock(&parent_list_lock);
> +		return;
> +	}
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	/*
> +	 * Remove parent from the list and remove create and destroy sysfs

Quoting "create" and "destroy" would make this a bit more readable:

	Remove parent from the list and remove "create" and "destroy"
	sysfs...

Took me a couple reads to figure out "remove create" wasn't a typo.

> +	 * files so that no new mediated device could be created for this parent
> +	 */
> +	list_del(&parent->next);
> +	mdev_remove_sysfs_files(dev);
> +	mutex_unlock(&parent_list_lock);
> +
> +	mdev_remove_attribute_group(dev,
> +				    parent->ops->dev_attr_groups);
> +

Why do we need to remove sysfs files under the parent_list_lock?

> +	mutex_lock(&parent->mdev_list_lock);
> +	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
> +		mdev_device_destroy_ops(mdev, true);
> +		mutex_unlock(&parent->mdev_list_lock);
> +		mdev_put_device(mdev);
> +		mutex_lock(&parent->mdev_list_lock);

*cringe*  Any time we need to release the list lock inside the
traversal makes me nervous.  What about using list_first_entry() since
I don't think using list_for_each_entry_safe() really makes it safe
from concurrent operations on the list once we drop that lock.

> +	}
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	do {
> +		ret = wait_event_interruptible_timeout(parent->release_done,
> +				list_empty(&parent->mdev_list), HZ * 10);
> +		if (ret == -ERESTARTSYS) {
> +			dev_warn(dev, "Mediated devices are in use, task"
> +				      " \"%s\" (%d) "
> +				      "blocked until all are released",
> +				      current->comm, task_pid_nr(current));
> +		}
> +	} while (ret <= 0);
> +
> +	mdev_put_parent(parent);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev_sysfs
> + */
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev);
> +}
> +
> +int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +		       char *mdev_params)
> +{
> +	int ret;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	/* Check for duplicate */
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (mdev) {
> +		ret = -EEXIST;
> +		goto create_err;
> +	}
> +
> +	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> +	if (!mdev) {
> +		ret = -ENOMEM;
> +		goto create_err;
> +	}
> +
> +	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> +	mdev->instance = instance;
> +	mdev->parent = parent;
> +	kref_init(&mdev->ref);
> +
> +	mdev->dev.parent  = dev;
> +	mdev->dev.bus     = &mdev_bus_type;
> +	mdev->dev.release = mdev_device_release;
> +	dev_set_name(&mdev->dev, "%pUl-%d", uuid.b, instance);
> +
> +	ret = device_register(&mdev->dev);
> +	if (ret) {
> +		put_device(&mdev->dev);
> +		goto create_err;
> +	}
> +
> +	ret = mdev_device_create_ops(mdev, mdev_params);
> +	if (ret)
> +		goto create_failed;
> +
> +	list_add(&mdev->next, &parent->mdev_list);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	dev_dbg(&mdev->dev, "MDEV: created\n");
> +
> +	return ret;
> +
> +create_failed:
> +	device_unregister(&mdev->dev);
> +
> +create_err:
> +	mutex_unlock(&parent->mdev_list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance)
> +{
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	int ret;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (!mdev) {
> +		ret = -EINVAL;

-ENODEV?

> +		goto destroy_err;
> +	}
> +
> +	ret = mdev_device_destroy_ops(mdev, false);
> +	if (ret)
> +		goto destroy_err;
> +
> +	mutex_unlock(&parent->mdev_list_lock);
> +	mdev_put_device(mdev);
> +
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +destroy_err:
> +	mutex_unlock(&parent->mdev_list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_invalidate_mapping(struct mdev_device *mdev,
> +				   unsigned long addr, unsigned long size)
> +{
> +	int ret = -EINVAL;
> +	struct mdev_phys_mapping *phys_mappings;
> +	struct addr_desc *addr_desc;
> +
> +	if (!mdev || !mdev->phys_mappings.mapping)
> +		return ret;
> +
> +	phys_mappings = &mdev->phys_mappings;
> +
> +	mutex_lock(&phys_mappings->addr_desc_list_lock);
> +
> +	list_for_each_entry(addr_desc, &phys_mappings->addr_desc_list, next) {
> +
> +		if ((addr > addr_desc->start) &&
> +		    (addr + size < addr_desc->start + addr_desc->size)) {

This looks incomplete, minimally I think these should be >= and <=, but
that still only covers fully enclosed invalidation ranges.  Do we need
to support partial invalidations?

> +			unmap_mapping_range(phys_mappings->mapping,
> +					    addr, size, 0);
> +			ret = 0;
> +			goto unlock_exit;

If partial overlaps can occur, we'll need an exhaustive search.

> +		}
> +	}
> +
> +unlock_exit:
> +	mutex_unlock(&phys_mappings->addr_desc_list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_device_invalidate_mapping);
> +
> +/* Sanity check for the physical mapping list for mediated device */
> +
> +int mdev_add_phys_mapping(struct mdev_device *mdev,
> +			  struct address_space *mapping,
> +			  unsigned long addr, unsigned long size)
> +{
> +	struct mdev_phys_mapping *phys_mappings;
> +	struct addr_desc *addr_desc, *new_addr_desc;
> +	int ret = 0;
> +
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	phys_mappings = &mdev->phys_mappings;
> +	if (phys_mappings->mapping && (mapping != phys_mappings->mapping))
> +		return -EINVAL;
> +
> +	if (!phys_mappings->mapping) {
> +		phys_mappings->mapping = mapping;
> +		mutex_init(&phys_mappings->addr_desc_list_lock);
> +		INIT_LIST_HEAD(&phys_mappings->addr_desc_list);
> +	}

This looks racy, should we be acquiring the mutex earlier?

> +
> +	mutex_lock(&phys_mappings->addr_desc_list_lock);
> +
> +	list_for_each_entry(addr_desc, &phys_mappings->addr_desc_list, next) {
> +		if ((addr + size < addr_desc->start) ||
> +		    (addr_desc->start + addr_desc->size) < addr)

<= on both, I think

> +			continue;
> +		else {
> +			/* should be no overlap */
> +			ret = -EINVAL;
> +			goto mapping_exit;
> +		}
> +	}
> +
> +	/* add the new entry to the list */
> +	new_addr_desc = kzalloc(sizeof(*new_addr_desc), GFP_KERNEL);
> +
> +	if (!new_addr_desc) {
> +		ret = -ENOMEM;
> +		goto mapping_exit;
> +	}
> +
> +	new_addr_desc->start = addr;
> +	new_addr_desc->size = size;
> +	list_add(&new_addr_desc->next, &phys_mappings->addr_desc_list);
> +
> +mapping_exit:
> +	mutex_unlock(&phys_mappings->addr_desc_list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_add_phys_mapping);
> +
> +void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr)
> +{
> +	struct mdev_phys_mapping *phys_mappings;
> +	struct addr_desc *addr_desc;
> +
> +	if (!mdev)
> +		return;
> +
> +	phys_mappings = &mdev->phys_mappings;
> +
> +	mutex_lock(&phys_mappings->addr_desc_list_lock);
> +	list_for_each_entry(addr_desc, &phys_mappings->addr_desc_list, next) {
> +		if (addr_desc->start == addr) {
> +			list_del(&addr_desc->next);
> +			kfree(addr_desc);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&phys_mappings->addr_desc_list_lock);
> +}
> +EXPORT_SYMBOL(mdev_del_phys_mapping);
> +
> +void mdev_device_supported_config(struct device *dev, char *str)
> +{
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +
> +	if (parent) {
> +		if (parent->ops->supported_config)
> +			parent->ops->supported_config(parent->dev, str);
> +		mdev_put_parent(parent);
> +	}
> +}
> +
> +int mdev_device_start(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	if (parent->ops->start)
> +		ret = parent->ops->start(mdev->uuid);

Assumes uuids do not span parent_devices?

> +
> +	if (ret)
> +		pr_err("mdev_start failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
> +
> +	mdev_put_device(mdev);
> +
> +	return ret;
> +}
> +
> +int mdev_device_stop(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	if (parent->ops->stop)
> +		ret = parent->ops->stop(mdev->uuid);
> +
> +	if (ret)
> +		pr_err("mdev stop failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +}
> +
> +static struct class mdev_class = {
> +	.name		= MDEV_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= mdev_class_attrs,
> +};
> +
> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	ret = class_register(&mdev_class);
> +	if (ret) {
> +		pr_err("Failed to register mdev class\n");
> +		return ret;
> +	}
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		class_unregister(&mdev_class);
> +		return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	mdev_bus_unregister();
> +	class_unregister(&mdev_class);
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
> new file mode 100644
> index 000000000000..00680bd06224
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_driver.c
> @@ -0,0 +1,142 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdev_attach_iommu(struct mdev_device *mdev)
> +{
> +	int ret;
> +	struct iommu_group *group;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	ret = iommu_group_add_device(group, &mdev->dev);
> +	if (ret) {
> +		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdev->group = group;
> +
> +	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return ret;
> +}
> +
> +static void mdev_detach_iommu(struct mdev_device *mdev)
> +{
> +	iommu_group_remove_device(&mdev->dev);
> +	mdev->group = NULL;
> +	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdev_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	ret = mdev_attach_iommu(mdev);
> +	if (ret) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return ret;
> +	}
> +
> +	if (drv && drv->probe)
> +		ret = drv->probe(dev);
> +
> +	if (ret)
> +		mdev_detach_iommu(mdev);
> +
> +	return ret;
> +}
> +
> +static int mdev_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdev_detach_iommu(mdev);
> +
> +	return 0;
> +}
> +
> +static int mdev_match(struct device *dev, struct device_driver *driver)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(driver);
> +
> +	if (drv && drv->match)
> +		return drv->match(dev);
> +
> +	return 0;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.match		= mdev_match,
> +	.probe		= mdev_probe,
> +	.remove		= mdev_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/*
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: module owner of driver to be registered
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/*
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..ee2db61a8091
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,33 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +			char *mdev_params);
> +int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
> +void mdev_device_supported_config(struct device *dev, char *str);
> +int  mdev_device_start(uuid_le uuid);
> +int  mdev_device_stop(uuid_le uuid);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> new file mode 100644
> index 000000000000..e0457e68cf78
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -0,0 +1,269 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +/* Static functions */
> +
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	4096
> +
> +/* mdev sysfs Functions */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str, *ptr;
> +	ssize_t n;
> +
> +	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	ptr = str;
> +	mdev_device_supported_config(dev, str);
> +
> +	n = sprintf(buf, "%s\n", str);
> +	kfree(ptr);
> +
> +	return n;
> +}
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *instance_str, *mdev_params = NULL, *params = NULL;
> +	uuid_le uuid;
> +	uint32_t instance;
> +	int ret;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_create: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev instance not present %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	instance_str = strsep(&str, ":");
> +	if (!instance_str) {
> +		pr_err("mdev_create: Empty instance string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	ret = kstrtouint(instance_str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	if (str)
> +		params = mdev_params = kstrdup(str, GFP_KERNEL);
> +
> +	ret = uuid_le_to_bin(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_create: UUID parse error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	ret = mdev_device_create(dev, uuid, instance, mdev_params);
> +	if (ret)
> +		pr_err("mdev_create: Failed to create mdev device\n");
> +	else
> +		ret = count;
> +
> +create_error:
> +	kfree(params);
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{

I wonder if we should just have a "remove" file in sysfs under the
device.

> +	char *uuid_str, *str, *pstr;
> +	uuid_le uuid;
> +	unsigned int instance;
> +	int ret;
> +
> +	str = pstr = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	if (str == NULL) {
> +		pr_err("mdev_destroy: instance not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = kstrtouint(str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_destroy: instance parsing error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = uuid_le_to_bin(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_destroy: UUID parse error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = mdev_device_destroy(dev, uuid, instance);
> +	if (ret == 0)
> +		ret = count;
> +
> +destroy_error:
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
> +			 const char *buf, size_t count)
> +{
> +	char *uuid_str, *ptr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	ret = uuid_le_to_bin(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_start: UUID parse error %s\n", buf);
> +		goto start_error;
> +	}
> +
> +	ret = mdev_device_start(uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +start_error:
> +	kfree(ptr);
> +	return ret;
> +}
> +
> +ssize_t mdev_stop_store(struct class *class, struct class_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *uuid_str, *ptr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	ret = uuid_le_to_bin(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_stop: UUID parse error %s\n", buf);
> +		goto stop_error;
> +	}
> +
> +	ret = mdev_device_stop(uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +stop_error:
> +	kfree(ptr);
> +	return ret;
> +
> +}
> +
> +struct class_attribute mdev_class_attrs[] = {
> +	__ATTR_WO(mdev_start),
> +	__ATTR_WO(mdev_stop),
> +	__ATTR_NULL
> +};
> +
> +int mdev_create_sysfs_files(struct device *dev)
> +{
> +	int ret;
> +
> +	ret = sysfs_create_file(&dev->kobj,
> +				&dev_attr_mdev_supported_types.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_supported_types sysfs entry\n");
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_create sysfs entry\n");
> +		goto create_sysfs_failed;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_destroy sysfs entry\n");
> +		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	} else
> +		return ret;
> +
> +create_sysfs_failed:
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	return ret;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +}
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..0b41f301a9b7
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,236 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +#include <uapi/linux/vfio.h>
> +
> +struct parent_device;
> +
> +/*
> + * Mediated device
> + */
> +
> +struct addr_desc {
> +	unsigned long start;
> +	unsigned long size;
> +	struct list_head next;
> +};
> +
> +struct mdev_phys_mapping {
> +	struct address_space *mapping;
> +	struct list_head addr_desc_list;
> +	struct mutex addr_desc_list_lock;
> +};
> +
> +struct mdev_device {
> +	struct device		dev;
> +	struct parent_device	*parent;
> +	struct iommu_group	*group;
> +	uuid_le			uuid;
> +	uint32_t		instance;
> +	void			*driver_data;
> +
> +	/* internal only */
> +	struct kref		ref;
> +	struct list_head	next;
> +
> +	struct mdev_phys_mapping phys_mappings;
> +};
> +
> +
> +/**
> + * struct parent_ops - Structure to be registered for each parent device to
> + * register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Default attributes of the parent device.
> + * @mdev_attr_groups:	Default attributes of the mediated device.
> + * @supported_config:	Called to get information about supported types.
> + *			@dev : device structure of parent device.
> + *			@config: should return string listing supported config
> + *			Returns integer: success (0) or error (< 0)
> + * @create:		Called to allocate basic resources in parent device's
> + *			driver for a particular mediated device. It is
> + *			mandatory to provide create ops.
> + *			@mdev: mdev_device structure of the mediated device
> + *			      that is being created
> + *			@mdev_params: extra parameters required by parent
> + *			device's driver.
> + *			Returns integer: success (0) or error (< 0)
> + * @destroy:		Called to free resources in parent device's driver for a
> + *			mediated device instance. It is mandatory to provide
> + *			destroy ops.
> + *			@mdev: mdev_device device structure which is being
> + *			       destroyed
> + *			Returns integer: success (0) or error (< 0)
> + *			If the VMM is running and destroy() is called, that
> + *			means the mdev is being hot-unplugged. Return an error
> + *			if the VMM is running and the driver doesn't support
> + *			mediated device hotplug.
> + * @reset:		Called to reset mediated device.
> + *			@mdev: mdev_device device structure
> + *			Returns integer: success (0) or error (< 0)
> + * @start:		Called to initiate mediated device initialization
> + *			process in parent device's driver before VMM starts.
> + *			@uuid: UUID
> + *			Returns integer: success (0) or error (< 0)
> + * @stop:		Called to teardown mediated device related resources
> + *			@uuid: UUID
> + *			Returns integer: success (0) or error (< 0)
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number of bytes to read
> + *			@pos: address.
> + *			Returns number of bytes read on success, or an error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number of bytes to be written
> + *			@pos: address.
> + *			Returns number of bytes written on success, or an error.
> + * @set_irqs:		Called to convey the interrupt configuration
> + *			information that the VMM sets.
> + *			@mdev: mediated device structure
> + *			@flags, index, start, count and *data : same as that of
> + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> + * @get_region_info:	Called to get VFIO region size and flags of mediated
> + *			device.
> + *			@mdev: mediated device structure
> + *			@region_index: VFIO region index
> + *			@region_info: output, returns size and flags of
> + *				      requested region.
> + *			Returns integer: success (0) or error (< 0)
> + * @validate_map_request: Validate remap pfn request
> + *			@mdev: mediated device structure
> + *			@pos: address
> + *			@virtaddr: target user address to start at. Vendor
> + *				   driver can change if required.
> + *			@pfn: parent address of kernel memory, vendor driver
> + *			      can change if required.
> + *			@size: size of map area, vendor driver can change the
> + *			       size of map area if desired.
> + *			@prot: page protection flags for this mapping, vendor
> + *			       driver can change, if required.
> + *			Returns integer: success (0) or error (< 0)
> + *
> + * A parent device that supports mediated devices should be registered with
> + * the mdev module with a parent_ops structure.
> + */
> +
> +struct parent_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +
> +	int	(*supported_config)(struct device *dev, char *config);
> +	int     (*create)(struct mdev_device *mdev, char *mdev_params);
> +	int     (*destroy)(struct mdev_device *mdev);
> +	int     (*reset)(struct mdev_device *mdev);
> +	int     (*start)(uuid_le uuid);
> +	int     (*stop)(uuid_le uuid);
> +	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
> +			loff_t pos);
> +	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
> +			 loff_t pos);
> +	int     (*set_irqs)(struct mdev_device *mdev, uint32_t flags,
> +			    unsigned int index, unsigned int start,
> +			    unsigned int count, void *data);
> +	int	(*get_region_info)(struct mdev_device *mdev, int region_index,
> +				   struct vfio_region_info *region_info);
> +	int	(*validate_map_request)(struct mdev_device *mdev, loff_t pos,
> +					u64 *virtaddr, unsigned long *pfn,
> +					unsigned long *size, pgprot_t *prot);
> +};
> +
> +/*
> + * Parent Device
> + */
> +
> +struct parent_device {
> +	struct device		*dev;
> +	const struct parent_ops	*ops;
> +
> +	/* internal */
> +	struct kref		ref;
> +	struct list_head	next;
> +	struct list_head	mdev_list;
> +	struct mutex		mdev_list_lock;
> +	wait_queue_head_t	release_done;
> +};
> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @match: called when a new device or driver is added for this bus. Return 1
> + *	   if the given device can be handled by the given driver, zero otherwise.
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	int  (*match)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}
> +
> +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
> +{
> +	return mdev->driver_data;
> +}
> +
> +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
> +{
> +	mdev->driver_data = data;
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct parent_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
> +extern void mdev_put_device(struct mdev_device *mdev);
> +
> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
> +
> +extern int mdev_device_invalidate_mapping(struct mdev_device *mdev,
> +					unsigned long addr, unsigned long size);
> +
> +extern int mdev_add_phys_mapping(struct mdev_device *mdev,
> +				 struct address_space *mapping,
> +				 unsigned long addr, unsigned long size);
> +
> +extern void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr);
> +#endif /* MDEV_H */



* Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver
@ 2016-08-09 19:00     ` Alex Williamson
  0 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-09 19:00 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Thu, 4 Aug 2016 00:33:51 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Design for Mediated Device Driver:
> The main purpose of this driver is to provide a common interface for
> mediated device management that can be used by drivers of different
> devices.
> 
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
> 
> Below is a high-level block diagram, with NVIDIA, Intel and IBM devices
> as examples, since these are the devices that will actively use this
> module for now.
> 
>  +---------------+
>  |               |
>  | +-----------+ |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |  mdev     | +------------------------>+              |<-> VFIO user
>  | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
>  | |  driver   | |                         |              |
>  | |           | |                         +--------------+
>  | |           | |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |           | +------------------------>+              |<-> VFIO user
>  | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
>  |               |                         |              |
>  |  MDEV CORE    |                         +--------------+
>  |   MODULE      |
>  |   mdev.ko     |
>  | +-----------+ |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         |  nvidia.ko   |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | | Physical  | |
>  | |  device   | |  mdev_register_device() +--------------+
>  | | interface | |<------------------------+              |
>  | |           | |                         |  i915.ko     |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | |           | |
>  | |           | |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         | ccw_device.ko|<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | +-----------+ |
>  +---------------+
> 
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
> 
> /**
>   * struct mdev_driver - Mediated device's driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove: called when device removed
>   * @match: called when new device or driver is added for this bus.
> 	    Return 1 if given device can be handled by given driver and
> 	    zero otherwise.
>   * @driver: device driver structure
>   *
>   **/
> struct mdev_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
>          int  (*match)(struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
> 
> A mediated device bus driver should use this interface to register with
> the core driver. With this, the mediated device bus driver is
> responsible for adding the mediated device to the VFIO group.
> 
> 2. Physical device driver interface
> This interface provides vendor driver the set APIs to manage physical
> device related work in their own driver. APIs are :
> - supported_config: provide the list of configurations supported by the
> 		    vendor driver
> - create: to allocate basic resources in vendor driver for a mediated
> 	  device.
> - destroy: to free resources in vendor driver when mediated device is
> 	   destroyed.
> - reset: to free and reallocate resources in vendor driver during reboot
> - start: to initiate mediated device initialization process from vendor
> 	 driver
> - stop: to tear down mediated device resources.
> - read: read emulation callback.
> - write: write emulation callback.
> - set_irqs: send interrupt configuration information that VMM sets.
> - get_region_info: to provide region size and its flags for the mediated
> 		   device.
> - validate_map_request: to validate remap pfn request.
> 
> This registration interface should be used by vendor drivers to register
> each physical device with the mdev core driver.
> Locks to serialize the above callbacks have been removed. If required, the
> vendor driver can have its own locks to serialize these callbacks.
> 
> Added support to keep track of physical mappings for each mdev device.
> APIs to be used by mediated device bus driver to add and delete mappings to
> tracking logic:
> int mdev_add_phys_mapping(struct mdev_device *mdev,
>                           struct address_space *mapping,
>                           unsigned long addr, unsigned long size)
> void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr)
> 
> API to be used by vendor driver to invalidate mapping:
> int mdev_device_invalidate_mapping(struct mdev_device *mdev,
>                                    unsigned long addr, unsigned long size)
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I73a5084574270b14541c529461ea2f03c292d510
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/Makefile            |   1 +
>  drivers/vfio/mdev/Kconfig        |  12 +
>  drivers/vfio/mdev/Makefile       |   5 +
>  drivers/vfio/mdev/mdev_core.c    | 676 +++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev_driver.c  | 142 ++++++++
>  drivers/vfio/mdev/mdev_private.h |  33 ++
>  drivers/vfio/mdev/mdev_sysfs.c   | 269 ++++++++++++++++
>  include/linux/mdev.h             | 236 ++++++++++++++
>  9 files changed, 1375 insertions(+)
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev_core.c
>  create mode 100644 drivers/vfio/mdev/mdev_driver.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>  create mode 100644 include/linux/mdev.h
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
>  
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>  source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..4a23c13b6be4 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_VFIO_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..a34fbc66f92f
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,12 @@
> +
> +config VFIO_MDEV
> +    tristate "Mediated device driver framework"
> +    depends on VFIO
> +    default n
> +    help
> +        Provides a framework to virtualize devices.
> +        See Documentation/vfio-mediated-device.txt for more details.
> +
> +        If you don't know what to do here, say N.
> +
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..56a75e689582
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,5 @@
> +
> +mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
> +
> +obj-$(CONFIG_VFIO_MDEV) += mdev.o
> +
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> new file mode 100644
> index 000000000000..90ff073abfce
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -0,0 +1,676 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/sched.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +#define MDEV_CLASS_NAME		"mdev"
> +
> +static LIST_HEAD(parent_list);
> +static DEFINE_MUTEX(parent_list_lock);
> +
> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}
> +
> +/* Should be called holding parent->mdev_list_lock */

I often like to prepend "__" onto the name of functions like this to
signal a special calling convention.

> +static struct mdev_device *find_mdev_device(struct parent_device *parent,
> +					    uuid_le uuid, int instance)
> +{
> +	struct mdev_device *mdev;
> +
> +	list_for_each_entry(mdev, &parent->mdev_list, next) {
> +		if ((uuid_le_cmp(mdev->uuid, uuid) == 0) &&
> +		    (mdev->instance == instance))
> +			return mdev;
> +	}
> +	return NULL;
> +}
> +
> +/* Should be called holding parent_list_lock */
> +static struct parent_device *find_parent_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +
> +	list_for_each_entry(parent, &parent_list, next) {
> +		if (parent->dev == dev)
> +			return parent;
> +	}
> +	return NULL;
> +}
> +
> +static void mdev_release_parent(struct kref *kref)
> +{
> +	struct parent_device *parent = container_of(kref, struct parent_device,
> +						    ref);
> +	kfree(parent);
> +}
> +
> +static inline struct parent_device *
> +mdev_get_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_get(&parent->ref);
> +
> +	return parent;
> +}
> +
> +static inline void mdev_put_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_put(&parent->ref, mdev_release_parent);
> +}
> +
> +static struct parent_device *mdev_get_parent_by_dev(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	mutex_lock(&parent_list_lock);
> +	list_for_each_entry(p, &parent_list, next) {
> +		if (p->dev == dev) {
> +			parent = mdev_get_parent(p);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&parent_list_lock);
> +	return parent;

Use what you've created:

{
	struct parent_device *parent;

	mutex_lock(&parent_list_lock);
	parent = mdev_get_parent(find_parent_device(dev));
	mutex_unlock(&parent_list_lock);

	return parent;
}

> +}
> +
> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	ret = parent->ops->create(mdev, mdev_params);
> +	if (ret)
> +		return ret;
> +
> +	ret = mdev_add_attribute_group(&mdev->dev,
> +					parent->ops->mdev_attr_groups);
> +	if (ret)
> +		parent->ops->destroy(mdev);
> +
> +	return ret;
> +}
> +
> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	/*
> +	 * If vendor driver doesn't return success that means vendor
> +	 * driver doesn't support hot-unplug
> +	 */
> +	ret = parent->ops->destroy(mdev);
> +	if (ret && !force)
> +		return -EBUSY;

This still seems troublesome; I'm not sure why we don't just require
hot-unplug support.  Without it, we seem to have a limbo state where a
device exists, but not fully.

> +
> +	mdev_remove_attribute_group(&mdev->dev,
> +				    parent->ops->mdev_attr_groups);
> +
> +	return ret;
> +}
> +
> +static void mdev_release_device(struct kref *kref)
> +{
> +	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
> +	struct parent_device *parent = mdev->parent;
> +
> +	list_del(&mdev->next);
> +	mutex_unlock(&parent->mdev_list_lock);

Maybe worthy of a short comment to more obviously match this unlock to
the kref_put_mutex() below.

> +
> +	device_unregister(&mdev->dev);
> +	wake_up(&parent->release_done);
> +	mdev_put_parent(parent);
> +}
> +
> +struct mdev_device *mdev_get_device(struct mdev_device *mdev)
> +{
> +	kref_get(&mdev->ref);
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device);

Is the intention here that the caller already has a reference to mdev
and wants to get another?  Otherwise I only see cases where locking is
used to get this reference.  There's potential to misuse this, if not
outright abuse it, which worries me.  A reference cannot be
spontaneously generated, it needs to be sourced from somewhere.

> +
> +void mdev_put_device(struct mdev_device *mdev)
> +{
> +	struct parent_device *parent = mdev->parent;
> +
> +	kref_put_mutex(&mdev->ref, mdev_release_device,
> +		       &parent->mdev_list_lock);
> +}
> +EXPORT_SYMBOL(mdev_put_device);
> +
> +/*
> + * Find first mediated device from given uuid and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +static struct mdev_device *mdev_get_first_device_by_uuid(uuid_le uuid)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_list_lock);
> +	list_for_each_entry(parent, &parent_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (uuid_le_cmp(p->uuid, uuid) == 0) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_list_lock);
> +	return mdev;
> +}

This is used later by mdev_device_start() and mdev_device_stop() to get
the parent_device so it can call the start and stop ops callbacks
respectively.  That seems to imply that all of instances for a given
uuid come from the same parent_device.  Where is that enforced?  I'm
still having a hard time buying into the uuid+instance plan when it
seems like each mdev_device should have an actual unique uuid.
Userspace tools can figure out which uuids to start for a given user, I
don't see much value in collecting them to instances within a uuid.

> +
> +/*
> + * Find mediated device from given iommu_group and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_list_lock);
> +	list_for_each_entry(parent, &parent_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (!p->group)
> +				continue;
> +
> +			if (iommu_group_id(p->group) == iommu_group_id(group)) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_list_lock);
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device_by_group);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	/* check for mandatory ops */
> +	if (!ops->create || !ops->destroy)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_list_lock);
> +
> +	/* Check for duplicate */
> +	parent = find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +	list_add(&parent->next, &parent_list);
> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +	mutex_init(&parent->mdev_list_lock);
> +	INIT_LIST_HEAD(&parent->mdev_list);
> +	init_waitqueue_head(&parent->release_done);
> +	mutex_unlock(&parent_list_lock);
> +
> +	ret = mdev_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_lock(&parent_list_lock);
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a parent device
> + * @dev: device structure representing parent device.
> + *
> + * Remove device from list of registered parent devices. Give a chance to free
> + * existing mediated devices for given device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +	struct mdev_device *mdev, *n;

Above, *p was used as a temp pointer.

> +	int ret;
> +
> +	mutex_lock(&parent_list_lock);
> +	parent = find_parent_device(dev);
> +
> +	if (!parent) {
> +		mutex_unlock(&parent_list_lock);
> +		return;
> +	}
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	/*
> +	 * Remove parent from the list and remove create and destroy sysfs

Quoting "create" and "destroy" would make this a bit more readable:

	Remove parent from the list and remove "create" and "destroy"
	sysfs...

Took me a couple reads to figure out "remove create" wasn't a typo.

> +	 * files so that no new mediated device could be created for this parent
> +	 */
> +	list_del(&parent->next);
> +	mdev_remove_sysfs_files(dev);
> +	mutex_unlock(&parent_list_lock);
> +
> +	mdev_remove_attribute_group(dev,
> +				    parent->ops->dev_attr_groups);
> +

Why do we need to remove sysfs files under the parent_list_lock?

> +	mutex_lock(&parent->mdev_list_lock);
> +	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
> +		mdev_device_destroy_ops(mdev, true);
> +		mutex_unlock(&parent->mdev_list_lock);
> +		mdev_put_device(mdev);
> +		mutex_lock(&parent->mdev_list_lock);

*cringe*  Any time we need to release the list lock inside the
traversal makes me nervous.  What about using list_first_entry() since
I don't think using list_for_each_entry_safe() really makes it safe
from concurrent operations on the list once we drop that lock.

> +	}
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	do {
> +		ret = wait_event_interruptible_timeout(parent->release_done,
> +				list_empty(&parent->mdev_list), HZ * 10);
> +		if (ret == -ERESTARTSYS) {
> +			dev_warn(dev, "Mediated devices are in use, task"
> +				      " \"%s\" (%d) "
> +				      "blocked until all are released",
> +				      current->comm, task_pid_nr(current));
> +		}
> +	} while (ret <= 0);
> +
> +	mdev_put_parent(parent);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev_sysfs
> + */
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev);
> +}
> +
> +int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +		       char *mdev_params)
> +{
> +	int ret;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	/* Check for duplicate */
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (mdev) {
> +		ret = -EEXIST;
> +		goto create_err;
> +	}
> +
> +	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> +	if (!mdev) {
> +		ret = -ENOMEM;
> +		goto create_err;
> +	}
> +
> +	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> +	mdev->instance = instance;
> +	mdev->parent = parent;
> +	kref_init(&mdev->ref);
> +
> +	mdev->dev.parent  = dev;
> +	mdev->dev.bus     = &mdev_bus_type;
> +	mdev->dev.release = mdev_device_release;
> +	dev_set_name(&mdev->dev, "%pUl-%d", uuid.b, instance);
> +
> +	ret = device_register(&mdev->dev);
> +	if (ret) {
> +		put_device(&mdev->dev);
> +		goto create_err;
> +	}
> +
> +	ret = mdev_device_create_ops(mdev, mdev_params);
> +	if (ret)
> +		goto create_failed;
> +
> +	list_add(&mdev->next, &parent->mdev_list);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	dev_dbg(&mdev->dev, "MDEV: created\n");
> +
> +	return ret;
> +
> +create_failed:
> +	device_unregister(&mdev->dev);
> +
> +create_err:
> +	mutex_unlock(&parent->mdev_list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance)
> +{
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	int ret;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (!mdev) {
> +		ret = -EINVAL;

-ENODEV?

> +		goto destroy_err;
> +	}
> +
> +	ret = mdev_device_destroy_ops(mdev, false);
> +	if (ret)
> +		goto destroy_err;
> +
> +	mutex_unlock(&parent->mdev_list_lock);
> +	mdev_put_device(mdev);
> +
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +destroy_err:
> +	mutex_unlock(&parent->mdev_list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_invalidate_mapping(struct mdev_device *mdev,
> +				   unsigned long addr, unsigned long size)
> +{
> +	int ret = -EINVAL;
> +	struct mdev_phys_mapping *phys_mappings;
> +	struct addr_desc *addr_desc;
> +
> +	if (!mdev || !mdev->phys_mappings.mapping)
> +		return ret;
> +
> +	phys_mappings = &mdev->phys_mappings;
> +
> +	mutex_lock(&phys_mappings->addr_desc_list_lock);
> +
> +	list_for_each_entry(addr_desc, &phys_mappings->addr_desc_list, next) {
> +
> +		if ((addr > addr_desc->start) &&
> +		    (addr + size < addr_desc->start + addr_desc->size)) {

This looks incomplete, minimally I think these should be >= and <=, but
that still only covers fully enclosed invalidation ranges.  Do we need
to support partial invalidations?

> +			unmap_mapping_range(phys_mappings->mapping,
> +					    addr, size, 0);
> +			ret = 0;
> +			goto unlock_exit;

If partial overlaps can occur, we'll need an exhaustive search.

> +		}
> +	}
> +
> +unlock_exit:
> +	mutex_unlock(&phys_mappings->addr_desc_list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_device_invalidate_mapping);
> +
> +/* Sanity check for the physical mapping list for mediated device */
> +
> +int mdev_add_phys_mapping(struct mdev_device *mdev,
> +			  struct address_space *mapping,
> +			  unsigned long addr, unsigned long size)
> +{
> +	struct mdev_phys_mapping *phys_mappings;
> +	struct addr_desc *addr_desc, *new_addr_desc;
> +	int ret = 0;
> +
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	phys_mappings = &mdev->phys_mappings;
> +	if (phys_mappings->mapping && (mapping != phys_mappings->mapping))
> +		return -EINVAL;
> +
> +	if (!phys_mappings->mapping) {
> +		phys_mappings->mapping = mapping;
> +		mutex_init(&phys_mappings->addr_desc_list_lock);
> +		INIT_LIST_HEAD(&phys_mappings->addr_desc_list);
> +	}

This looks racy, should we be acquiring the mutex earlier?

> +
> +	mutex_lock(&phys_mappings->addr_desc_list_lock);
> +
> +	list_for_each_entry(addr_desc, &phys_mappings->addr_desc_list, next) {
> +		if ((addr + size < addr_desc->start) ||
> +		    (addr_desc->start + addr_desc->size) < addr)

<= on both, I think

> +			continue;
> +		else {
> +			/* should be no overlap */
> +			ret = -EINVAL;
> +			goto mapping_exit;
> +		}
> +	}
> +
> +	/* add the new entry to the list */
> +	new_addr_desc = kzalloc(sizeof(*new_addr_desc), GFP_KERNEL);
> +
> +	if (!new_addr_desc) {
> +		ret = -ENOMEM;
> +		goto mapping_exit;
> +	}
> +
> +	new_addr_desc->start = addr;
> +	new_addr_desc->size = size;
> +	list_add(&new_addr_desc->next, &phys_mappings->addr_desc_list);
> +
> +mapping_exit:
> +	mutex_unlock(&phys_mappings->addr_desc_list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_add_phys_mapping);
> +
> +void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr)
> +{
> +	struct mdev_phys_mapping *phys_mappings;
> +	struct addr_desc *addr_desc;
> +
> +	if (!mdev)
> +		return;
> +
> +	phys_mappings = &mdev->phys_mappings;
> +
> +	mutex_lock(&phys_mappings->addr_desc_list_lock);
> +	list_for_each_entry(addr_desc, &phys_mappings->addr_desc_list, next) {
> +		if (addr_desc->start == addr) {
> +			list_del(&addr_desc->next);
> +			kfree(addr_desc);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&phys_mappings->addr_desc_list_lock);
> +}
> +EXPORT_SYMBOL(mdev_del_phys_mapping);
> +
> +void mdev_device_supported_config(struct device *dev, char *str)
> +{
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +
> +	if (parent) {
> +		if (parent->ops->supported_config)
> +			parent->ops->supported_config(parent->dev, str);
> +		mdev_put_parent(parent);
> +	}
> +}
> +
> +int mdev_device_start(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	if (parent->ops->start)
> +		ret = parent->ops->start(mdev->uuid);

Assumes uuids do not span parent_devices?

> +
> +	if (ret)
> +		pr_err("mdev_start failed  %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
> +
> +	mdev_put_device(mdev);
> +
> +	return ret;
> +}
> +
> +int mdev_device_stop(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	if (parent->ops->stop)
> +		ret = parent->ops->stop(mdev->uuid);
> +
> +	if (ret)
> +		pr_err("mdev stop failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +}
> +
> +static struct class mdev_class = {
> +	.name		= MDEV_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= mdev_class_attrs,
> +};
> +
> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	ret = class_register(&mdev_class);
> +	if (ret) {
> +		pr_err("Failed to register mdev class\n");
> +		return ret;
> +	}
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		class_unregister(&mdev_class);
> +		return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	mdev_bus_unregister();
> +	class_unregister(&mdev_class);
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
> new file mode 100644
> index 000000000000..00680bd06224
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_driver.c
> @@ -0,0 +1,142 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdev_attach_iommu(struct mdev_device *mdev)
> +{
> +	int ret;
> +	struct iommu_group *group;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	ret = iommu_group_add_device(group, &mdev->dev);
> +	if (ret) {
> +		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdev->group = group;
> +
> +	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return ret;
> +}
> +
> +static void mdev_detach_iommu(struct mdev_device *mdev)
> +{
> +	iommu_group_remove_device(&mdev->dev);
> +	mdev->group = NULL;
> +	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdev_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	ret = mdev_attach_iommu(mdev);
> +	if (ret) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return ret;
> +	}
> +
> +	if (drv && drv->probe)
> +		ret = drv->probe(dev);
> +
> +	if (ret)
> +		mdev_detach_iommu(mdev);
> +
> +	return ret;
> +}
> +
> +static int mdev_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdev_detach_iommu(mdev);
> +
> +	return 0;
> +}
> +
> +static int mdev_match(struct device *dev, struct device_driver *driver)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(driver);
> +
> +	if (drv && drv->match)
> +		return drv->match(dev);
> +
> +	return 0;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.match		= mdev_match,
> +	.probe		= mdev_probe,
> +	.remove		= mdev_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/*
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: module owner of driver to be registered
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/*
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..ee2db61a8091
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,33 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +			char *mdev_params);
> +int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
> +void mdev_device_supported_config(struct device *dev, char *str);
> +int  mdev_device_start(uuid_le uuid);
> +int  mdev_device_stop(uuid_le uuid);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> new file mode 100644
> index 000000000000..e0457e68cf78
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -0,0 +1,269 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +/* Static functions */
> +
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	4096
> +
> +/* mdev sysfs Functions */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str, *ptr;
> +	ssize_t n;
> +
> +	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	ptr = str;
> +	mdev_device_supported_config(dev, str);
> +
> +	n = sprintf(buf, "%s\n", str);
> +	kfree(ptr);
> +
> +	return n;
> +}
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *instance_str, *mdev_params = NULL, *params = NULL;
> +	uuid_le uuid;
> +	uint32_t instance;
> +	int ret;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_create: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev instance not present %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	instance_str = strsep(&str, ":");
> +	if (!instance_str) {
> +		pr_err("mdev_create: Empty instance string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	ret = kstrtouint(instance_str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	if (str)
> +		params = mdev_params = kstrdup(str, GFP_KERNEL);
> +
> +	ret = uuid_le_to_bin(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_create: UUID parse error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	ret = mdev_device_create(dev, uuid, instance, mdev_params);
> +	if (ret)
> +		pr_err("mdev_create: Failed to create mdev device\n");
> +	else
> +		ret = count;
> +
> +create_error:
> +	kfree(params);
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{

I wonder if we should just have a "remove" file in sysfs under the
device.

> +	char *uuid_str, *str, *pstr;
> +	uuid_le uuid;
> +	unsigned int instance;
> +	int ret;
> +
> +	str = pstr = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	if (str == NULL) {
> +		pr_err("mdev_destroy: instance not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = kstrtouint(str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_destroy: instance parsing error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = uuid_le_to_bin(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_destroy: UUID parse error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = mdev_device_destroy(dev, uuid, instance);
> +	if (ret == 0)
> +		ret = count;
> +
> +destroy_error:
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
> +			 const char *buf, size_t count)
> +{
> +	char *uuid_str, *ptr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	ret = uuid_le_to_bin(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_start: UUID parse error %s\n", buf);
> +		goto start_error;
> +	}
> +
> +	ret = mdev_device_start(uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +start_error:
> +	kfree(ptr);
> +	return ret;
> +}
> +
> +ssize_t mdev_stop_store(struct class *class, struct class_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *uuid_str, *ptr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	ret = uuid_le_to_bin(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_stop: UUID parse error %s\n", buf);
> +		goto stop_error;
> +	}
> +
> +	ret = mdev_device_stop(uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +stop_error:
> +	kfree(ptr);
> +	return ret;
> +
> +}
> +
> +struct class_attribute mdev_class_attrs[] = {
> +	__ATTR_WO(mdev_start),
> +	__ATTR_WO(mdev_stop),
> +	__ATTR_NULL
> +};
> +
> +int mdev_create_sysfs_files(struct device *dev)
> +{
> +	int ret;
> +
> +	ret = sysfs_create_file(&dev->kobj,
> +				&dev_attr_mdev_supported_types.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_supported_types sysfs entry\n");
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_create sysfs entry\n");
> +		goto create_sysfs_failed;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_destroy sysfs entry\n");
> +		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +		goto create_sysfs_failed;
> +	}
> +
> +	return 0;
> +
> +create_sysfs_failed:
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	return ret;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +}
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..0b41f301a9b7
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,236 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +#include <uapi/linux/vfio.h>
> +
> +struct parent_device;
> +
> +/*
> + * Mediated device
> + */
> +
> +struct addr_desc {
> +	unsigned long start;
> +	unsigned long size;
> +	struct list_head next;
> +};
> +
> +struct mdev_phys_mapping {
> +	struct address_space *mapping;
> +	struct list_head addr_desc_list;
> +	struct mutex addr_desc_list_lock;
> +};
> +
> +struct mdev_device {
> +	struct device		dev;
> +	struct parent_device	*parent;
> +	struct iommu_group	*group;
> +	uuid_le			uuid;
> +	uint32_t		instance;
> +	void			*driver_data;
> +
> +	/* internal only */
> +	struct kref		ref;
> +	struct list_head	next;
> +
> +	struct mdev_phys_mapping phys_mappings;
> +};
> +
> +
> +/**
> + * struct parent_ops - Structure to be registered for each parent device to
> + * register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Default attributes of the parent device.
> + * @mdev_attr_groups:	Default attributes of the mediated device.
> + * @supported_config:	Called to get information about supported types.
> + *			@dev : device structure of parent device.
> + *			@config: should return string listing supported config
> + *			Returns integer: success (0) or error (< 0)
> + * @create:		Called to allocate basic resources in parent device's
> + *			driver for a particular mediated device. It is
> + *			mandatory to provide create ops.
> + *			@mdev: mdev_device structure of the mediated device
> + *			      that is being created
> + *			@mdev_params: extra parameters required by parent
> + *			device's driver.
> + *			Returns integer: success (0) or error (< 0)
> + * @destroy:		Called to free resources in parent device's driver for
> + *			a mediated device instance. It is mandatory to provide
> + *			destroy ops.
> + *			@mdev: mdev_device device structure which is being
> + *			       destroyed
> + *			Returns integer: success (0) or error (< 0)
> + *			If the VMM is running when destroy() is called, the
> + *			mdev is being hot-unplugged. Return an error if the VMM
> + *			is running and the driver doesn't support mediated
> + *			device hotplug.
> + * @reset:		Called to reset mediated device.
> + *			@mdev: mdev_device device structure
> + *			Returns integer: success (0) or error (< 0)
> + * @start:		Called to initiate mediated device initialization
> + *			process in parent device's driver before VMM starts.
> + *			@uuid: UUID
> + *			Returns integer: success (0) or error (< 0)
> + * @stop:		Called to teardown mediated device related resources
> + *			@uuid: UUID
> + *			Returns integer: success (0) or error (< 0)
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number of bytes to read
> + *			@pos: address.
> + *			Returns number of bytes read on success, or an error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number of bytes to be written
> + *			@pos: address.
> + *			Returns number of bytes written on success, or an error.
> + * @set_irqs:		Called to convey the interrupt configuration
> + *			information that the VMM sets.
> + *			@mdev: mediated device structure
> + *			@flags, index, start, count and *data : same as that of
> + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> + * @get_region_info:	Called to get VFIO region size and flags of mediated
> + *			device.
> + *			@mdev: mediated device structure
> + *			@region_index: VFIO region index
> + *			@region_info: output, returns size and flags of
> + *				      requested region.
> + *			Returns integer: success (0) or error (< 0)
> + * @validate_map_request: Validate remap pfn request
> + *			@mdev: mediated device structure
> + *			@pos: address
> + *			@virtaddr: target user address to start at. Vendor
> + *				   driver can change if required.
> + *			@pfn: parent address of kernel memory, vendor driver
> + *			      can change if required.
> + *			@size: size of map area, vendor driver can change the
> + *			       size of map area if desired.
> + *			@prot: page protection flags for this mapping, vendor
> + *			       driver can change, if required.
> + *			Returns integer: success (0) or error (< 0)
> + *
> + * A parent device that supports mediated devices should be registered with the
> + * mdev module along with a parent_ops structure.
> + */
> +
> +struct parent_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +
> +	int	(*supported_config)(struct device *dev, char *config);
> +	int     (*create)(struct mdev_device *mdev, char *mdev_params);
> +	int     (*destroy)(struct mdev_device *mdev);
> +	int     (*reset)(struct mdev_device *mdev);
> +	int     (*start)(uuid_le uuid);
> +	int     (*stop)(uuid_le uuid);
> +	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
> +			loff_t pos);
> +	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
> +			 loff_t pos);
> +	int     (*set_irqs)(struct mdev_device *mdev, uint32_t flags,
> +			    unsigned int index, unsigned int start,
> +			    unsigned int count, void *data);
> +	int	(*get_region_info)(struct mdev_device *mdev, int region_index,
> +				   struct vfio_region_info *region_info);
> +	int	(*validate_map_request)(struct mdev_device *mdev, loff_t pos,
> +					u64 *virtaddr, unsigned long *pfn,
> +					unsigned long *size, pgprot_t *prot);
> +};
> +
> +/*
> + * Parent Device
> + */
> +
> +struct parent_device {
> +	struct device		*dev;
> +	const struct parent_ops	*ops;
> +
> +	/* internal */
> +	struct kref		ref;
> +	struct list_head	next;
> +	struct list_head	mdev_list;
> +	struct mutex		mdev_list_lock;
> +	wait_queue_head_t	release_done;
> +};
> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @match: called when new device or driver is added for this bus. Return 1 if
> + *	   given device can be handled by given driver and zero otherwise.
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	int  (*match)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}
> +
> +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
> +{
> +	return mdev->driver_data;
> +}
> +
> +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
> +{
> +	mdev->driver_data = data;
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct parent_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
> +extern void mdev_put_device(struct mdev_device *mdev);
> +
> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
> +
> +extern int mdev_device_invalidate_mapping(struct mdev_device *mdev,
> +					unsigned long addr, unsigned long size);
> +
> +extern int mdev_add_phys_mapping(struct mdev_device *mdev,
> +				 struct address_space *mapping,
> +				 unsigned long addr, unsigned long size);
> +
> +
> +extern void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr);
> +#endif /* MDEV_H */


* Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
  2016-08-03 19:03   ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-09 19:00     ` Alex Williamson
  -1 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-09 19:00 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Thu, 4 Aug 2016 00:33:52 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> MPCI VFIO driver registers with MDEV core driver. MDEV core driver creates
> mediated device and calls probe routine of MPCI VFIO driver. This driver
> adds mediated device to VFIO core module.
> Main aim of this module is to manage all VFIO APIs for each mediated PCI
> device. Those are:
> - get region information from vendor driver.
> - trap and emulate PCI config space and BAR region.
> - Send interrupt configuration information to vendor driver.
> - Device reset
> - mmap mappable region with invalidate mapping and fault on access to
>   remap pfns. If validate_map_request() is not provided by vendor driver,
>   fault handler maps physical devices region.
> - Add and delete mappable region's physical mappings to mdev's mapping
>   tracking logic.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
> ---
>  drivers/vfio/mdev/Kconfig           |   6 +
>  drivers/vfio/mdev/Makefile          |   1 +
>  drivers/vfio/mdev/vfio_mpci.c       | 536 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/vfio_pci_private.h |   6 -
>  drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
>  include/linux/vfio.h                |   7 +
>  6 files changed, 551 insertions(+), 6 deletions(-)
>  create mode 100644 drivers/vfio/mdev/vfio_mpci.c
> 
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> index a34fbc66f92f..431ed595c8da 100644
> --- a/drivers/vfio/mdev/Kconfig
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -9,4 +9,10 @@ config VFIO_MDEV
>  
>          If you don't know what do here, say N.
>  
> +config VFIO_MPCI
> +    tristate "VFIO support for Mediated PCI devices"
> +    depends on VFIO && PCI && VFIO_MDEV
> +    default n
> +    help
> +        VFIO based driver for mediated PCI devices.
>  
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> index 56a75e689582..264fb03dd0e3 100644
> --- a/drivers/vfio/mdev/Makefile
> +++ b/drivers/vfio/mdev/Makefile
> @@ -2,4 +2,5 @@
>  mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
>  
>  obj-$(CONFIG_VFIO_MDEV) += mdev.o
> +obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
>  
> diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
> new file mode 100644
> index 000000000000..9da94b76ae3e
> --- /dev/null
> +++ b/drivers/vfio/mdev/vfio_mpci.c
> @@ -0,0 +1,536 @@
> +/*
> + * VFIO based Mediated PCI device driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "NVIDIA Corporation"
> +#define DRIVER_DESC     "VFIO based Mediated PCI device driver"
> +
> +struct vfio_mdev {
> +	struct iommu_group *group;
> +	struct mdev_device *mdev;
> +	int		    refcnt;
> +	struct vfio_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
> +	struct mutex	    vfio_mdev_lock;
> +};
> +
> +static int vfio_mpci_open(void *device_data)
> +{
> +	int ret = 0;
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +
> +	mutex_lock(&vmdev->vfio_mdev_lock);
> +	if (!vmdev->refcnt && parent->ops->get_region_info) {
> +		int index;
> +
> +		for (index = VFIO_PCI_BAR0_REGION_INDEX;
> +		     index < VFIO_PCI_NUM_REGIONS; index++) {
> +			ret = parent->ops->get_region_info(vmdev->mdev, index,
> +					      &vmdev->vfio_region_info[index]);
> +			if (ret)
> +				goto open_error;
> +		}
> +	}
> +
> +	vmdev->refcnt++;
> +
> +open_error:
> +	mutex_unlock(&vmdev->vfio_mdev_lock);
> +	if (ret)
> +		module_put(THIS_MODULE);
> +
> +	return ret;
> +}
> +
> +static void vfio_mpci_close(void *device_data)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +
> +	mutex_lock(&vmdev->vfio_mdev_lock);
> +	vmdev->refcnt--;
> +	if (!vmdev->refcnt) {
> +		memset(&vmdev->vfio_region_info, 0,
> +			sizeof(vmdev->vfio_region_info));
> +	}
> +	mutex_unlock(&vmdev->vfio_mdev_lock);
> +	module_put(THIS_MODULE);
> +}
> +
> +static u8 mpci_find_pci_capability(struct mdev_device *mdev, u8 capability)
> +{
> +	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);

This creates a fixed ABI between vfio-mdev-pci and vendor drivers that
a given region starts at a pre-defined offset.  We have the offset
stored in vfio_mdev.region_info[VFIO_PCI_CONFIG_REGION_INDEX].offset,
use it.  It's just as unacceptable to impose this fixed relationship
with a vendor driver here as if a userspace driver were to do the same.

> +	struct parent_device *parent = mdev->parent;
> +	u16 status;
> +	u8  cap_ptr, cap_id = 0xff;
> +
> +	parent->ops->read(mdev, (char *)&status, sizeof(status),
> +			  pos + PCI_STATUS);
> +	if (!(status & PCI_STATUS_CAP_LIST))
> +		return 0;
> +
> +	parent->ops->read(mdev, &cap_ptr, sizeof(cap_ptr),
> +			  pos + PCI_CAPABILITY_LIST);
> +
> +	do {
> +		cap_ptr &= 0xfc;
> +		parent->ops->read(mdev, &cap_id, sizeof(cap_id),
> +				  pos + cap_ptr + PCI_CAP_LIST_ID);
> +		if (cap_id == capability)
> +			return cap_ptr;
> +		parent->ops->read(mdev, &cap_ptr, sizeof(cap_ptr),
> +				  pos + cap_ptr + PCI_CAP_LIST_NEXT);
> +	} while (cap_ptr && cap_id != 0xff);
> +
> +	return 0;
> +}
> +
> +static int mpci_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
> +{
> +	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct parent_device *parent = mdev->parent;
> +
> +	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
> +		u8 pin;
> +
> +		parent->ops->read(mdev, &pin, sizeof(pin),
> +				  pos + PCI_INTERRUPT_PIN);
> +		if (IS_ENABLED(CONFIG_VFIO_PCI_INTX) && pin)
> +			return 1;
> +
> +	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
> +		u8 cap_ptr;
> +		u16 flags;
> +
> +		cap_ptr = mpci_find_pci_capability(mdev, PCI_CAP_ID_MSI);
> +		if (cap_ptr) {
> +			parent->ops->read(mdev, (char *)&flags, sizeof(flags),
> +					pos + cap_ptr + PCI_MSI_FLAGS);
> +			return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
> +		}
> +	} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
> +		u8 cap_ptr;
> +		u16 flags;
> +
> +		cap_ptr = mpci_find_pci_capability(mdev, PCI_CAP_ID_MSIX);
> +		if (cap_ptr) {
> +			parent->ops->read(mdev, (char *)&flags, sizeof(flags),
> +					pos + cap_ptr + PCI_MSIX_FLAGS);
> +
> +			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
> +		}
> +	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
> +		u8 cap_ptr;
> +
> +		cap_ptr = mpci_find_pci_capability(mdev, PCI_CAP_ID_EXP);
> +		if (cap_ptr)
> +			return 1;
> +	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
> +		return 1;
> +	}

Much better than previous versions, but use the region_info provided by
the vendor driver.  Maybe you want helpers such as
mpci_config_{read,write}{b,w,l}.

> +
> +	return 0;
> +}
> +
> +static long vfio_mpci_unlocked_ioctl(void *device_data,
> +				     unsigned int cmd, unsigned long arg)
> +{
> +	int ret = 0;
> +	struct vfio_mdev *vmdev = device_data;
> +	unsigned long minsz;
> +
> +	switch (cmd) {
> +	case VFIO_DEVICE_GET_INFO:
> +	{
> +		struct vfio_device_info info;
> +		struct parent_device *parent = vmdev->mdev->parent;
> +
> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		info.flags = VFIO_DEVICE_FLAGS_PCI;
> +
> +		if (parent->ops->reset)
> +			info.flags |= VFIO_DEVICE_FLAGS_RESET;
> +
> +		info.num_regions = VFIO_PCI_NUM_REGIONS;
> +		info.num_irqs = VFIO_PCI_NUM_IRQS;
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_DEVICE_GET_REGION_INFO:
> +	{
> +		struct vfio_region_info info;
> +
> +		minsz = offsetofend(struct vfio_region_info, offset);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		switch (info.index) {
> +		case VFIO_PCI_CONFIG_REGION_INDEX:
> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);

No, vmdev->vfio_region_info[info.index].offset

> +			info.size = vmdev->vfio_region_info[info.index].size;
> +			if (!info.size) {
> +				info.flags = 0;
> +				break;
> +			}
> +
> +			info.flags = vmdev->vfio_region_info[info.index].flags;
> +			break;
> +		case VFIO_PCI_VGA_REGION_INDEX:
> +		case VFIO_PCI_ROM_REGION_INDEX:

Why?  Let the vendor driver decide.

> +		default:
> +			return -EINVAL;
> +		}
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_DEVICE_GET_IRQ_INFO:
> +	{
> +		struct vfio_irq_info info;
> +
> +		minsz = offsetofend(struct vfio_irq_info, count);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
> +			return -EINVAL;
> +
> +		switch (info.index) {
> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
> +		case VFIO_PCI_REQ_IRQ_INDEX:
> +			break;
> +			/* pass thru to return error */
> +		case VFIO_PCI_MSIX_IRQ_INDEX:

???

> +		default:
> +			return -EINVAL;
> +		}
> +
> +		info.flags = VFIO_IRQ_INFO_EVENTFD;
> +		info.count = mpci_get_irq_count(vmdev, info.index);
> +
> +		if (info.count == -1)
> +			return -EINVAL;
> +
> +		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
> +			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
> +					VFIO_IRQ_INFO_AUTOMASKED);
> +		else
> +			info.flags |= VFIO_IRQ_INFO_NORESIZE;
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_DEVICE_SET_IRQS:
> +	{
> +		struct vfio_irq_set hdr;
> +		struct mdev_device *mdev = vmdev->mdev;
> +		struct parent_device *parent = mdev->parent;
> +		u8 *data = NULL, *ptr = NULL;
> +
> +		minsz = offsetofend(struct vfio_irq_set, count);
> +
> +		if (copy_from_user(&hdr, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
> +		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
> +				  VFIO_IRQ_SET_ACTION_TYPE_MASK))
> +			return -EINVAL;
> +
> +		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
> +			size_t size;
* Re: [Qemu-devel] [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
@ 2016-08-09 19:00     ` Alex Williamson
  0 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-09 19:00 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Thu, 4 Aug 2016 00:33:52 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> MPCI VFIO driver registers with MDEV core driver. MDEV core driver creates
> mediated device and calls probe routine of MPCI VFIO driver. This driver
> adds mediated device to VFIO core module.
> Main aim of this module is to manage all VFIO APIs for each mediated PCI
> device. Those are:
> - get region information from vendor driver.
> - trap and emulate PCI config space and BAR region.
> - Send interrupt configuration information to vendor driver.
> - Device reset
> - mmap mappable region with invalidate mapping and fault on access to
>   remap pfns. If validate_map_request() is not provided by vendor driver,
>   fault handler maps physical devices region.
> - Add and delete mappable region's physical mappings to mdev's mapping
>   tracking logic.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
> ---
>  drivers/vfio/mdev/Kconfig           |   6 +
>  drivers/vfio/mdev/Makefile          |   1 +
>  drivers/vfio/mdev/vfio_mpci.c       | 536 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/vfio_pci_private.h |   6 -
>  drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
>  include/linux/vfio.h                |   7 +
>  6 files changed, 551 insertions(+), 6 deletions(-)
>  create mode 100644 drivers/vfio/mdev/vfio_mpci.c
> 
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> index a34fbc66f92f..431ed595c8da 100644
> --- a/drivers/vfio/mdev/Kconfig
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -9,4 +9,10 @@ config VFIO_MDEV
>  
>          If you don't know what do here, say N.
>  
> +config VFIO_MPCI
> +    tristate "VFIO support for Mediated PCI devices"
> +    depends on VFIO && PCI && VFIO_MDEV
> +    default n
> +    help
> +        VFIO based driver for mediated PCI devices.
>  
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> index 56a75e689582..264fb03dd0e3 100644
> --- a/drivers/vfio/mdev/Makefile
> +++ b/drivers/vfio/mdev/Makefile
> @@ -2,4 +2,5 @@
>  mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
>  
>  obj-$(CONFIG_VFIO_MDEV) += mdev.o
> +obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
>  
> diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
> new file mode 100644
> index 000000000000..9da94b76ae3e
> --- /dev/null
> +++ b/drivers/vfio/mdev/vfio_mpci.c
> @@ -0,0 +1,536 @@
> +/*
> + * VFIO based Mediated PCI device driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "NVIDIA Corporation"
> +#define DRIVER_DESC     "VFIO based Mediated PCI device driver"
> +
> +struct vfio_mdev {
> +	struct iommu_group *group;
> +	struct mdev_device *mdev;
> +	int		    refcnt;
> +	struct vfio_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
> +	struct mutex	    vfio_mdev_lock;
> +};
> +
> +static int vfio_mpci_open(void *device_data)
> +{
> +	int ret = 0;
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +
> +	mutex_lock(&vmdev->vfio_mdev_lock);
> +	if (!vmdev->refcnt && parent->ops->get_region_info) {
> +		int index;
> +
> +		for (index = VFIO_PCI_BAR0_REGION_INDEX;
> +		     index < VFIO_PCI_NUM_REGIONS; index++) {
> +			ret = parent->ops->get_region_info(vmdev->mdev, index,
> +					      &vmdev->vfio_region_info[index]);
> +			if (ret)
> +				goto open_error;
> +		}
> +	}
> +
> +	vmdev->refcnt++;
> +
> +open_error:
> +	mutex_unlock(&vmdev->vfio_mdev_lock);
> +	if (ret)
> +		module_put(THIS_MODULE);
> +
> +	return ret;
> +}
> +
> +static void vfio_mpci_close(void *device_data)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +
> +	mutex_lock(&vmdev->vfio_mdev_lock);
> +	vmdev->refcnt--;
> +	if (!vmdev->refcnt) {
> +		memset(&vmdev->vfio_region_info, 0,
> +			sizeof(vmdev->vfio_region_info));
> +	}
> +	mutex_unlock(&vmdev->vfio_mdev_lock);
> +	module_put(THIS_MODULE);
> +}
> +
> +static u8 mpci_find_pci_capability(struct mdev_device *mdev, u8 capability)
> +{
> +	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);

This creates a fixed ABI between vfio-mdev-pci and vendor drivers that
a given region starts at a pre-defined offset.  We have the offset
stored in vfio_mdev.region_info[VFIO_PCI_CONFIG_REGION_INDEX].offset,
use it.  It's just as unacceptable to impose this fixed relationship
with a vendor driver here as if a userspace driver were to do the same.

> +	struct parent_device *parent = mdev->parent;
> +	u16 status;
> +	u8  cap_ptr, cap_id = 0xff;
> +
> +	parent->ops->read(mdev, (char *)&status, sizeof(status),
> +			  pos + PCI_STATUS);
> +	if (!(status & PCI_STATUS_CAP_LIST))
> +		return 0;
> +
> +	parent->ops->read(mdev, &cap_ptr, sizeof(cap_ptr),
> +			  pos + PCI_CAPABILITY_LIST);
> +
> +	do {
> +		cap_ptr &= 0xfc;
> +		parent->ops->read(mdev, &cap_id, sizeof(cap_id),
> +				  pos + cap_ptr + PCI_CAP_LIST_ID);
> +		if (cap_id == capability)
> +			return cap_ptr;
> +		parent->ops->read(mdev, &cap_ptr, sizeof(cap_ptr),
> +				  pos + cap_ptr + PCI_CAP_LIST_NEXT);
> +	} while (cap_ptr && cap_id != 0xff);
> +
> +	return 0;
> +}
> +
> +static int mpci_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
> +{
> +	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct parent_device *parent = mdev->parent;
> +
> +	if (irq_type == VFIO_PCI_INTX_IRQ_INDEX) {
> +		u8 pin;
> +
> +		parent->ops->read(mdev, &pin, sizeof(pin),
> +				  pos + PCI_INTERRUPT_PIN);
> +		if (IS_ENABLED(CONFIG_VFIO_PCI_INTX) && pin)
> +			return 1;
> +
> +	} else if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
> +		u8 cap_ptr;
> +		u16 flags;
> +
> +		cap_ptr = mpci_find_pci_capability(mdev, PCI_CAP_ID_MSI);
> +		if (cap_ptr) {
> +			parent->ops->read(mdev, (char *)&flags, sizeof(flags),
> +					pos + cap_ptr + PCI_MSI_FLAGS);
> +			return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
> +		}
> +	} else if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX) {
> +		u8 cap_ptr;
> +		u16 flags;
> +
> +		cap_ptr = mpci_find_pci_capability(mdev, PCI_CAP_ID_MSIX);
> +		if (cap_ptr) {
> +			parent->ops->read(mdev, (char *)&flags, sizeof(flags),
> +					pos + cap_ptr + PCI_MSIX_FLAGS);
> +
> +			return (flags & PCI_MSIX_FLAGS_QSIZE) + 1;
> +		}
> +	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
> +		u8 cap_ptr;
> +
> +		cap_ptr = mpci_find_pci_capability(mdev, PCI_CAP_ID_EXP);
> +		if (cap_ptr)
> +			return 1;
> +	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
> +		return 1;
> +	}

Much better than previous versions, but use the region_info provided by
the vendor driver.  Maybe you want helpers such as
mpci_config_{read,write}{b,w,l}.

> +
> +	return 0;
> +}
> +
> +static long vfio_mpci_unlocked_ioctl(void *device_data,
> +				     unsigned int cmd, unsigned long arg)
> +{
> +	int ret = 0;
> +	struct vfio_mdev *vmdev = device_data;
> +	unsigned long minsz;
> +
> +	switch (cmd) {
> +	case VFIO_DEVICE_GET_INFO:
> +	{
> +		struct vfio_device_info info;
> +		struct parent_device *parent = vmdev->mdev->parent;
> +
> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		info.flags = VFIO_DEVICE_FLAGS_PCI;
> +
> +		if (parent->ops->reset)
> +			info.flags |= VFIO_DEVICE_FLAGS_RESET;
> +
> +		info.num_regions = VFIO_PCI_NUM_REGIONS;
> +		info.num_irqs = VFIO_PCI_NUM_IRQS;
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_DEVICE_GET_REGION_INFO:
> +	{
> +		struct vfio_region_info info;
> +
> +		minsz = offsetofend(struct vfio_region_info, offset);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		switch (info.index) {
> +		case VFIO_PCI_CONFIG_REGION_INDEX:
> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);

No, vmdev->vfio_region_info[info.index].offset

> +			info.size = vmdev->vfio_region_info[info.index].size;
> +			if (!info.size) {
> +				info.flags = 0;
> +				break;
> +			}
> +
> +			info.flags = vmdev->vfio_region_info[info.index].flags;
> +			break;
> +		case VFIO_PCI_VGA_REGION_INDEX:
> +		case VFIO_PCI_ROM_REGION_INDEX:

Why?  Let the vendor driver decide.

> +		default:
> +			return -EINVAL;
> +		}
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_DEVICE_GET_IRQ_INFO:
> +	{
> +		struct vfio_irq_info info;
> +
> +		minsz = offsetofend(struct vfio_irq_info, count);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
> +			return -EINVAL;
> +
> +		switch (info.index) {
> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
> +		case VFIO_PCI_REQ_IRQ_INDEX:
> +			break;
> +			/* pass thru to return error */
> +		case VFIO_PCI_MSIX_IRQ_INDEX:

???

> +		default:
> +			return -EINVAL;
> +		}
> +
> +		info.flags = VFIO_IRQ_INFO_EVENTFD;
> +		info.count = mpci_get_irq_count(vmdev, info.index);
> +
> +		if (info.count == -1)
> +			return -EINVAL;
> +
> +		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
> +			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
> +					VFIO_IRQ_INFO_AUTOMASKED);
> +		else
> +			info.flags |= VFIO_IRQ_INFO_NORESIZE;
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_DEVICE_SET_IRQS:
> +	{
> +		struct vfio_irq_set hdr;
> +		struct mdev_device *mdev = vmdev->mdev;
> +		struct parent_device *parent = mdev->parent;
> +		u8 *data = NULL, *ptr = NULL;
> +
> +		minsz = offsetofend(struct vfio_irq_set, count);
> +
> +		if (copy_from_user(&hdr, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
> +		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
> +				  VFIO_IRQ_SET_ACTION_TYPE_MASK))
> +			return -EINVAL;
> +
> +		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
> +			size_t size;
> +			int max = mpci_get_irq_count(vmdev, hdr.index);
> +
> +			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
> +				size = sizeof(uint8_t);
> +			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
> +				size = sizeof(int32_t);
> +			else
> +				return -EINVAL;
> +
> +			if (hdr.argsz - minsz < hdr.count * size ||
> +			    hdr.start >= max || hdr.start + hdr.count > max)
> +				return -EINVAL;
> +
> +			ptr = data = memdup_user((void __user *)(arg + minsz),
> +						 hdr.count * size);
> +			if (IS_ERR(data))
> +				return PTR_ERR(data);
> +		}
> +
> +		if (parent->ops->set_irqs)
> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
> +						    hdr.start, hdr.count, data);
> +
> +		kfree(ptr);
> +		return ret;

Return success if no set_irqs callback?

> +	}
> +	case VFIO_DEVICE_RESET:
> +	{
> +		struct parent_device *parent = vmdev->mdev->parent;
> +
> +		if (parent->ops->reset)
> +			return parent->ops->reset(vmdev->mdev);
> +
> +		return -EINVAL;
> +	}
> +	}
> +	return -ENOTTY;
> +}
> +
> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
> +			      size_t count, loff_t *ppos)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	if (!count)
> +		return 0;
> +
> +	if (parent->ops->read) {
> +		char *ret_data, *ptr;
> +
> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);

Do we really need to support arbitrary lengths in one shot?  Seems like
we could just use a 4 or 8 byte variable on the stack and iterate until
done.

> +
> +		if (!ret_data)
> +			return  -ENOMEM;
> +
> +		ret = parent->ops->read(mdev, ret_data, count, *ppos);
> +
> +		if (ret > 0) {
> +			if (copy_to_user(buf, ret_data, ret))
> +				ret = -EFAULT;
> +			else
> +				*ppos += ret;
> +		}
> +		kfree(ptr);
> +	}
> +
> +	return ret;
> +}
> +
> +static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
> +			       size_t count, loff_t *ppos)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	if (!count)
> +		return 0;
> +
> +	if (parent->ops->write) {
> +		char *usr_data, *ptr;
> +
> +		ptr = usr_data = memdup_user(buf, count);

Same here, how much do we care to let the user write in one pass and is
there any advantage to it?  When QEMU is our userspace we're only
likely to see 4-byte accesses anyway.

> +		if (IS_ERR(usr_data))
> +			return PTR_ERR(usr_data);
> +
> +		ret = parent->ops->write(mdev, usr_data, count, *ppos);
> +
> +		if (ret > 0)
> +			*ppos += ret;
> +
> +		kfree(ptr);
> +	}
> +
> +	return ret;
> +}
> +
> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	int ret;
> +	struct vfio_mdev *vmdev = vma->vm_private_data;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	u64 virtaddr = (u64)vmf->virtual_address;
> +	unsigned long req_size, pgoff = 0;
> +	pgprot_t pg_prot;
> +	unsigned int index;
> +
> +	if (!vmdev || !vmdev->mdev)
> +		return -EINVAL;
> +
> +	mdev = vmdev->mdev;
> +	parent  = mdev->parent;
> +
> +	pg_prot  = vma->vm_page_prot;
> +
> +	if (parent->ops->validate_map_request) {
> +		u64 offset;
> +		loff_t pos;
> +
> +		offset   = virtaddr - vma->vm_start;
> +		req_size = vma->vm_end - virtaddr;
> +		pos = (vma->vm_pgoff << PAGE_SHIFT) + offset;
> +
> +		ret = parent->ops->validate_map_request(mdev, pos, &virtaddr,
> +						&pgoff, &req_size, &pg_prot);
> +		if (ret)
> +			return ret;
> +
> +		/*
> +		 * Verify pgoff and req_size are valid and virtaddr is within
> +		 * vma range
> +		 */
> +		if (!pgoff || !req_size || (virtaddr < vma->vm_start) ||
> +		    ((virtaddr + req_size) >= vma->vm_end))
> +			return -EINVAL;
> +	} else {
> +		struct pci_dev *pdev;
> +
> +		virtaddr = vma->vm_start;
> +		req_size = vma->vm_end - vma->vm_start;
> +
> +		pdev = to_pci_dev(parent->dev);
> +		index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);

Iterate through region_info[*].offset/size provided by vendor driver.

> +		pgoff = pci_resource_start(pdev, index) >> PAGE_SHIFT;
> +	}
> +
> +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
> +
> +	return ret | VM_FAULT_NOPAGE;
> +}
> +
> +void mdev_dev_mmio_close(struct vm_area_struct *vma)
> +{
> +	struct vfio_mdev *vmdev = vma->vm_private_data;
> +	struct mdev_device *mdev = vmdev->mdev;
> +
> +	mdev_del_phys_mapping(mdev, vma->vm_pgoff << PAGE_SHIFT);
> +}
> +
> +static const struct vm_operations_struct mdev_dev_mmio_ops = {
> +	.fault = mdev_dev_mmio_fault,
> +	.close = mdev_dev_mmio_close,
> +};
> +
> +static int vfio_mpci_mmap(void *device_data, struct vm_area_struct *vma)
> +{
> +	unsigned int index;
> +	struct vfio_mdev *vmdev = device_data;
> +	struct mdev_device *mdev = vmdev->mdev;
> +
> +	index = vma->vm_pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> +
> +	if (index >= VFIO_PCI_ROM_REGION_INDEX)
> +		return -EINVAL;
> +
> +	vma->vm_private_data = vmdev;
> +	vma->vm_ops = &mdev_dev_mmio_ops;
> +
> +	return mdev_add_phys_mapping(mdev, vma->vm_file->f_mapping,
> +				     vma->vm_pgoff << PAGE_SHIFT,
> +				     vma->vm_end - vma->vm_start);
> +}
> +
> +static const struct vfio_device_ops vfio_mpci_dev_ops = {
> +	.name		= "vfio-mpci",
> +	.open		= vfio_mpci_open,
> +	.release	= vfio_mpci_close,
> +	.ioctl		= vfio_mpci_unlocked_ioctl,
> +	.read		= vfio_mpci_read,
> +	.write		= vfio_mpci_write,
> +	.mmap		= vfio_mpci_mmap,
> +};
> +
> +int vfio_mpci_probe(struct device *dev)
> +{
> +	struct vfio_mdev *vmdev;
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
> +	if (IS_ERR(vmdev))
> +		return PTR_ERR(vmdev);
> +
> +	vmdev->mdev = mdev_get_device(mdev);
> +	vmdev->group = mdev->group;
> +	mutex_init(&vmdev->vfio_mdev_lock);
> +
> +	ret = vfio_add_group_dev(dev, &vfio_mpci_dev_ops, vmdev);
> +	if (ret)
> +		kfree(vmdev);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +}
> +
> +void vfio_mpci_remove(struct device *dev)
> +{
> +	struct vfio_mdev *vmdev;
> +
> +	vmdev = vfio_del_group_dev(dev);
> +	kfree(vmdev);
> +}
> +
> +int vfio_mpci_match(struct device *dev)
> +{
> +	if (dev_is_pci(dev->parent))

This is the wrong test, there's really no requirement that a pci mdev
device is hosted by a real pci device.  Can't we check that the device
is on an mdev_pci_bus_type?

> +		return 1;
> +
> +	return 0;
> +}
> +
> +struct mdev_driver vfio_mpci_driver = {
> +	.name	= "vfio_mpci",
> +	.probe	= vfio_mpci_probe,
> +	.remove	= vfio_mpci_remove,
> +	.match	= vfio_mpci_match,
> +};
> +
> +static int __init vfio_mpci_init(void)
> +{
> +	return mdev_register_driver(&vfio_mpci_driver, THIS_MODULE);
> +}
> +
> +static void __exit vfio_mpci_exit(void)
> +{
> +	mdev_unregister_driver(&vfio_mpci_driver);
> +}
> +
> +module_init(vfio_mpci_init)
> +module_exit(vfio_mpci_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 8a7d546d18a0..04a450908ffb 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -19,12 +19,6 @@
>  #ifndef VFIO_PCI_PRIVATE_H
>  #define VFIO_PCI_PRIVATE_H
>  
> -#define VFIO_PCI_OFFSET_SHIFT   40
> -
> -#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
> -#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> -#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> -
>  /* Special capability IDs predefined access */
>  #define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
>  #define PCI_CAP_ID_INVALID_VIRT		0xFE	/* default virt access */
> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> index 5ffd1d9ad4bd..5b912be9d9c3 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -18,6 +18,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/io.h>
>  #include <linux/vgaarb.h>
> +#include <linux/vfio.h>
>  
>  #include "vfio_pci_private.h"
>  
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 0ecae0b1cd34..431b824b0d3e 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -18,6 +18,13 @@
>  #include <linux/poll.h>
>  #include <uapi/linux/vfio.h>
>  
> +#define VFIO_PCI_OFFSET_SHIFT   40
> +
> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> +
> +

Nak this, I'm not interested in making this any sort of ABI.

>  /**
>   * struct vfio_device_ops - VFIO bus driver device callbacks
>   *

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 3/4] vfio iommu: Add support for mediated devices
  2016-08-03 19:03   ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-09 19:00     ` Alex Williamson
  -1 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-09 19:00 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Thu, 4 Aug 2016 00:33:53 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
> Mediated device only uses IOMMU APIs, the underlying hardware can be
> managed by an IOMMU domain.
> 
> Aim of this change is:
> - To use most of the code of TYPE1 IOMMU driver for mediated devices
> - To support direct assigned device and mediated device in single module
> 
> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
> IOMMU module that supports pinning and unpinning pages for mdev devices
> should provide these functions.
> Added APIs for pinning and unpinning pages to the VFIO module. These call
> back into the backend iommu module to actually pin and unpin pages.
> 
> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
> backend module. More details:
> - When iommu_group of mediated devices is attached, task structure is
>   cached which is used later to pin pages and page accounting.
> - It keeps track of pinned pages for mediated domain. This data is used to
>   verify unpinning request and to unpin remaining pages while detaching, if
>   there are any.
> - Used existing mechanism for page accounting. If iommu capable domain
>   exist in the container then all pages are already pinned and accounted.
>   Accounting for mdev device is only done if there is no iommu capable
>   domain in the container.
> 
> Tested by assigning below combinations of devices to a single VM:
> - GPU pass through only
> - vGPU device only
> - One GPU pass through and one vGPU device
> - two GPU pass through
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
> ---
>  drivers/vfio/vfio.c             |  82 +++++++
>  drivers/vfio/vfio_iommu_type1.c | 499 ++++++++++++++++++++++++++++++++++++----
>  include/linux/vfio.h            |  13 +-
>  3 files changed, 546 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 6fd6fa5469de..1f87e3a30d24 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1782,6 +1782,88 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
>  }
>  EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
>  
> +/*
> + * Pin a set of guest PFNs and return their associated host PFNs for mediated
> + * domain only.

Why only mediated domain?  What assumption is specific to a mediated
domain other than unnecessarily passing an mdev_device?

> + * @user_pfn [in]: array of user/guest PFNs
> + * @npage [in]: count of array elements
> + * @prot [in] : protection flags
> + * @phys_pfn[out] : array of host PFNs
> + */
> +long vfio_pin_pages(struct mdev_device *mdev, unsigned long *user_pfn,

Why use an mdev_device here?  We only reference the struct device to
get the drvdata.  (dev also not listed above in param description)

> +		    long npage, int prot, unsigned long *phys_pfn)
> +{
> +	struct vfio_device *device;
> +	struct vfio_container *container;
> +	struct vfio_iommu_driver *driver;
> +	ssize_t ret = -EINVAL;
> +
> +	if (!mdev || !user_pfn || !phys_pfn)
> +		return -EINVAL;
> +
> +	device = dev_get_drvdata(&mdev->dev);
> +
> +	if (!device || !device->group)
> +		return -EINVAL;
> +
> +	container = device->group->container;

This doesn't seem like a valid way to get a reference to the container
and in fact there is no reference at all.  I think you need to use
vfio_device_get_from_dev(), check and increment container_users around
the callback, abort on noiommu groups, and check for viability.

> +
> +	if (!container)
> +		return -EINVAL;
> +
> +	down_read(&container->group_lock);
> +
> +	driver = container->iommu_driver;
> +	if (likely(driver && driver->ops->pin_pages))
> +		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
> +					     npage, prot, phys_pfn);
> +
> +	up_read(&container->group_lock);
> +
> +	return ret;
> +
> +}
> +EXPORT_SYMBOL(vfio_pin_pages);
> +
> +/*
> + * Unpin set of host PFNs for mediated domain only.
> + * @pfn [in] : array of host PFNs to be unpinned.
> + * @npage [in] :count of elements in array, that is number of pages.
> + */
> +long vfio_unpin_pages(struct mdev_device *mdev, unsigned long *pfn, long npage)
> +{
> +	struct vfio_device *device;
> +	struct vfio_container *container;
> +	struct vfio_iommu_driver *driver;
> +	ssize_t ret = -EINVAL;
> +
> +	if (!mdev || !pfn)
> +		return -EINVAL;
> +
> +	device = dev_get_drvdata(&mdev->dev);
> +
> +	if (!device || !device->group)
> +		return -EINVAL;
> +
> +	container = device->group->container;
> +
> +	if (!container)
> +		return -EINVAL;
> +
> +	down_read(&container->group_lock);
> +
> +	driver = container->iommu_driver;
> +	if (likely(driver && driver->ops->unpin_pages))
> +		ret = driver->ops->unpin_pages(container->iommu_data, pfn,
> +					       npage);
> +
> +	up_read(&container->group_lock);
> +
> +	return ret;
> +
> +}
> +EXPORT_SYMBOL(vfio_unpin_pages);
> +
>  /**
>   * Module/class support
>   */
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 75b24e93cedb..1f4e24e0debd 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -55,18 +55,26 @@ MODULE_PARM_DESC(disable_hugepages,
>  
>  struct vfio_iommu {
>  	struct list_head	domain_list;
> +	struct vfio_domain	*mediated_domain;

s/mediated/local/?

>  	struct mutex		lock;
>  	struct rb_root		dma_list;
>  	bool			v2;
>  	bool			nesting;
>  };
>  
> +struct mdev_addr_space {
> +	struct task_struct	*task;
> +	struct rb_root		pfn_list;	/* pinned Host pfn list */
> +	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
> +};

s/mdev/local/?

> +
>  struct vfio_domain {
>  	struct iommu_domain	*domain;
>  	struct list_head	next;
>  	struct list_head	group_list;
>  	int			prot;		/* IOMMU_CACHE */
>  	bool			fgsp;		/* Fine-grained super pages */
> +	struct mdev_addr_space	*mdev_addr_space;
>  };

Mediated devices are who this is for, but the name doesn't really
describe what it does.  Perhaps we can just use "local" to describe
the local page tracking.

>  
>  struct vfio_dma {
> @@ -83,6 +91,22 @@ struct vfio_group {
>  };
>  
>  /*
> + * Guest RAM pinning working set or DMA target
> + */
> +struct vfio_pfn {
> +	struct rb_node		node;
> +	unsigned long		vaddr;		/* virtual addr */
> +	dma_addr_t		iova;		/* IOVA */
> +	unsigned long		pfn;		/* Host pfn */
> +	size_t			prot;
> +	atomic_t		ref_count;
> +};
> +
> +
> +#define IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)	\
> +			 (list_empty(&iommu->domain_list) ? false : true)
> +
> +/*
>   * This code handles mapping and unmapping of user data buffers
>   * into DMA'ble space using the IOMMU
>   */
> @@ -130,6 +154,84 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>  	rb_erase(&old->node, &iommu->dma_list);
>  }
>  
> +/*
> + * Helper Functions for host pfn list
> + */
> +
> +static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
> +				      unsigned long pfn)
> +{
> +	struct rb_node *node;
> +	struct vfio_pfn *vpfn, *ret = NULL;
> +
> +	node = domain->mdev_addr_space->pfn_list.rb_node;
> +
> +	while (node) {
> +		vpfn = rb_entry(node, struct vfio_pfn, node);
> +
> +		if (pfn < vpfn->pfn)
> +			node = node->rb_left;
> +		else if (pfn > vpfn->pfn)
> +			node = node->rb_right;
> +		else {
> +			ret = vpfn;
> +			break;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
> +static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
> +{
> +	struct rb_node **link, *parent = NULL;
> +	struct vfio_pfn *vpfn;
> +
> +	link = &domain->mdev_addr_space->pfn_list.rb_node;
> +	while (*link) {
> +		parent = *link;
> +		vpfn = rb_entry(parent, struct vfio_pfn, node);
> +
> +		if (new->pfn < vpfn->pfn)
> +			link = &(*link)->rb_left;
> +		else
> +			link = &(*link)->rb_right;
> +	}
> +
> +	rb_link_node(&new->node, parent, link);
> +	rb_insert_color(&new->node, &domain->mdev_addr_space->pfn_list);
> +}
> +
> +static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
> +{
> +	rb_erase(&old->node, &domain->mdev_addr_space->pfn_list);
> +}
> +
> +static int vfio_add_to_pfn_list(struct vfio_domain *domain, unsigned long vaddr,
> +				dma_addr_t iova, unsigned long pfn, size_t prot)
> +{
> +	struct vfio_pfn *vpfn;
> +
> +	vpfn = kzalloc(sizeof(*vpfn), GFP_KERNEL);
> +	if (!vpfn)
> +		return -ENOMEM;
> +
> +	vpfn->vaddr = vaddr;
> +	vpfn->iova = iova;
> +	vpfn->pfn = pfn;
> +	vpfn->prot = prot;
> +	atomic_set(&vpfn->ref_count, 1);
> +	vfio_link_pfn(domain, vpfn);
> +	return 0;
> +}
> +
> +static void vfio_remove_from_pfn_list(struct vfio_domain *domain,
> +				      struct vfio_pfn *vpfn)
> +{
> +	vfio_unlink_pfn(domain, vpfn);
> +	kfree(vpfn);
> +}
> +
>  struct vwork {
>  	struct mm_struct	*mm;
>  	long			npage;
> @@ -150,17 +252,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
>  	kfree(vwork);
>  }
>  
> -static void vfio_lock_acct(long npage)
> +static void vfio_lock_acct(struct task_struct *task, long npage)
>  {
>  	struct vwork *vwork;
>  	struct mm_struct *mm;
>  
> -	if (!current->mm || !npage)
> +	if (!task->mm || !npage)
>  		return; /* process exited or nothing to do */
>  
> -	if (down_write_trylock(&current->mm->mmap_sem)) {
> -		current->mm->locked_vm += npage;
> -		up_write(&current->mm->mmap_sem);
> +	if (down_write_trylock(&task->mm->mmap_sem)) {
> +		task->mm->locked_vm += npage;
> +		up_write(&task->mm->mmap_sem);
>  		return;
>  	}
>  
> @@ -172,7 +274,7 @@ static void vfio_lock_acct(long npage)
>  	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
>  	if (!vwork)
>  		return;
> -	mm = get_task_mm(current);
> +	mm = get_task_mm(task);
>  	if (!mm) {
>  		kfree(vwork);
>  		return;
> @@ -228,20 +330,31 @@ static int put_pfn(unsigned long pfn, int prot)
>  	return 0;
>  }
>  
> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> +			 int prot, unsigned long *pfn)
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct mm_struct *local_mm = mm ? mm : current->mm;

Some parens would be nice here, local_mm = (mm ? mm : current->mm);

>  	int ret = -EFAULT;
>  
> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> +	if (mm) {
> +		down_read(&local_mm->mmap_sem);
> +		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
> +					!!(prot & IOMMU_WRITE), 0, page, NULL);
> +		up_read(&local_mm->mmap_sem);
> +	} else
> +		ret = get_user_pages_fast(vaddr, 1,
> +					  !!(prot & IOMMU_WRITE), page);
> +
> +	if (ret == 1) {
>  		*pfn = page_to_pfn(page[0]);
>  		return 0;
>  	}
>  
> -	down_read(&current->mm->mmap_sem);
> +	down_read(&local_mm->mmap_sem);
>  
> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
>  
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> @@ -249,7 +362,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>  			ret = 0;
>  	}
>  
> -	up_read(&current->mm->mmap_sem);
> +	up_read(&local_mm->mmap_sem);
>  
>  	return ret;
>  }
> @@ -259,8 +372,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>   * first page and all consecutive pages with the same locking.
>   */
> -static long vfio_pin_pages(unsigned long vaddr, long npage,
> -			   int prot, unsigned long *pfn_base)
> +static long __vfio_pin_pages(unsigned long vaddr, long npage,
> +			     int prot, unsigned long *pfn_base)

This is meant to handle the existing case of page tracking in the IOMMU
API, perhaps __vfio_pin_pages_remote()?

>  {
>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  	bool lock_cap = capable(CAP_IPC_LOCK);
> @@ -270,7 +383,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	if (!current->mm)
>  		return -ENODEV;
>  
> -	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> +	ret = vaddr_get_pfn(NULL, vaddr, prot, pfn_base);
>  	if (ret)
>  		return ret;
>  
> @@ -285,7 +398,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  
>  	if (unlikely(disable_hugepages)) {
>  		if (!rsvd)
> -			vfio_lock_acct(1);
> +			vfio_lock_acct(current, 1);
>  		return 1;
>  	}
>  
> @@ -293,7 +406,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
>  		unsigned long pfn = 0;
>  
> -		ret = vaddr_get_pfn(vaddr, prot, &pfn);
> +		ret = vaddr_get_pfn(NULL, vaddr, prot, &pfn);
>  		if (ret)
>  			break;
>  
> @@ -313,13 +426,13 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	}
>  
>  	if (!rsvd)
> -		vfio_lock_acct(i);
> +		vfio_lock_acct(current, i);
>  
>  	return i;
>  }
>  
> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> -			     int prot, bool do_accounting)
> +static long __vfio_unpin_pages(unsigned long pfn, long npage, int prot,
> +			       bool do_accounting)

__vfio_unpin_pages_remote()?

>  {
>  	unsigned long unlocked = 0;
>  	long i;
> @@ -328,7 +441,188 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
>  		unlocked += put_pfn(pfn++, prot);
>  
>  	if (do_accounting)
> -		vfio_lock_acct(-unlocked);
> +		vfio_lock_acct(current, -unlocked);
> +	return unlocked;
> +}
> +
> +static long __vfio_pin_pages_for_mdev(struct vfio_domain *domain,

Only seems to support pinning a single page, perhaps
__vfio_pin_page_local()

> +				      unsigned long vaddr, int prot,
> +				      unsigned long *pfn_base,
> +				      bool do_accounting)
> +{
> +	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	bool lock_cap = capable(CAP_IPC_LOCK);
> +	long ret;
> +	bool rsvd;
> +	struct task_struct *task = domain->mdev_addr_space->task;
> +
> +	if (!task->mm)
> +		return -ENODEV;
> +
> +	ret = vaddr_get_pfn(task->mm, vaddr, prot, pfn_base);
> +	if (ret)
> +		return ret;
> +
> +	rsvd = is_invalid_reserved_pfn(*pfn_base);
> +
> +	if (!rsvd && !lock_cap && task->mm->locked_vm + 1 > limit) {
> +		put_pfn(*pfn_base, prot);
> +		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
> +			limit << PAGE_SHIFT);
> +		return -ENOMEM;
> +	}
> +
> +	if (!rsvd && do_accounting)
> +		vfio_lock_acct(task, 1);
> +
> +	return 1;
> +}
> +
> +static void __vfio_unpin_pages_for_mdev(struct vfio_domain *domain,
> +					unsigned long pfn, int prot,
> +					bool do_accounting)
> +{
> +	put_pfn(pfn, prot);
> +
> +	if (do_accounting)
> +		vfio_lock_acct(domain->mdev_addr_space->task, -1);
> +}
> +
> +static int vfio_unpin_pfn(struct vfio_domain *domain,
> +			  struct vfio_pfn *vpfn, bool do_accounting)
> +{
> +	__vfio_unpin_pages_for_mdev(domain, vpfn->pfn, vpfn->prot,
> +				    do_accounting);
> +
> +	if (atomic_dec_and_test(&vpfn->ref_count))
> +		vfio_remove_from_pfn_list(domain, vpfn);
> +
> +	return 1;
> +}
> +
> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
> +				       unsigned long *user_pfn,
> +				       long npage, int prot,
> +				       unsigned long *phys_pfn)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain;
> +	int i, j, ret;
> +	long retpage;
> +	unsigned long remote_vaddr;
> +	unsigned long *pfn = phys_pfn;
> +	struct vfio_dma *dma;
> +	bool do_accounting = false;
> +
> +	if (!iommu || !user_pfn || !phys_pfn)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	if (!iommu->mediated_domain) {
> +		ret = -EINVAL;
> +		goto pin_done;
> +	}
> +
> +	domain = iommu->mediated_domain;
> +
> +	/*
> +	 * If an iommu-capable domain exists in the container then all pages
> +	 * are already pinned and accounted. Accounting should only be done
> +	 * if there is no iommu-capable domain in the container.
> +	 */
> +	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +		dma_addr_t iova;
> +
> +		iova = user_pfn[i] << PAGE_SHIFT;
> +
> +		dma = vfio_find_dma(iommu, iova, 0);
> +		if (!dma) {
> +			ret = -EINVAL;
> +			goto pin_unwind;
> +		}
> +
> +		remote_vaddr = dma->vaddr + iova - dma->iova;
> +
> +		retpage = __vfio_pin_pages_for_mdev(domain, remote_vaddr, prot,
> +						    &pfn[i], do_accounting);
> +		if (retpage <= 0) {
> +			WARN_ON(!retpage);
> +			ret = (int)retpage;
> +			goto pin_unwind;
> +		}
> +
> +		mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
> +
> +		/* search if pfn exist */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p) {
> +			atomic_inc(&p->ref_count);
> +			mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
> +			continue;
> +		}
> +
> +		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
> +					   pfn[i], prot);
> +		mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
> +
> +		if (ret) {
> +			__vfio_unpin_pages_for_mdev(domain, pfn[i], prot,
> +						    do_accounting);
> +			goto pin_unwind;
> +		}
> +	}
> +
> +	ret = i;
> +	goto pin_done;
> +
> +pin_unwind:
> +	pfn[i] = 0;
> +	mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
> +	for (j = 0; j < i; j++) {
> +		struct vfio_pfn *p;
> +
> +		p = vfio_find_pfn(domain, pfn[j]);
> +		if (p)
> +			vfio_unpin_pfn(domain, p, do_accounting);
> +
> +		pfn[j] = 0;
> +	}
> +	mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
> +
> +pin_done:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
> +					 long npage)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL;
> +	long unlocked = 0;
> +	int i;
> +
> +	if (!iommu || !pfn)
> +		return -EINVAL;
> +
> +	domain = iommu->mediated_domain;
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +
> +		mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
> +
> +		/* verify if pfn exist in pfn_list */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p)
> +			unlocked += vfio_unpin_pfn(domain, p, true);
> +
> +		mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
> +	}
>  
>  	return unlocked;
>  }
> @@ -341,6 +635,9 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  
>  	if (!dma->size)
>  		return;
> +
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		return;
>  	/*
>  	 * We use the IOMMU to track the physical addresses, otherwise we'd
>  	 * need a much more complicated tracking system.  Unfortunately that
> @@ -382,15 +679,15 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  		if (WARN_ON(!unmapped))
>  			break;
>  
> -		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
> -					     unmapped >> PAGE_SHIFT,
> -					     dma->prot, false);
> +		unlocked += __vfio_unpin_pages(phys >> PAGE_SHIFT,
> +					       unmapped >> PAGE_SHIFT,
> +					       dma->prot, false);
>  		iova += unmapped;
>  
>  		cond_resched();
>  	}
>  
> -	vfio_lock_acct(-unlocked);
> +	vfio_lock_acct(current, -unlocked);
>  }
>  
>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> @@ -611,10 +908,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	/* Insert zero-sized and grow as we map chunks of it */
>  	vfio_link_dma(iommu, dma);
>  
> +	/* Don't pin and map if container doesn't contain IOMMU capable domain*/
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)) {
> +		dma->size = size;
> +		goto map_done;
> +	}
> +
>  	while (size) {
>  		/* Pin a contiguous chunk of memory */
> -		npage = vfio_pin_pages(vaddr + dma->size,
> -				       size >> PAGE_SHIFT, prot, &pfn);
> +		npage = __vfio_pin_pages(vaddr + dma->size,
> +					 size >> PAGE_SHIFT, prot, &pfn);
>  		if (npage <= 0) {
>  			WARN_ON(!npage);
>  			ret = (int)npage;
> @@ -624,7 +927,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  		/* Map it! */
>  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
>  		if (ret) {
> -			vfio_unpin_pages(pfn, npage, prot, true);
> +			__vfio_unpin_pages(pfn, npage, prot, true);
>  			break;
>  		}
>  
> @@ -635,6 +938,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	if (ret)
>  		vfio_remove_dma(iommu, dma);
>  
> +map_done:
>  	mutex_unlock(&iommu->lock);
>  	return ret;
>  }
> @@ -734,11 +1038,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>  	__free_pages(pages, order);
>  }
>  
> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
> +				   struct iommu_group *iommu_group)
> +{
> +	struct vfio_group *g;
> +
> +	list_for_each_entry(g, &domain->group_list, next) {
> +		if (g->iommu_group == iommu_group)
> +			return g;
> +	}
> +
> +	return NULL;
> +}
> +
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					 struct iommu_group *iommu_group)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
> -	struct vfio_group *group, *g;
> +	struct vfio_group *group;
>  	struct vfio_domain *domain, *d;
>  	struct bus_type *bus = NULL;
>  	int ret;
> @@ -746,10 +1063,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	mutex_lock(&iommu->lock);
>  
>  	list_for_each_entry(d, &iommu->domain_list, next) {
> -		list_for_each_entry(g, &d->group_list, next) {
> -			if (g->iommu_group != iommu_group)
> -				continue;
> +		if (find_iommu_group(d, iommu_group)) {
> +			mutex_unlock(&iommu->lock);
> +			return -EINVAL;
> +		}
> +	}
>  
> +	if (iommu->mediated_domain) {
> +		if (find_iommu_group(iommu->mediated_domain, iommu_group)) {
>  			mutex_unlock(&iommu->lock);
>  			return -EINVAL;
>  		}
> @@ -769,6 +1090,34 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_free;
>  
> +#if defined(CONFIG_VFIO_MDEV) || defined(CONFIG_VFIO_MDEV_MODULE)
> +	if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
> +		if (iommu->mediated_domain) {
> +			list_add(&group->next,
> +				 &iommu->mediated_domain->group_list);
> +			kfree(domain);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +		}
> +
> +		domain->mdev_addr_space = kzalloc(sizeof(*domain->mdev_addr_space),
> +						  GFP_KERNEL);
> +		if (!domain->mdev_addr_space) {
> +			ret = -ENOMEM;
> +			goto out_free;
> +		}
> +
> +		domain->mdev_addr_space->task = current;
> +		INIT_LIST_HEAD(&domain->group_list);
> +		list_add(&group->next, &domain->group_list);
> +		domain->mdev_addr_space->pfn_list = RB_ROOT;
> +		mutex_init(&domain->mdev_addr_space->pfn_list_lock);
> +		iommu->mediated_domain = domain;
> +		mutex_unlock(&iommu->lock);
> +		return 0;
> +	}
> +#endif
> +
>  	domain->domain = iommu_domain_alloc(bus);
>  	if (!domain->domain) {
>  		ret = -EIO;
> @@ -859,6 +1208,18 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
>  		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
>  }
>  
> +static void vfio_mdev_unpin_all(struct vfio_domain *domain)
> +{
> +	struct rb_node *node;
> +
> +	mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
> +	while ((node = rb_first(&domain->mdev_addr_space->pfn_list))) {
> +		vfio_unpin_pfn(domain,
> +				rb_entry(node, struct vfio_pfn, node), false);
> +	}
> +	mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
> +}
> +
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
>  					  struct iommu_group *iommu_group)
>  {
> @@ -868,31 +1229,52 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  
>  	mutex_lock(&iommu->lock);
>  
> -	list_for_each_entry(domain, &iommu->domain_list, next) {
> -		list_for_each_entry(group, &domain->group_list, next) {
> -			if (group->iommu_group != iommu_group)
> -				continue;
> +	if (iommu->mediated_domain) {
> +		domain = iommu->mediated_domain;
> +		group = find_iommu_group(domain, iommu_group);
> +		if (group) {
> +			list_del(&group->next);
> +			kfree(group);
>  
> +			if (list_empty(&domain->group_list)) {
> +				vfio_mdev_unpin_all(domain);
> +				if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +					vfio_iommu_unmap_unpin_all(iommu);
> +				kfree(domain);
> +				iommu->mediated_domain = NULL;
> +			}
> +			goto detach_group_done;
> +		}
> +	}
> +
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		goto detach_group_done;
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		group = find_iommu_group(domain, iommu_group);
> +		if (group) {
>  			iommu_detach_group(domain->domain, iommu_group);
>  			list_del(&group->next);
>  			kfree(group);
>  			/*
>  			 * Group ownership provides privilege, if the group
>  			 * list is empty, the domain goes away.  If it's the
> -			 * last domain, then all the mappings go away too.
> +			 * last iommu-backed domain and no mediated domain
> +			 * exists, then all the mappings go away too.
>  			 */
>  			if (list_empty(&domain->group_list)) {
> -				if (list_is_singular(&iommu->domain_list))
> +				if (list_is_singular(&iommu->domain_list) &&
> +				    !iommu->mediated_domain)
>  					vfio_iommu_unmap_unpin_all(iommu);
>  				iommu_domain_free(domain->domain);
>  				list_del(&domain->next);
>  				kfree(domain);
>  			}
> -			goto done;
> +			break;
>  		}
>  	}
>  
> -done:
> +detach_group_done:
>  	mutex_unlock(&iommu->lock);
>  }
>  
> @@ -924,27 +1306,48 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>  	return iommu;
>  }
>  
> +static void vfio_release_domain(struct vfio_domain *domain)
> +{
> +	struct vfio_group *group, *group_tmp;
> +
> +	list_for_each_entry_safe(group, group_tmp,
> +				 &domain->group_list, next) {
> +		if (!domain->mdev_addr_space)
> +			iommu_detach_group(domain->domain, group->iommu_group);
> +		list_del(&group->next);
> +		kfree(group);
> +	}
> +
> +	if (domain->mdev_addr_space)
> +		vfio_mdev_unpin_all(domain);
> +	else
> +		iommu_domain_free(domain->domain);
> +}
> +
>  static void vfio_iommu_type1_release(void *iommu_data)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
>  	struct vfio_domain *domain, *domain_tmp;
> -	struct vfio_group *group, *group_tmp;
> +
> +	if (iommu->mediated_domain) {
> +		vfio_release_domain(iommu->mediated_domain);
> +		kfree(iommu->mediated_domain);
> +		iommu->mediated_domain = NULL;
> +	}
>  
>  	vfio_iommu_unmap_unpin_all(iommu);
>  
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		goto release_exit;
> +
>  	list_for_each_entry_safe(domain, domain_tmp,
>  				 &iommu->domain_list, next) {
> -		list_for_each_entry_safe(group, group_tmp,
> -					 &domain->group_list, next) {
> -			iommu_detach_group(domain->domain, group->iommu_group);
> -			list_del(&group->next);
> -			kfree(group);
> -		}
> -		iommu_domain_free(domain->domain);
> +		vfio_release_domain(domain);
>  		list_del(&domain->next);
>  		kfree(domain);
>  	}
>  
> +release_exit:
>  	kfree(iommu);
>  }
>  
> @@ -1048,6 +1451,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>  	.ioctl		= vfio_iommu_type1_ioctl,
>  	.attach_group	= vfio_iommu_type1_attach_group,
>  	.detach_group	= vfio_iommu_type1_detach_group,
> +	.pin_pages	= vfio_iommu_type1_pin_pages,
> +	.unpin_pages	= vfio_iommu_type1_unpin_pages,
>  };
>  


I see how you're trying to only do accounting when there is only an
mdev (local) domain, but the devices attached to the normal iommu API
domain can go away at any point.  Where do we re-establish accounting
should the pinning from those devices be removed?  I don't see that as
being an optional support case since userspace can already do this.

>  static int __init vfio_iommu_type1_init(void)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 431b824b0d3e..abae882122aa 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -17,6 +17,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/poll.h>
>  #include <uapi/linux/vfio.h>
> +#include <linux/mdev.h>
>  
>  #define VFIO_PCI_OFFSET_SHIFT   40
>  
> @@ -82,7 +83,11 @@ struct vfio_iommu_driver_ops {
>  					struct iommu_group *group);
>  	void		(*detach_group)(void *iommu_data,
>  					struct iommu_group *group);
> -
> +	long		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
> +				     long npage, int prot,
> +				     unsigned long *phys_pfn);
> +	long		(*unpin_pages)(void *iommu_data, unsigned long *pfn,
> +				       long npage);
>  };
>  
>  extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
> @@ -134,6 +139,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>  }
>  #endif /* CONFIG_EEH */
>  
> +extern long vfio_pin_pages(struct mdev_device *mdev, unsigned long *user_pfn,
> +			   long npage, int prot, unsigned long *phys_pfn);
> +
> +extern long vfio_unpin_pages(struct mdev_device *mdev, unsigned long *pfn,
> +			     long npage);
> +
>  /*
>   * IRQfd - generic
>   */


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [Qemu-devel] [PATCH v6 3/4] vfio iommu: Add support for mediated devices
@ 2016-08-09 19:00     ` Alex Williamson
  0 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-09 19:00 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Thu, 4 Aug 2016 00:33:53 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VFIO IOMMU drivers are designed for devices which are IOMMU capable.
> A mediated device only uses IOMMU APIs; the underlying hardware can be
> managed by an IOMMU domain.
> 
> Aim of this change is:
> - To use most of the code of TYPE1 IOMMU driver for mediated devices
> - To support direct assigned device and mediated device in single module
> 
> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
> IOMMU modules that support pinning and unpinning pages for mdev devices
> should provide these functions.
> Added APIs for pinning and unpinning pages to the VFIO module. These call
> back into the backend iommu module to actually pin and unpin pages.
> 
> This change adds pin and unpin support for mediated devices to the TYPE1
> IOMMU backend module. More details:
> - When the iommu_group of a mediated device is attached, the task structure
>   is cached; it is used later for pinning pages and page accounting.
> - It keeps track of pinned pages for the mediated domain. This data is used
>   to verify unpinning requests and to unpin any remaining pages while
>   detaching.
> - Used the existing mechanism for page accounting. If an iommu-capable
>   domain exists in the container then all pages are already pinned and
>   accounted. Accounting for an mdev device is only done if there is no
>   iommu-capable domain in the container.
> 
> Tested by assigning below combinations of devices to a single VM:
> - GPU pass through only
> - vGPU device only
> - One GPU pass through and one vGPU device
> - two GPU pass through
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
> ---
>  drivers/vfio/vfio.c             |  82 +++++++
>  drivers/vfio/vfio_iommu_type1.c | 499 ++++++++++++++++++++++++++++++++++++----
>  include/linux/vfio.h            |  13 +-
>  3 files changed, 546 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 6fd6fa5469de..1f87e3a30d24 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1782,6 +1782,88 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
>  }
>  EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
>  
> +/*
> + * Pin a set of guest PFNs and return their associated host PFNs for mediated
> + * domain only.

Why only mediated domain?  What assumption is specific to a mediated
domain other than unnecessarily passing an mdev_device?

> + * @user_pfn [in]: array of user/guest PFNs
> + * @npage [in]: count of array elements
> + * @prot [in] : protection flags
>  	if (!vwork)
>  		return;
> -	mm = get_task_mm(current);
> +	mm = get_task_mm(task);
>  	if (!mm) {
>  		kfree(vwork);
>  		return;
> @@ -228,20 +330,31 @@ static int put_pfn(unsigned long pfn, int prot)
>  	return 0;
>  }
>  
> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> +			 int prot, unsigned long *pfn)
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct mm_struct *local_mm = mm ? mm : current->mm;

Some parens would be nice here, local_mm = (mm ? mm : current->mm);

>  	int ret = -EFAULT;
>  
> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> +	if (mm) {
> +		down_read(&local_mm->mmap_sem);
> +		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
> +					!!(prot & IOMMU_WRITE), 0, page, NULL);
> +		up_read(&local_mm->mmap_sem);
> +	} else
> +		ret = get_user_pages_fast(vaddr, 1,
> +					  !!(prot & IOMMU_WRITE), page);
> +
> +	if (ret == 1) {
>  		*pfn = page_to_pfn(page[0]);
>  		return 0;
>  	}
>  
> -	down_read(&current->mm->mmap_sem);
> +	down_read(&local_mm->mmap_sem);
>  
> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
>  
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> @@ -249,7 +362,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>  			ret = 0;
>  	}
>  
> -	up_read(&current->mm->mmap_sem);
> +	up_read(&local_mm->mmap_sem);
>  
>  	return ret;
>  }
> @@ -259,8 +372,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>   * first page and all consecutive pages with the same locking.
>   */
> -static long vfio_pin_pages(unsigned long vaddr, long npage,
> -			   int prot, unsigned long *pfn_base)
> +static long __vfio_pin_pages(unsigned long vaddr, long npage,
> +			     int prot, unsigned long *pfn_base)

This is meant to handle the existing case of page tracking in the IOMMU
API, perhaps __vfio_pin_pages_remote()?

>  {
>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  	bool lock_cap = capable(CAP_IPC_LOCK);
> @@ -270,7 +383,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	if (!current->mm)
>  		return -ENODEV;
>  
> -	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> +	ret = vaddr_get_pfn(NULL, vaddr, prot, pfn_base);
>  	if (ret)
>  		return ret;
>  
> @@ -285,7 +398,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  
>  	if (unlikely(disable_hugepages)) {
>  		if (!rsvd)
> -			vfio_lock_acct(1);
> +			vfio_lock_acct(current, 1);
>  		return 1;
>  	}
>  
> @@ -293,7 +406,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
>  		unsigned long pfn = 0;
>  
> -		ret = vaddr_get_pfn(vaddr, prot, &pfn);
> +		ret = vaddr_get_pfn(NULL, vaddr, prot, &pfn);
>  		if (ret)
>  			break;
>  
> @@ -313,13 +426,13 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	}
>  
>  	if (!rsvd)
> -		vfio_lock_acct(i);
> +		vfio_lock_acct(current, i);
>  
>  	return i;
>  }
>  
> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> -			     int prot, bool do_accounting)
> +static long __vfio_unpin_pages(unsigned long pfn, long npage, int prot,
> +			       bool do_accounting)

__vfio_unpin_pages_remote()?

>  {
>  	unsigned long unlocked = 0;
>  	long i;
> @@ -328,7 +441,188 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
>  		unlocked += put_pfn(pfn++, prot);
>  
>  	if (do_accounting)
> -		vfio_lock_acct(-unlocked);
> +		vfio_lock_acct(current, -unlocked);
> +	return unlocked;
> +}
> +
> +static long __vfio_pin_pages_for_mdev(struct vfio_domain *domain,

Only seems to support pinning a single page, perhaps
__vfio_pin_page_local()?

> +				      unsigned long vaddr, int prot,
> +				      unsigned long *pfn_base,
> +				      bool do_accounting)
> +{
> +	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	bool lock_cap = capable(CAP_IPC_LOCK);
> +	long ret;
> +	bool rsvd;
> +	struct task_struct *task = domain->mdev_addr_space->task;
> +
> +	if (!task->mm)
> +		return -ENODEV;
> +
> +	ret = vaddr_get_pfn(task->mm, vaddr, prot, pfn_base);
> +	if (ret)
> +		return ret;
> +
> +	rsvd = is_invalid_reserved_pfn(*pfn_base);
> +
> +	if (!rsvd && !lock_cap && task->mm->locked_vm + 1 > limit) {
> +		put_pfn(*pfn_base, prot);
> +		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
> +			limit << PAGE_SHIFT);
> +		return -ENOMEM;
> +	}
> +
> +	if (!rsvd && do_accounting)
> +		vfio_lock_acct(task, 1);
> +
> +	return 1;
> +}
> +
> +static void __vfio_unpin_pages_for_mdev(struct vfio_domain *domain,
> +					unsigned long pfn, int prot,
> +					bool do_accounting)
> +{
> +	put_pfn(pfn, prot);
> +
> +	if (do_accounting)
> +		vfio_lock_acct(domain->mdev_addr_space->task, -1);
> +}
> +
> +static int vfio_unpin_pfn(struct vfio_domain *domain,
> +			  struct vfio_pfn *vpfn, bool do_accounting)
> +{
> +	__vfio_unpin_pages_for_mdev(domain, vpfn->pfn, vpfn->prot,
> +				    do_accounting);
> +
> +	if (atomic_dec_and_test(&vpfn->ref_count))
> +		vfio_remove_from_pfn_list(domain, vpfn);
> +
> +	return 1;
> +}
> +
> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
> +				       unsigned long *user_pfn,
> +				       long npage, int prot,
> +				       unsigned long *phys_pfn)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain;
> +	int i, j, ret;
> +	long retpage;
> +	unsigned long remote_vaddr;
> +	unsigned long *pfn = phys_pfn;
> +	struct vfio_dma *dma;
> +	bool do_accounting = false;
> +
> +	if (!iommu || !user_pfn || !phys_pfn)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	if (!iommu->mediated_domain) {
> +		ret = -EINVAL;
> +		goto pin_done;
> +	}
> +
> +	domain = iommu->mediated_domain;
> +
> +	/*
> +	 * If an iommu capable domain exists in the container then all pages
> +	 * are already pinned and accounted. Accounting should be done if
> +	 * there is no iommu capable domain in the container.
> +	 */
> +	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +		dma_addr_t iova;
> +
> +		iova = user_pfn[i] << PAGE_SHIFT;
> +
> +		dma = vfio_find_dma(iommu, iova, 0);
> +		if (!dma) {
> +			ret = -EINVAL;
> +			goto pin_unwind;
> +		}
> +
> +		remote_vaddr = dma->vaddr + iova - dma->iova;
> +
> +		retpage = __vfio_pin_pages_for_mdev(domain, remote_vaddr, prot,
> +						    &pfn[i], do_accounting);
> +		if (retpage <= 0) {
> +			WARN_ON(!retpage);
> +			ret = (int)retpage;
> +			goto pin_unwind;
> +		}
> +
> +		mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
> +
> +		/* search if pfn exist */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p) {
> +			atomic_inc(&p->ref_count);
> +			mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
> +			continue;
> +		}
> +
> +		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
> +					   pfn[i], prot);
> +		mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
> +
> +		if (ret) {
> +			__vfio_unpin_pages_for_mdev(domain, pfn[i], prot,
> +						    do_accounting);
> +			goto pin_unwind;
> +		}
> +	}
> +
> +	ret = i;
> +	goto pin_done;
> +
> +pin_unwind:
> +	pfn[i] = 0;
> +	mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
> +	for (j = 0; j < i; j++) {
> +		struct vfio_pfn *p;
> +
> +		p = vfio_find_pfn(domain, pfn[j]);
> +		if (p)
> +			vfio_unpin_pfn(domain, p, do_accounting);
> +
> +		pfn[j] = 0;
> +	}
> +	mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
> +
> +pin_done:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
> +					 long npage)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL;
> +	long unlocked = 0;
> +	int i;
> +
> +	if (!iommu || !pfn)
> +		return -EINVAL;
> +
> +	domain = iommu->mediated_domain;
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +
> +		mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
> +
> +		/* verify if pfn exist in pfn_list */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p)
> +			unlocked += vfio_unpin_pfn(domain, p, true);
> +
> +		mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
> +	}
>  
>  	return unlocked;
>  }
> @@ -341,6 +635,9 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  
>  	if (!dma->size)
>  		return;
> +
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		return;
>  	/*
>  	 * We use the IOMMU to track the physical addresses, otherwise we'd
>  	 * need a much more complicated tracking system.  Unfortunately that
> @@ -382,15 +679,15 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  		if (WARN_ON(!unmapped))
>  			break;
>  
> -		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
> -					     unmapped >> PAGE_SHIFT,
> -					     dma->prot, false);
> +		unlocked += __vfio_unpin_pages(phys >> PAGE_SHIFT,
> +					       unmapped >> PAGE_SHIFT,
> +					       dma->prot, false);
>  		iova += unmapped;
>  
>  		cond_resched();
>  	}
>  
> -	vfio_lock_acct(-unlocked);
> +	vfio_lock_acct(current, -unlocked);
>  }
>  
>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> @@ -611,10 +908,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	/* Insert zero-sized and grow as we map chunks of it */
>  	vfio_link_dma(iommu, dma);
>  
> +	/* Don't pin and map if container doesn't contain IOMMU capable domain*/
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)) {
> +		dma->size = size;
> +		goto map_done;
> +	}
> +
>  	while (size) {
>  		/* Pin a contiguous chunk of memory */
> -		npage = vfio_pin_pages(vaddr + dma->size,
> -				       size >> PAGE_SHIFT, prot, &pfn);
> +		npage = __vfio_pin_pages(vaddr + dma->size,
> +					 size >> PAGE_SHIFT, prot, &pfn);
>  		if (npage <= 0) {
>  			WARN_ON(!npage);
>  			ret = (int)npage;
> @@ -624,7 +927,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  		/* Map it! */
>  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
>  		if (ret) {
> -			vfio_unpin_pages(pfn, npage, prot, true);
> +			__vfio_unpin_pages(pfn, npage, prot, true);
>  			break;
>  		}
>  
> @@ -635,6 +938,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	if (ret)
>  		vfio_remove_dma(iommu, dma);
>  
> +map_done:
>  	mutex_unlock(&iommu->lock);
>  	return ret;
>  }
> @@ -734,11 +1038,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>  	__free_pages(pages, order);
>  }
>  
> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
> +				   struct iommu_group *iommu_group)
> +{
> +	struct vfio_group *g;
> +
> +	list_for_each_entry(g, &domain->group_list, next) {
> +		if (g->iommu_group == iommu_group)
> +			return g;
> +	}
> +
> +	return NULL;
> +}
> +
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					 struct iommu_group *iommu_group)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
> -	struct vfio_group *group, *g;
> +	struct vfio_group *group;
>  	struct vfio_domain *domain, *d;
>  	struct bus_type *bus = NULL;
>  	int ret;
> @@ -746,10 +1063,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	mutex_lock(&iommu->lock);
>  
>  	list_for_each_entry(d, &iommu->domain_list, next) {
> -		list_for_each_entry(g, &d->group_list, next) {
> -			if (g->iommu_group != iommu_group)
> -				continue;
> +		if (find_iommu_group(d, iommu_group)) {
> +			mutex_unlock(&iommu->lock);
> +			return -EINVAL;
> +		}
> +	}
>  
> +	if (iommu->mediated_domain) {
> +		if (find_iommu_group(iommu->mediated_domain, iommu_group)) {
>  			mutex_unlock(&iommu->lock);
>  			return -EINVAL;
>  		}
> @@ -769,6 +1090,34 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_free;
>  
> +#if defined(CONFIG_VFIO_MDEV) || defined(CONFIG_VFIO_MDEV_MODULE)
> +	if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
> +		if (iommu->mediated_domain) {
> +			list_add(&group->next,
> +				 &iommu->mediated_domain->group_list);
> +			kfree(domain);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +		}
> +
> +		domain->mdev_addr_space = kzalloc(sizeof(*domain->mdev_addr_space),
> +						  GFP_KERNEL);
> +		if (!domain->mdev_addr_space) {
> +			ret = -ENOMEM;
> +			goto out_free;
> +		}
> +
> +		domain->mdev_addr_space->task = current;
> +		INIT_LIST_HEAD(&domain->group_list);
> +		list_add(&group->next, &domain->group_list);
> +		domain->mdev_addr_space->pfn_list = RB_ROOT;
> +		mutex_init(&domain->mdev_addr_space->pfn_list_lock);
> +		iommu->mediated_domain = domain;
> +		mutex_unlock(&iommu->lock);
> +		return 0;
> +	}
> +#endif
> +
>  	domain->domain = iommu_domain_alloc(bus);
>  	if (!domain->domain) {
>  		ret = -EIO;
> @@ -859,6 +1208,18 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
>  		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
>  }
>  
> +static void vfio_mdev_unpin_all(struct vfio_domain *domain)
> +{
> +	struct rb_node *node;
> +
> +	mutex_lock(&domain->mdev_addr_space->pfn_list_lock);
> +	while ((node = rb_first(&domain->mdev_addr_space->pfn_list))) {
> +		vfio_unpin_pfn(domain,
> +				rb_entry(node, struct vfio_pfn, node), false);
> +	}
> +	mutex_unlock(&domain->mdev_addr_space->pfn_list_lock);
> +}
> +
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
>  					  struct iommu_group *iommu_group)
>  {
> @@ -868,31 +1229,52 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  
>  	mutex_lock(&iommu->lock);
>  
> -	list_for_each_entry(domain, &iommu->domain_list, next) {
> -		list_for_each_entry(group, &domain->group_list, next) {
> -			if (group->iommu_group != iommu_group)
> -				continue;
> +	if (iommu->mediated_domain) {
> +		domain = iommu->mediated_domain;
> +		group = find_iommu_group(domain, iommu_group);
> +		if (group) {
> +			list_del(&group->next);
> +			kfree(group);
>  
> +			if (list_empty(&domain->group_list)) {
> +				vfio_mdev_unpin_all(domain);
> +				if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +					vfio_iommu_unmap_unpin_all(iommu);
> +				kfree(domain);
> +				iommu->mediated_domain = NULL;
> +			}
> +			goto detach_group_done;
> +		}
> +	}
> +
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		goto detach_group_done;
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		group = find_iommu_group(domain, iommu_group);
> +		if (group) {
>  			iommu_detach_group(domain->domain, iommu_group);
>  			list_del(&group->next);
>  			kfree(group);
>  			/*
>  			 * Group ownership provides privilege, if the group
>  			 * list is empty, the domain goes away.  If it's the
> -			 * last domain, then all the mappings go away too.
> +			 * last iommu-backed domain and no mediated domain
> +			 * exists, then all the mappings go away too.
>  			 */
>  			if (list_empty(&domain->group_list)) {
> -				if (list_is_singular(&iommu->domain_list))
> +				if (list_is_singular(&iommu->domain_list) &&
> +				    !iommu->mediated_domain)
>  					vfio_iommu_unmap_unpin_all(iommu);
>  				iommu_domain_free(domain->domain);
>  				list_del(&domain->next);
>  				kfree(domain);
>  			}
> -			goto done;
> +			break;
>  		}
>  	}
>  
> -done:
> +detach_group_done:
>  	mutex_unlock(&iommu->lock);
>  }
>  
> @@ -924,27 +1306,48 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>  	return iommu;
>  }
>  
> +static void vfio_release_domain(struct vfio_domain *domain)
> +{
> +	struct vfio_group *group, *group_tmp;
> +
> +	list_for_each_entry_safe(group, group_tmp,
> +				 &domain->group_list, next) {
> +		if (!domain->mdev_addr_space)
> +			iommu_detach_group(domain->domain, group->iommu_group);
> +		list_del(&group->next);
> +		kfree(group);
> +	}
> +
> +	if (domain->mdev_addr_space)
> +		vfio_mdev_unpin_all(domain);
> +	else
> +		iommu_domain_free(domain->domain);
> +}
> +
>  static void vfio_iommu_type1_release(void *iommu_data)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
>  	struct vfio_domain *domain, *domain_tmp;
> -	struct vfio_group *group, *group_tmp;
> +
> +	if (iommu->mediated_domain) {
> +		vfio_release_domain(iommu->mediated_domain);
> +		kfree(iommu->mediated_domain);
> +		iommu->mediated_domain = NULL;
> +	}
>  
>  	vfio_iommu_unmap_unpin_all(iommu);
>  
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		goto release_exit;
> +
>  	list_for_each_entry_safe(domain, domain_tmp,
>  				 &iommu->domain_list, next) {
> -		list_for_each_entry_safe(group, group_tmp,
> -					 &domain->group_list, next) {
> -			iommu_detach_group(domain->domain, group->iommu_group);
> -			list_del(&group->next);
> -			kfree(group);
> -		}
> -		iommu_domain_free(domain->domain);
> +		vfio_release_domain(domain);
>  		list_del(&domain->next);
>  		kfree(domain);
>  	}
>  
> +release_exit:
>  	kfree(iommu);
>  }
>  
> @@ -1048,6 +1451,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>  	.ioctl		= vfio_iommu_type1_ioctl,
>  	.attach_group	= vfio_iommu_type1_attach_group,
>  	.detach_group	= vfio_iommu_type1_detach_group,
> +	.pin_pages	= vfio_iommu_type1_pin_pages,
> +	.unpin_pages	= vfio_iommu_type1_unpin_pages,
>  };
>  


I see how you're trying to only do accounting when there is only an
mdev (local) domain, but the devices attached to the normal iommu API
domain can go away at any point.  Where do we re-establish accounting
should the pinning from those devices be removed?  I don't see that as
being an optional support case since userspace can already do this.
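
The accounting gap being raised can be modeled in a few lines of
user-space C (all names here are invented for illustration, not part of
the patch): while an IOMMU-capable domain exists, its pinning charges
every mapped page against locked_vm, so the local (mdev) pins are
deliberately not charged; if the last IOMMU-capable domain is then
detached and its accounting unwound, the still-pinned local pages must
be re-charged or locked_vm undercounts.

```c
#include <assert.h>

/*
 * Hypothetical model of the accounting hand-off described above.
 */
struct acct_model {
	long locked_vm;		/* pages charged against RLIMIT_MEMLOCK */
	long iommu_pinned;	/* pages pinned via the IOMMU-capable path */
	long local_pinned;	/* pages pinned only via the mdev path */
};

static void detach_last_iommu_domain(struct acct_model *m)
{
	/* the IOMMU path unpins and un-accounts everything it mapped */
	m->locked_vm -= m->iommu_pinned;
	m->iommu_pinned = 0;

	/* the step the review asks about: re-establish accounting for
	 * pages still pinned by the local/mdev domain */
	m->locked_vm += m->local_pinned;
}
```

Without the final re-charge, locked_vm would drop to zero here even
though local_pinned pages remain pinned.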

>  static int __init vfio_iommu_type1_init(void)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 431b824b0d3e..abae882122aa 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -17,6 +17,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/poll.h>
>  #include <uapi/linux/vfio.h>
> +#include <linux/mdev.h>
>  
>  #define VFIO_PCI_OFFSET_SHIFT   40
>  
> @@ -82,7 +83,11 @@ struct vfio_iommu_driver_ops {
>  					struct iommu_group *group);
>  	void		(*detach_group)(void *iommu_data,
>  					struct iommu_group *group);
> -
> +	long		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
> +				     long npage, int prot,
> +				     unsigned long *phys_pfn);
> +	long		(*unpin_pages)(void *iommu_data, unsigned long *pfn,
> +				       long npage);
>  };
>  
>  extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
> @@ -134,6 +139,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>  }
>  #endif /* CONFIG_EEH */
>  
> +extern long vfio_pin_pages(struct mdev_device *mdev, unsigned long *user_pfn,
> +			   long npage, int prot, unsigned long *phys_pfn);
> +
> +extern long vfio_unpin_pages(struct mdev_device *mdev, unsigned long *pfn,
> +			     long npage);
> +
>  /*
>   * IRQfd - generic
>   */

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
  2016-08-09 19:00     ` [Qemu-devel] " Alex Williamson
@ 2016-08-10 21:23       ` Kirti Wankhede
  -1 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-10 21:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi



On 8/10/2016 12:30 AM, Alex Williamson wrote:
> On Thu, 4 Aug 2016 00:33:52 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 

...

>> +
>> +		switch (info.index) {
>> +		case VFIO_PCI_CONFIG_REGION_INDEX:
>> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
>> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> 
> No, vmdev->vfio_region_info[info.index].offset
>

Ok.

>> +			info.size = vmdev->vfio_region_info[info.index].size;
>> +			if (!info.size) {
>> +				info.flags = 0;
>> +				break;
>> +			}
>> +
>> +			info.flags = vmdev->vfio_region_info[info.index].flags;
>> +			break;
>> +		case VFIO_PCI_VGA_REGION_INDEX:
>> +		case VFIO_PCI_ROM_REGION_INDEX:
> 
> Why?  Let the vendor driver decide.
> 

Ok.

>> +		switch (info.index) {
>> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
>> +		case VFIO_PCI_REQ_IRQ_INDEX:
>> +			break;
>> +			/* pass thru to return error */
>> +		case VFIO_PCI_MSIX_IRQ_INDEX:
> 
> ???

Sorry, I missed updating this. I'll update it.

>> +	case VFIO_DEVICE_SET_IRQS:
>> +	{
...
>> +
>> +		if (parent->ops->set_irqs)
>> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
>> +						    hdr.start, hdr.count, data);
>> +
>> +		kfree(ptr);
>> +		return ret;
> 
> Return success if no set_irqs callback?
>

Ideally, the vendor driver should provide this function. If the vendor
driver doesn't provide it, do we really need to fail here?


>> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
>> +			      size_t count, loff_t *ppos)
>> +{
>> +	struct vfio_mdev *vmdev = device_data;
>> +	struct mdev_device *mdev = vmdev->mdev;
>> +	struct parent_device *parent = mdev->parent;
>> +	int ret = 0;
>> +
>> +	if (!count)
>> +		return 0;
>> +
>> +	if (parent->ops->read) {
>> +		char *ret_data, *ptr;
>> +
>> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);
> 
> Do we really need to support arbitrary lengths in one shot?  Seems like
> we could just use a 4 or 8 byte variable on the stack and iterate until
> done.
> 

We just want to pass the arguments through to the vendor driver as-is
here. The vendor driver can take care of that.
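
For reference, a rough user-space sketch of the iteration Alex
suggests: service the read from a small stack buffer in at-most-8-byte
chunks instead of a count-sized kzalloc().  vendor_read_fn is a
hypothetical stand-in for the vendor driver's ->read callback, and the
memcpy() would be a copy_to_user() in the kernel.

```c
#include <string.h>
#include <sys/types.h>

/* hypothetical stand-in for the vendor driver's ->read callback */
typedef ssize_t (*vendor_read_fn)(char *buf, size_t count, long long pos);

static ssize_t chunked_read(vendor_read_fn vendor_read, char *user_buf,
			    size_t count, long long pos)
{
	size_t done = 0;

	while (done < count) {
		char tmp[8];		/* fixed stack buffer, no kzalloc */
		size_t fill = count - done;
		ssize_t ret;

		if (fill > sizeof(tmp))
			fill = sizeof(tmp);

		ret = vendor_read(tmp, fill, pos + (long long)done);
		if (ret <= 0)
			return done ? (ssize_t)done : ret;

		memcpy(user_buf + done, tmp, (size_t)ret);	/* copy_to_user() */
		done += (size_t)ret;
	}

	return (ssize_t)done;
}
```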

>> +
>> +static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
>> +			       size_t count, loff_t *ppos)
>> +{
>> +	struct vfio_mdev *vmdev = device_data;
>> +	struct mdev_device *mdev = vmdev->mdev;
>> +	struct parent_device *parent = mdev->parent;
>> +	int ret = 0;
>> +
>> +	if (!count)
>> +		return 0;
>> +
>> +	if (parent->ops->write) {
>> +		char *usr_data, *ptr;
>> +
>> +		ptr = usr_data = memdup_user(buf, count);
> 
> Same here, how much do we care to let the user write in one pass and is
> there any advantage to it?  When QEMU is our userspace we're only
> likely to see 4-byte accesses anyway.

Same as above.

>> +
>> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>> +{
...
>> +	} else {
>> +		struct pci_dev *pdev;
>> +
>> +		virtaddr = vma->vm_start;
>> +		req_size = vma->vm_end - vma->vm_start;
>> +
>> +		pdev = to_pci_dev(parent->dev);
>> +		index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);
> 
> Iterate through region_info[*].offset/size provided by vendor driver.
> 

Yes, makes sense.

>> +
>> +int vfio_mpci_match(struct device *dev)
>> +{
>> +	if (dev_is_pci(dev->parent))
> 
> This is the wrong test, there's really no requirement that a pci mdev
> device is hosted by a real pci device.  

Ideally this module is for mediated devices whose parent is a PCI
device. We also rely on kernel functions like pci_resource_start() and
to_pci_dev() in this module, so it is better to check this while
loading.


> Can't we check that the device
> is on an mdev_pci_bus_type?
> 

I didn't get this part.

Each mediated device is of mdev_bus_type, but the VFIO module could
differ based on the parent device type, and several such modules could
be loaded at the same time. For example, there could be separate
modules for channel I/O or other device types, all loaded
simultaneously. When an mdev device is created, the check in each
module's match() function determines which driver gets linked to that
mdev device.

If this check is not based on the parent device type, do you expect the
vendor driver to set the parent device type, and the corresponding VFIO
driver to be loaded accordingly?
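
A toy model of the two match strategies under discussion may make the
difference concrete (all types and names here are invented for
illustration): strategy A keys off the parent's bus type, as the patch
does; strategy B keys off a dedicated bus type for PCI-flavoured mdevs,
as the review suggests, and the two diverge exactly when a PCI mdev is
not hosted by a real PCI device.

```c
#include <stdbool.h>
#include <stddef.h>

/* invented bus types for the sake of the model */
enum bus { BUS_PCI, BUS_CCW, BUS_MDEV, BUS_MDEV_PCI };

struct toy_dev {
	enum bus bus;
	struct toy_dev *parent;
};

/* strategy A: match if the mdev's parent sits on the real PCI bus */
static bool match_by_parent(const struct toy_dev *dev)
{
	return dev->parent && dev->parent->bus == BUS_PCI;
}

/* strategy B: match if the mdev itself is on a PCI-flavoured mdev bus */
static bool match_by_own_bus(const struct toy_dev *dev)
{
	return dev->bus == BUS_MDEV_PCI;
}
```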


>> @@ -18,6 +18,7 @@
>>  #include <linux/uaccess.h>
>>  #include <linux/io.h>
>>  #include <linux/vgaarb.h>
>> +#include <linux/vfio.h>
>>  
>>  #include "vfio_pci_private.h"
>>  
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index 0ecae0b1cd34..431b824b0d3e 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -18,6 +18,13 @@
>>  #include <linux/poll.h>
>>  #include <uapi/linux/vfio.h>
>>  
>> +#define VFIO_PCI_OFFSET_SHIFT   40
>> +
>> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
>> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
>> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
>> +
>> +
> 
> Nak this, I'm not interested in making this any sort of ABI.
> 

These macros are used by drivers/vfio/pci/vfio_pci.c and
drivers/vfio/mdev/vfio_mpci.c, and to use them in both modules they
should be moved to a common place, as you suggested in earlier reviews.
I think this is the best common place. Is there any other suggestion?

>> +static u8 mpci_find_pci_capability(struct mdev_device *mdev, u8 capability)
>> +{
>> +	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);
> 
> This creates a fixed ABI between vfio-mdev-pci and vendor drivers that
> a given region starts at a pre-defined offset.  We have the offset
> stored in vfio_mdev.region_info[VFIO_PCI_CONFIG_REGION_INDEX].offset,
> use it.  It's just as unacceptable to impose this fixed relationship
> with a vendor driver here as if a userspace driver were to do the same.
> 

In v5, where config space was cached in this module, the suggestion was
to not care about the data or cache it at read/write time, but just
pass it through. Now that the VFIO_PCI_* macros are also available
here, the vendor driver can use them to decode pos to find the region
index and offset of the access. The vendor driver then adds
vmdev->vfio_region_info[info.index].offset itself, which it knows.
Should we do this in the VFIO module or in the vendor driver?
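
For reference, the split these macros implement is straightforward:
bits 63:40 of the file offset carry the region index and bits 39:0 the
offset within that region, giving each region a 1 TiB window.  A
user-space sketch of the decode a vendor driver would do (macro values
copied from the patch above; the helper name is hypothetical):

```c
#include <stdint.h>

/* values as in the patch, with uint64_t standing in for the kernel's u64 */
#define VFIO_PCI_OFFSET_SHIFT	40
#define VFIO_PCI_OFFSET_TO_INDEX(off)	((off) >> VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_INDEX_TO_OFFSET(index)	((uint64_t)(index) << VFIO_PCI_OFFSET_SHIFT)
#define VFIO_PCI_OFFSET_MASK	(((uint64_t)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)

/* split a file offset into region index and offset within the region */
static void decode_pos(uint64_t pos, uint32_t *index, uint64_t *offset)
{
	*index = (uint32_t)VFIO_PCI_OFFSET_TO_INDEX(pos);
	*offset = pos & VFIO_PCI_OFFSET_MASK;
}
```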

Thanks,
Kirti.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [Qemu-devel] [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
@ 2016-08-10 21:23       ` Kirti Wankhede
  0 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-10 21:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi



On 8/10/2016 12:30 AM, Alex Williamson wrote:
> On Thu, 4 Aug 2016 00:33:52 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 

...

>> +
>> +		switch (info.index) {
>> +		case VFIO_PCI_CONFIG_REGION_INDEX:
>> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
>> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> 
> No, vmdev->vfio_region_info[info.index].offset
>

Ok.

>> +			info.size = vmdev->vfio_region_info[info.index].size;
>> +			if (!info.size) {
>> +				info.flags = 0;
>> +				break;
>> +			}
>> +
>> +			info.flags = vmdev->vfio_region_info[info.index].flags;
>> +			break;
>> +		case VFIO_PCI_VGA_REGION_INDEX:
>> +		case VFIO_PCI_ROM_REGION_INDEX:
> 
> Why?  Let the vendor driver decide.
> 

Ok.

>> +		switch (info.index) {
>> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
>> +		case VFIO_PCI_REQ_IRQ_INDEX:
>> +			break;
>> +			/* pass thru to return error */
>> +		case VFIO_PCI_MSIX_IRQ_INDEX:
> 
> ???

Sorry, I missed to update this. Updating it.

>> +	case VFIO_DEVICE_SET_IRQS:
>> +	{
...
>> +
>> +		if (parent->ops->set_irqs)
>> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
>> +						    hdr.start, hdr.count, data);
>> +
>> +		kfree(ptr);
>> +		return ret;
> 
> Return success if no set_irqs callback?
>

Ideally, vendor driver should provide this function. If vendor driver
doesn't provide it, do we really need to fail here?


>> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
>> +			      size_t count, loff_t *ppos)
>> +{
>> +	struct vfio_mdev *vmdev = device_data;
>> +	struct mdev_device *mdev = vmdev->mdev;
>> +	struct parent_device *parent = mdev->parent;
>> +	int ret = 0;
>> +
>> +	if (!count)
>> +		return 0;
>> +
>> +	if (parent->ops->read) {
>> +		char *ret_data, *ptr;
>> +
>> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);
> 
> Do we really need to support arbitrary lengths in one shot?  Seems like
> we could just use a 4 or 8 byte variable on the stack and iterate until
> done.
> 

We just want to pass the arguments to vendor driver as is here. Vendor
driver could take care of that.

>> +
>> +static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
>> +			       size_t count, loff_t *ppos)
>> +{
>> +	struct vfio_mdev *vmdev = device_data;
>> +	struct mdev_device *mdev = vmdev->mdev;
>> +	struct parent_device *parent = mdev->parent;
>> +	int ret = 0;
>> +
>> +	if (!count)
>> +		return 0;
>> +
>> +	if (parent->ops->write) {
>> +		char *usr_data, *ptr;
>> +
>> +		ptr = usr_data = memdup_user(buf, count);
> 
> Same here, how much do we care to let the user write in one pass and is
> there any advantage to it?  When QEMU is our userspace we're only
> likely to see 4-byte accesses anyway.

Same as above.

>> +
>> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>> +{
...
>> +	} else {
>> +		struct pci_dev *pdev;
>> +
>> +		virtaddr = vma->vm_start;
>> +		req_size = vma->vm_end - vma->vm_start;
>> +
>> +		pdev = to_pci_dev(parent->dev);
>> +		index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);
> 
> Iterate through region_info[*].offset/size provided by vendor driver.
> 

Yes, makes sense.

>> +
>> +int vfio_mpci_match(struct device *dev)
>> +{
>> +	if (dev_is_pci(dev->parent))
> 
> This is the wrong test, there's really no requirement that a pci mdev
> device is hosted by a real pci device.  

Ideally this module is for the mediated device whose parent is PCI
device. And we are relying on kernel functions like
pci_resource_start(), to_pci_dev() in this module, so better to check it
while loading.


> Can't we check that the device
> is on an mdev_pci_bus_type?
> 

I didn't get this part.

Each mediated device is of mdev_bus_type, but the VFIO module could
differ based on the parent device type, and several such modules could
be loaded at the same time. For example, there could be a separate
module for channel I/O or any other device type, loaded alongside this
one. When an mdev device is created, the match() function of each
module is checked and the proper driver is linked to that mdev device.

If this check is not based on the parent device type, do you expect the
vendor driver to set the parent device type so that the corresponding
VFIO driver is loaded accordingly?


>> @@ -18,6 +18,7 @@
>>  #include <linux/uaccess.h>
>>  #include <linux/io.h>
>>  #include <linux/vgaarb.h>
>> +#include <linux/vfio.h>
>>  
>>  #include "vfio_pci_private.h"
>>  
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index 0ecae0b1cd34..431b824b0d3e 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -18,6 +18,13 @@
>>  #include <linux/poll.h>
>>  #include <uapi/linux/vfio.h>
>>  
>> +#define VFIO_PCI_OFFSET_SHIFT   40
>> +
>> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
>> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
>> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
>> +
>> +
> 
> Nak this, I'm not interested in making this any sort of ABI.
> 

These macros are used by drivers/vfio/pci/vfio_pci.c and
drivers/vfio/mdev/vfio_mpci.c, and to use them in both modules they
should be moved to a common place, as you suggested in earlier
reviews. I think this is the best common place. Is there any other
suggestion?

>> +static u8 mpci_find_pci_capability(struct mdev_device *mdev, u8 capability)
>> +{
>> +	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);
> 
> This creates a fixed ABI between vfio-mdev-pci and vendor drivers that
> a given region starts at a pre-defined offset.  We have the offset
> stored in vfio_mdev.region_info[VFIO_PCI_CONFIG_REGION_INDEX].offset,
> use it.  It's just as unacceptable to impose this fixed relationship
> with a vendor driver here as if a userspace driver were to do the same.
> 

In the v5 version, where config space was cached in this module, the
suggestion was not to care about the data or cache it at read/write,
just pass it through. Now that the VFIO_PCI_* macros are also available
here, the vendor driver can use them to decode pos into a region index
and an offset within that region, and then itself add
vmdev->vfio_region_info[info.index].offset, which it knows. Should we
do this in the VFIO module or in the vendor driver?

Thanks,
Kirti.

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
  2016-08-10 21:23       ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-10 23:00         ` Alex Williamson
  -1 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-10 23:00 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Thu, 11 Aug 2016 02:53:10 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 8/10/2016 12:30 AM, Alex Williamson wrote:
> > On Thu, 4 Aug 2016 00:33:52 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> 
> ...
> 
> >> +
> >> +		switch (info.index) {
> >> +		case VFIO_PCI_CONFIG_REGION_INDEX:
> >> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> >> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);  
> > 
> > No, vmdev->vfio_region_info[info.index].offset
> >  
> 
> Ok.
> 
> >> +			info.size = vmdev->vfio_region_info[info.index].size;
> >> +			if (!info.size) {
> >> +				info.flags = 0;
> >> +				break;
> >> +			}
> >> +
> >> +			info.flags = vmdev->vfio_region_info[info.index].flags;
> >> +			break;
> >> +		case VFIO_PCI_VGA_REGION_INDEX:
> >> +		case VFIO_PCI_ROM_REGION_INDEX:  
> > 
> > Why?  Let the vendor driver decide.
> >   
> 
> Ok.
> 
> >> +		switch (info.index) {
> >> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
> >> +		case VFIO_PCI_REQ_IRQ_INDEX:
> >> +			break;
> >> +			/* pass thru to return error */
> >> +		case VFIO_PCI_MSIX_IRQ_INDEX:  
> > 
> > ???  
> 
> Sorry, I missed updating this. Will update it.
> 
> >> +	case VFIO_DEVICE_SET_IRQS:
> >> +	{  
> ...
> >> +
> >> +		if (parent->ops->set_irqs)
> >> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
> >> +						    hdr.start, hdr.count, data);
> >> +
> >> +		kfree(ptr);
> >> +		return ret;  
> > 
> > Return success if no set_irqs callback?
> >  
> 
> Ideally, vendor driver should provide this function. If vendor driver
> doesn't provide it, do we really need to fail here?

Wouldn't you as a user expect to get an error if you try to call an
ioctl that has no backing, rather than assume success and never receive
an interrupt?
 
> >> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
> >> +			      size_t count, loff_t *ppos)
> >> +{
> >> +	struct vfio_mdev *vmdev = device_data;
> >> +	struct mdev_device *mdev = vmdev->mdev;
> >> +	struct parent_device *parent = mdev->parent;
> >> +	int ret = 0;
> >> +
> >> +	if (!count)
> >> +		return 0;
> >> +
> >> +	if (parent->ops->read) {
> >> +		char *ret_data, *ptr;
> >> +
> >> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);  
> > 
> > Do we really need to support arbitrary lengths in one shot?  Seems like
> > we could just use a 4 or 8 byte variable on the stack and iterate until
> > done.
> >   
> 
> We just want to pass the arguments to vendor driver as is here. Vendor
> driver could take care of that.

But I think this is exploitable: it lets the user make the kernel
allocate an arbitrarily sized buffer.
 
> >> +
> >> +static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
> >> +			       size_t count, loff_t *ppos)
> >> +{
> >> +	struct vfio_mdev *vmdev = device_data;
> >> +	struct mdev_device *mdev = vmdev->mdev;
> >> +	struct parent_device *parent = mdev->parent;
> >> +	int ret = 0;
> >> +
> >> +	if (!count)
> >> +		return 0;
> >> +
> >> +	if (parent->ops->write) {
> >> +		char *usr_data, *ptr;
> >> +
> >> +		ptr = usr_data = memdup_user(buf, count);  
> > 
> > Same here, how much do we care to let the user write in one pass and is
> > there any advantage to it?  When QEMU is our userspace we're only
> > likely to see 4-byte accesses anyway.  
> 
> Same as above.
> 
> >> +
> >> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> >> +{  
> ...
> >> +	} else {
> >> +		struct pci_dev *pdev;
> >> +
> >> +		virtaddr = vma->vm_start;
> >> +		req_size = vma->vm_end - vma->vm_start;
> >> +
> >> +		pdev = to_pci_dev(parent->dev);
> >> +		index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);  
> > 
> > Iterate through region_info[*].offset/size provided by vendor driver.
> >   
> 
> Yes, makes sense.
> 
> >> +
> >> +int vfio_mpci_match(struct device *dev)
> >> +{
> >> +	if (dev_is_pci(dev->parent))  
> > 
> > This is the wrong test, there's really no requirement that a pci mdev
> > device is hosted by a real pci device.    
> 
> Ideally this module is for the mediated device whose parent is PCI
> device. And we are relying on kernel functions like
> pci_resource_start(), to_pci_dev() in this module, so better to check it
> while loading.

IMO, we don't want to care what the parent device is, it's not ideal,
it's actually a limitation to impose that it is a PCI device.  I want to
be able to make purely virtual mediated devices.  I only see that you
use these functions in the mmio fault handling.  Is it useful to assume
that on mmio fault we map to the parent device PCI BAR regions?  Just
require that the vendor driver provides a fault mapping function or
SIGBUS if we get a fault and it doesn't.

> > Can't we check that the device
> > is on an mdev_pci_bus_type?
> >   
> 
> I didn't get this part.
> 
> Each mediated device is of mdev_bus_type. But VFIO module could be
> different based on parent device type and loaded at the same time. For
> example, there should be different modules for channel IO or any other
> type of devices and could be loaded at the same time. Then when mdev
> device is created based on check in match() function of each module, and
> proper driver would be linked for that mdev device.
> 
> If this check is not based on parent device type, do you expect to set
> parent device type by vendor driver and accordingly load corresponding
> VFIO driver?

mdev_pci_bus_type was an off the cuff response since the driver.bus
controls which devices a probe function will see.  If we have a unique
bus for a driver and create devices appropriately, we really don't
even need a match function.  That would still work, but what if you
made a get_device_info callback to the vendor driver rather than
creating that info in the mediated bus driver layer.  Then the probe
function here could simply check the flags to see if the device is
VFIO_DEVICE_FLAGS_PCI?

> >> @@ -18,6 +18,7 @@
> >>  #include <linux/uaccess.h>
> >>  #include <linux/io.h>
> >>  #include <linux/vgaarb.h>
> >> +#include <linux/vfio.h>
> >>  
> >>  #include "vfio_pci_private.h"
> >>  
> >> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >> index 0ecae0b1cd34..431b824b0d3e 100644
> >> --- a/include/linux/vfio.h
> >> +++ b/include/linux/vfio.h
> >> @@ -18,6 +18,13 @@
> >>  #include <linux/poll.h>
> >>  #include <uapi/linux/vfio.h>
> >>  
> >> +#define VFIO_PCI_OFFSET_SHIFT   40
> >> +
> >> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
> >> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> >> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> >> +
> >> +  
> > 
> > Nak this, I'm not interested in making this any sort of ABI.
> >   
> 
> These macros are used by drivers/vfio/pci/vfio_pci.c and
> drivers/vfio/mdev/vfio_mpci.c and to use those in both these modules,
> they should be moved to common place as you suggested in earlier
> reviews. I think this is better common place. Are there any other
> suggestion?

They're only used in ways that I objected to above and you've agreed
to.  These define implementation details that must not become part of
the mediated vendor driver ABI.  A vendor driver is free to redefine
this the same if they want, but as we can see with how easily they slip
into code where they don't belong, the only way to make sure they don't
become ABI is to keep them in private headers.
 
> >> +static u8 mpci_find_pci_capability(struct mdev_device *mdev, u8 capability)
> >> +{
> >> +	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);  
> > 
> > This creates a fixed ABI between vfio-mdev-pci and vendor drivers that
> > a given region starts at a pre-defined offset.  We have the offset
> > stored in vfio_mdev.region_info[VFIO_PCI_CONFIG_REGION_INDEX].offset,
> > use it.  It's just as unacceptable to impose this fixed relationship
> > with a vendor driver here as if a userspace driver were to do the same.
> >   
> 
> In the v5 version, where config space was cached in this module,
> suggestion was to don't care about data or caching it at read/write,
> just pass it through. Now since VFIO_PCI_* macros are also available
> here, vendor driver can use it to decode pos to find region index and
> offset of access. Then vendor driver itself add
> vmdev->vfio_region_info[info.index].offset, which is known to him.
> Either we do this in VFIO module or vendor driver?

As I say above, a vendor driver is absolutely free to use the same
index/offset scheme, but it absolutely must not be part of the ABI
between vendor drivers and the mediated driver core.  It's up to the
vendor driver to define that relation and moving these to a common
header is clearly too dangerous.  I'm sorry if I've said otherwise in
the past, but I've only recently discovered a userspace driver (DPDK)
copying these defines and ignoring the index offsets reported through
the REGION_INFO API.  So I'm now bitterly aware how an internal
implementation detail can be abused and if we don't catch them, it's
going to lock us into an implementation that was designed to be
flexible.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 3/4] vfio iommu: Add support for mediated devices
  2016-08-09 19:00     ` [Qemu-devel] " Alex Williamson
@ 2016-08-11 14:22       ` Kirti Wankhede
  -1 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-11 14:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi


Thanks Alex. I'll take care of the suggested nits and rename the
structures and functions.

On 8/10/2016 12:30 AM, Alex Williamson wrote:
> On Thu, 4 Aug 2016 00:33:53 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>
...

>>
>> +/*
>> + * Pin a set of guest PFNs and return their associated host PFNs for
>> + * mediated domain only.
>
> Why only mediated domain?  What assumption is specific to a mediated
> domain other than unnecessarily passing an mdev_device?
>
>> + * @user_pfn [in]: array of user/guest PFNs
>> + * @npage [in]: count of array elements
>> + * @prot [in] : protection flags
>> + * @phys_pfn[out] : array of host PFNs
>> + */
>> +long vfio_pin_pages(struct mdev_device *mdev, unsigned long *user_pfn,
>
> Why use an mdev_device here?  We only reference the struct device to
> get the drvdata.  (dev also not listed above in param description)
>

Ok.

>> +		    long npage, int prot, unsigned long *phys_pfn)
>> +{
>> +	struct vfio_device *device;
>> +	struct vfio_container *container;
>> +	struct vfio_iommu_driver *driver;
>> +	ssize_t ret = -EINVAL;
>> +
>> +	if (!mdev || !user_pfn || !phys_pfn)
>> +		return -EINVAL;
>> +
>> +	device = dev_get_drvdata(&mdev->dev);
>> +
>> +	if (!device || !device->group)
>> +		return -EINVAL;
>> +
>> +	container = device->group->container;
>
> This doesn't seem like a valid way to get a reference to the container
> and in fact there is no reference at all.  I think you need to use
> vfio_device_get_from_dev(), check and increment container_users around
> the callback, abort on noiommu groups, and check for viability.
>

Thanks for pointing that out. I'll change it as suggested.

>
>
> I see how you're trying to only do accounting when there is only an
> mdev (local) domain, but the devices attached to the normal iommu API
> domain can go away at any point.  Where do we re-establish accounting
> should the pinning from those devices be removed?  I don't see that as
> being an optional support case since userspace can already do this.
>

I missed this case. So in that case, when
vfio_iommu_type1_detach_group() is called for the iommu group of that
device and it is the last entry in the iommu-capable domain_list, it
should re-iterate through the pfn_list of mediated_domain and redo the
accounting, right? Then we also have to update the accounting when an
iommu-capable device is hotplugged while mediated_domain already
exists.

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
  2016-08-10 23:00         ` [Qemu-devel] " Alex Williamson
@ 2016-08-11 15:59           ` Kirti Wankhede
  -1 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-11 15:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi



On 8/11/2016 4:30 AM, Alex Williamson wrote:
> On Thu, 11 Aug 2016 02:53:10 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 8/10/2016 12:30 AM, Alex Williamson wrote:
>>> On Thu, 4 Aug 2016 00:33:52 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>
>> ...
>>
>>>> +
>>>> +		switch (info.index) {
>>>> +		case VFIO_PCI_CONFIG_REGION_INDEX:
>>>> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
>>>> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);  
>>>
>>> No, vmdev->vfio_region_info[info.index].offset
>>>  
>>
>> Ok.
>>
>>>> +			info.size = vmdev->vfio_region_info[info.index].size;
>>>> +			if (!info.size) {
>>>> +				info.flags = 0;
>>>> +				break;
>>>> +			}
>>>> +
>>>> +			info.flags = vmdev->vfio_region_info[info.index].flags;
>>>> +			break;
>>>> +		case VFIO_PCI_VGA_REGION_INDEX:
>>>> +		case VFIO_PCI_ROM_REGION_INDEX:  
>>>
>>> Why?  Let the vendor driver decide.
>>>   
>>
>> Ok.
>>
>>>> +		switch (info.index) {
>>>> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
>>>> +		case VFIO_PCI_REQ_IRQ_INDEX:
>>>> +			break;
>>>> +			/* pass thru to return error */
>>>> +		case VFIO_PCI_MSIX_IRQ_INDEX:  
>>>
>>> ???  
>>
>> Sorry, I missed to update this. Updating it.
>>
>>>> +	case VFIO_DEVICE_SET_IRQS:
>>>> +	{  
>> ...
>>>> +
>>>> +		if (parent->ops->set_irqs)
>>>> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
>>>> +						    hdr.start, hdr.count, data);
>>>> +
>>>> +		kfree(ptr);
>>>> +		return ret;  
>>>
>>> Return success if no set_irqs callback?
>>>  
>>
>> Ideally, vendor driver should provide this function. If vendor driver
>> doesn't provide it, do we really need to fail here?
> 
> Wouldn't you as a user expect to get an error if you try to call an
> ioctl that has no backing rather than assume success and never receive
> an interrupt?
>  

If we really don't want to proceed when set_irqs() is not provided,
then it's better to add it to the mandatory list in
mdev_register_device() in mdev_core.c and fail earlier, i.e. fail to
register the device.


>>>> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
>>>> +			      size_t count, loff_t *ppos)
>>>> +{
>>>> +	struct vfio_mdev *vmdev = device_data;

* Re: [Qemu-devel] [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
@ 2016-08-11 15:59           ` Kirti Wankhede
  0 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-11 15:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi



On 8/11/2016 4:30 AM, Alex Williamson wrote:
> On Thu, 11 Aug 2016 02:53:10 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 8/10/2016 12:30 AM, Alex Williamson wrote:
>>> On Thu, 4 Aug 2016 00:33:52 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>
>> ...
>>
>>>> +
>>>> +		switch (info.index) {
>>>> +		case VFIO_PCI_CONFIG_REGION_INDEX:
>>>> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
>>>> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);  
>>>
>>> No, vmdev->vfio_region_info[info.index].offset
>>>  
>>
>> Ok.
>>
>>>> +			info.size = vmdev->vfio_region_info[info.index].size;
>>>> +			if (!info.size) {
>>>> +				info.flags = 0;
>>>> +				break;
>>>> +			}
>>>> +
>>>> +			info.flags = vmdev->vfio_region_info[info.index].flags;
>>>> +			break;
>>>> +		case VFIO_PCI_VGA_REGION_INDEX:
>>>> +		case VFIO_PCI_ROM_REGION_INDEX:  
>>>
>>> Why?  Let the vendor driver decide.
>>>   
>>
>> Ok.
>>
>>>> +		switch (info.index) {
>>>> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
>>>> +		case VFIO_PCI_REQ_IRQ_INDEX:
>>>> +			break;
>>>> +			/* pass thru to return error */
>>>> +		case VFIO_PCI_MSIX_IRQ_INDEX:  
>>>
>>> ???  
>>
>> Sorry, I missed to update this. Updating it.
>>
>>>> +	case VFIO_DEVICE_SET_IRQS:
>>>> +	{  
>> ...
>>>> +
>>>> +		if (parent->ops->set_irqs)
>>>> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
>>>> +						    hdr.start, hdr.count, data);
>>>> +
>>>> +		kfree(ptr);
>>>> +		return ret;  
>>>
>>> Return success if no set_irqs callback?
>>>  
>>
>> Ideally, vendor driver should provide this function. If vendor driver
>> doesn't provide it, do we really need to fail here?
> 
> Wouldn't you as a user expect to get an error if you try to call an
> ioctl that has no backing rather than assume success and never receive
> and interrupt?
>  

If we really don't want to proceed when set_irqs() is not provided, then
it's better to add it to the mandatory callback list in
mdev_register_device() in mdev_core.c and fail earlier, i.e. fail to
register the device.
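
A minimal userspace sketch of that early registration check (the struct
and names here are simplified placeholders for illustration, not the
actual patch):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Placeholder subset of the parent_device callbacks; the real structure
 * in the patch has more members and different signatures. */
struct parent_ops_sketch {
	int (*get_device_info)(void);
	int (*set_irqs)(void);
};

/* Reject registration up front when a mandatory callback is missing,
 * rather than failing later at ioctl time. */
static int mdev_check_mandatory_ops(const struct parent_ops_sketch *ops)
{
	if (!ops || !ops->get_device_info || !ops->set_irqs)
		return -EINVAL;
	return 0;
}

static int dummy_cb(void) { return 0; }

/* Demo helpers so the behavior is easy to exercise. */
static int demo_missing_set_irqs(void)
{
	struct parent_ops_sketch ops = { dummy_cb, NULL };
	return mdev_check_mandatory_ops(&ops);
}

static int demo_complete_ops(void)
{
	struct parent_ops_sketch ops = { dummy_cb, dummy_cb };
	return mdev_check_mandatory_ops(&ops);
}
```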


>>>> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
>>>> +			      size_t count, loff_t *ppos)
>>>> +{
>>>> +	struct vfio_mdev *vmdev = device_data;
>>>> +	struct mdev_device *mdev = vmdev->mdev;
>>>> +	struct parent_device *parent = mdev->parent;
>>>> +	int ret = 0;
>>>> +
>>>> +	if (!count)
>>>> +		return 0;
>>>> +
>>>> +	if (parent->ops->read) {
>>>> +		char *ret_data, *ptr;
>>>> +
>>>> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);  
>>>
>>> Do we really need to support arbitrary lengths in one shot?  Seems like
>>> we could just use a 4 or 8 byte variable on the stack and iterate until
>>> done.
>>>   
>>
>> We just want to pass the arguments to vendor driver as is here. Vendor
>> driver could take care of that.
> 
> But I think this is exploitable, it lets the user make the kernel
> allocate an arbitrarily sized buffer.
>  
>>>> +
>>>> +static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
>>>> +			       size_t count, loff_t *ppos)
>>>> +{
>>>> +	struct vfio_mdev *vmdev = device_data;
>>>> +	struct mdev_device *mdev = vmdev->mdev;
>>>> +	struct parent_device *parent = mdev->parent;
>>>> +	int ret = 0;
>>>> +
>>>> +	if (!count)
>>>> +		return 0;
>>>> +
>>>> +	if (parent->ops->write) {
>>>> +		char *usr_data, *ptr;
>>>> +
>>>> +		ptr = usr_data = memdup_user(buf, count);  
>>>
>>> Same here, how much do we care to let the user write in one pass and is
>>> there any advantage to it?  When QEMU is our userspace we're only
>>> likely to see 4-byte accesses anyway.  
>>
>> Same as above.
>>
>>>> +
>>>> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>>>> +{  
>> ...
>>>> +	} else {
>>>> +		struct pci_dev *pdev;
>>>> +
>>>> +		virtaddr = vma->vm_start;
>>>> +		req_size = vma->vm_end - vma->vm_start;
>>>> +
>>>> +		pdev = to_pci_dev(parent->dev);
>>>> +		index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);  
>>>
>>> Iterate through region_info[*].offset/size provided by vendor driver.
>>>   
>>
>> Yes, makes sense.
>>
>>>> +
>>>> +int vfio_mpci_match(struct device *dev)
>>>> +{
>>>> +	if (dev_is_pci(dev->parent))  
>>>
>>> This is the wrong test, there's really no requirement that a pci mdev
>>> device is hosted by a real pci device.    
>>
>> Ideally this module is for the mediated device whose parent is PCI
>> device. And we are relying on kernel functions like
>> pci_resource_start(), to_pci_dev() in this module, so better to check it
>> while loading.
> 
> IMO, we don't want to care what the parent device is, it's not ideal,
> it's actually a limitation to impose that it is a PCI device.  I want to
> be able to make purely virtual mediated devices.  I only see that you
> use these functions in the mmio fault handling.  Is it useful to assume
> that on mmio fault we map to the parent device PCI BAR regions?  Just
> require that the vendor driver provides a fault mapping function or
> SIGBUS if we get a fault and it doesn't.
> 
>>> Can't we check that the device
>>> is on an mdev_pci_bus_type?
>>>   
>>
>> I didn't get this part.
>>
>> Each mediated device is of mdev_bus_type. But VFIO module could be
>> different based on parent device type and loaded at the same time. For
>> example, there should be different modules for channel IO or any other
>> type of devices and could be loaded at the same time. Then when mdev
>> device is created based on check in match() function of each module, and
>> proper driver would be linked for that mdev device.
>>
>> If this check is not based on parent device type, do you expect to set
>> parent device type by vendor driver and accordingly load corresponding
>> VFIO driver?
> 
> mdev_pci_bus_type was an off the cuff response since the driver.bus
> controls which devices a probe function will see.  If we have a unique
> bus for a driver and create devices appropriately, we really don't
> even need a match function. 

I still think that all types of mdev devices should have a unique bus type
so that the VFIO IOMMU module can be used for any type of mediated device
without any change. Otherwise we have to add checks for all supported
bus types in vfio_iommu_type1_attach_group().

> That would still work, but what if you
> made a get_device_info callback to the vendor driver rather than
> creating that info in the mediated bus driver layer.  Then the probe
> function here could simply check the flags to see if the device is
> VFIO_DEVICE_FLAGS_PCI?
> 

Right. get_device_info() would be a mandatory callback and it would be
the vendor driver's responsibility to return the proper flag.


>>>> @@ -18,6 +18,7 @@
>>>>  #include <linux/uaccess.h>
>>>>  #include <linux/io.h>
>>>>  #include <linux/vgaarb.h>
>>>> +#include <linux/vfio.h>
>>>>  
>>>>  #include "vfio_pci_private.h"
>>>>  
>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>> index 0ecae0b1cd34..431b824b0d3e 100644
>>>> --- a/include/linux/vfio.h
>>>> +++ b/include/linux/vfio.h
>>>> @@ -18,6 +18,13 @@
>>>>  #include <linux/poll.h>
>>>>  #include <uapi/linux/vfio.h>
>>>>  
>>>> +#define VFIO_PCI_OFFSET_SHIFT   40
>>>> +
>>>> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
>>>> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
>>>> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
>>>> +
>>>> +  
>>>
>>> Nak this, I'm not interested in making this any sort of ABI.
>>>   
>>
>> These macros are used by drivers/vfio/pci/vfio_pci.c and
>> drivers/vfio/mdev/vfio_mpci.c and to use those in both these modules,
>> they should be moved to common place as you suggested in earlier
>> reviews. I think this is better common place. Are there any other
>> suggestion?
> 
> They're only used in ways that I objected to above and you've agreed
> to.  These define implementation details that must not become part of
> the mediated vendor driver ABI.  A vendor driver is free to redefine
> this the same if they want, but as we can see with how easily they slip
> into code where they don't belong, the only way to make sure they don't
> become ABI is to keep them in private headers.
>  

Then I think I can't use these macros in the mdev modules; they are
defined in drivers/vfio/pci/vfio_pci_private.h, so I have to define
similar macros in drivers/vfio/mdev/mdev_private.h?

parent->ops->get_region_info() is called from vfio_mpci_open(), that is,
before PCI config space is set up. The main expectation from
get_region_info() was to get flags and size. At that point the vendor
driver also doesn't know the base addresses of the regions.

    case VFIO_DEVICE_GET_REGION_INFO:
...

        info.offset = vmdev->vfio_region_info[info.index].offset;

In that case, as suggested in the previous reply, the above is not going
to work. I'll define such macros in drivers/vfio/mdev/mdev_private.h and
set the above offset according to them. Then on first access to any BAR
region, i.e. after PCI config space is populated, call
parent->ops->get_region_info() again so that
vfio_region_info[index].offset for all regions is set by the vendor
driver. Then use these offsets to calculate 'pos' for
read/write/validate_map_request(). Does this seem reasonable?
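
For what it's worth, the arithmetic of such private macros is simple; a
standalone sketch (macro names invented here to stress that they would
live in mdev_private.h as an internal convention, not shared ABI):

```c
#include <assert.h>
#include <stdint.h>

/* Private copy of the index/offset scheme as it might look in
 * drivers/vfio/mdev/mdev_private.h -- deliberately not a shared header. */
#define MDEV_PCI_OFFSET_SHIFT	40
#define MDEV_PCI_INDEX_TO_OFFSET(index)	((uint64_t)(index) << MDEV_PCI_OFFSET_SHIFT)
#define MDEV_PCI_OFFSET_TO_INDEX(off)	((uint64_t)(off) >> MDEV_PCI_OFFSET_SHIFT)
#define MDEV_PCI_OFFSET_MASK	(((uint64_t)1 << MDEV_PCI_OFFSET_SHIFT) - 1)

/* Split a device file offset into (region index, offset within region). */
static uint64_t region_index(uint64_t pos)  { return MDEV_PCI_OFFSET_TO_INDEX(pos); }
static uint64_t region_offset(uint64_t pos) { return pos & MDEV_PCI_OFFSET_MASK; }
```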

Thanks,
Kirti

>>>> +static u8 mpci_find_pci_capability(struct mdev_device *mdev, u8 capability)
>>>> +{
>>>> +	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);  
>>>
>>> This creates a fixed ABI between vfio-mdev-pci and vendor drivers that
>>> a given region starts at a pre-defined offset.  We have the offset
>>> stored in vfio_mdev.region_info[VFIO_PCI_CONFIG_REGION_INDEX].offset,
>>> use it.  It's just as unacceptable to impose this fixed relationship
>>> with a vendor driver here as if a userspace driver were to do the same.
>>>   
>>
>> In the v5 version, where config space was cached in this module,
>> suggestion was to don't care about data or caching it at read/write,
>> just pass it through. Now since VFIO_PCI_* macros are also available
>> here, vendor driver can use it to decode pos to find region index and
>> offset of access. Then vendor driver itself add
>> vmdev->vfio_region_info[info.index].offset, which is known to him.
>> Either we do this in VFIO module or vendor driver?
> 
> As I say above, a vendor driver is absolutely free to use the same
> index/offset scheme, but it absolutely must not be part of the ABI
> between vendor drivers and the mediated driver core.  It's up to the
> vendor driver to define that relation and moving these to a common
> header is clearly too dangerous.  I'm sorry if I've said otherwise in
> the past, but I've only recently discovered a userspace driver (DPDK)
> copying these defines and ignoring the index offsets reported through
> the REGION_INFO API.  So I'm now bitterly aware how an internal
> implementation detail can be abused and if we don't catch them, it's
> going to lock us into an implementation that was designed to be
> flexible.  Thanks,
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
  2016-08-11 15:59           ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-11 16:24             ` Alex Williamson
  -1 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-11 16:24 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Thu, 11 Aug 2016 21:29:35 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 8/11/2016 4:30 AM, Alex Williamson wrote:
> > On Thu, 11 Aug 2016 02:53:10 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 8/10/2016 12:30 AM, Alex Williamson wrote:  
> >>> On Thu, 4 Aug 2016 00:33:52 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>
> >> ...
> >>  
> >>>> +
> >>>> +		switch (info.index) {
> >>>> +		case VFIO_PCI_CONFIG_REGION_INDEX:
> >>>> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> >>>> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);    
> >>>
> >>> No, vmdev->vfio_region_info[info.index].offset
> >>>    
> >>
> >> Ok.
> >>  
> >>>> +			info.size = vmdev->vfio_region_info[info.index].size;
> >>>> +			if (!info.size) {
> >>>> +				info.flags = 0;
> >>>> +				break;
> >>>> +			}
> >>>> +
> >>>> +			info.flags = vmdev->vfio_region_info[info.index].flags;
> >>>> +			break;
> >>>> +		case VFIO_PCI_VGA_REGION_INDEX:
> >>>> +		case VFIO_PCI_ROM_REGION_INDEX:    
> >>>
> >>> Why?  Let the vendor driver decide.
> >>>     
> >>
> >> Ok.
> >>  
> >>>> +		switch (info.index) {
> >>>> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
> >>>> +		case VFIO_PCI_REQ_IRQ_INDEX:
> >>>> +			break;
> >>>> +			/* pass thru to return error */
> >>>> +		case VFIO_PCI_MSIX_IRQ_INDEX:    
> >>>
> >>> ???    
> >>
> >> Sorry, I missed to update this. Updating it.
> >>  
> >>>> +	case VFIO_DEVICE_SET_IRQS:
> >>>> +	{    
> >> ...  
> >>>> +
> >>>> +		if (parent->ops->set_irqs)
> >>>> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
> >>>> +						    hdr.start, hdr.count, data);
> >>>> +
> >>>> +		kfree(ptr);
> >>>> +		return ret;    
> >>>
> >>> Return success if no set_irqs callback?
> >>>    
> >>
> >> Ideally, vendor driver should provide this function. If vendor driver
> >> doesn't provide it, do we really need to fail here?  
> > 
> > Wouldn't you as a user expect to get an error if you try to call an
> > ioctl that has no backing rather than assume success and never receive
> > and interrupt?
> >    
> 
> If we really don't want to proceed if set_irqs() is not provided then
> its better to add it in mandatory list in mdev_register_device() in
> mdev_core.c and fail earlier, i.e. fail to register the device.

Is a device required to implement some form of interrupt to be useful?
What if there's a memory-only device that does not report INTx or
provide MSI or MSI-X capabilities?  It could still be PCI spec
compliant.  Really though it's just a matter of whether we're going to
require the mediated driver to provide a set_irqs() stub or let them
skip it and return an error ourselves.  Either is really fine with me,
but we can't return success for an ioctl that has no backing.
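
As a sketch, reporting an error for the unbacked ioctl could look like
this (userspace mock; the struct name and signature are simplified
placeholders, not the actual patch):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct irq_ops_sketch {
	int (*set_irqs)(unsigned int flags);	/* NULL if not implemented */
};

/* Dispatch the SET_IRQS path only when the vendor driver backs it;
 * otherwise report an error instead of silently succeeding. */
static int do_set_irqs(const struct irq_ops_sketch *ops, unsigned int flags)
{
	if (!ops->set_irqs)
		return -ENOTTY;
	return ops->set_irqs(flags);
}

static int accept_irqs(unsigned int flags) { (void)flags; return 0; }

static int demo_unbacked(void)
{
	struct irq_ops_sketch ops = { NULL };
	return do_set_irqs(&ops, 0);
}

static int demo_backed(void)
{
	struct irq_ops_sketch ops = { accept_irqs };
	return do_set_irqs(&ops, 0);
}
```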
 
> >>>> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
> >>>> +			      size_t count, loff_t *ppos)
> >>>> +{
> >>>> +	struct vfio_mdev *vmdev = device_data;
> >>>> +	struct mdev_device *mdev = vmdev->mdev;
> >>>> +	struct parent_device *parent = mdev->parent;
> >>>> +	int ret = 0;
> >>>> +
> >>>> +	if (!count)
> >>>> +		return 0;
> >>>> +
> >>>> +	if (parent->ops->read) {
> >>>> +		char *ret_data, *ptr;
> >>>> +
> >>>> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);    
> >>>
> >>> Do we really need to support arbitrary lengths in one shot?  Seems like
> >>> we could just use a 4 or 8 byte variable on the stack and iterate until
> >>> done.
> >>>     
> >>
> >> We just want to pass the arguments to vendor driver as is here. Vendor
> >> driver could take care of that.  
> > 
> > But I think this is exploitable, it lets the user make the kernel
> > allocate an arbitrarily sized buffer.
> >    
> >>>> +
> >>>> +static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
> >>>> +			       size_t count, loff_t *ppos)
> >>>> +{
> >>>> +	struct vfio_mdev *vmdev = device_data;
> >>>> +	struct mdev_device *mdev = vmdev->mdev;
> >>>> +	struct parent_device *parent = mdev->parent;
> >>>> +	int ret = 0;
> >>>> +
> >>>> +	if (!count)
> >>>> +		return 0;
> >>>> +
> >>>> +	if (parent->ops->write) {
> >>>> +		char *usr_data, *ptr;
> >>>> +
> >>>> +		ptr = usr_data = memdup_user(buf, count);    
> >>>
> >>> Same here, how much do we care to let the user write in one pass and is
> >>> there any advantage to it?  When QEMU is our userspace we're only
> >>> likely to see 4-byte accesses anyway.    
> >>
> >> Same as above.
> >>  
> >>>> +
> >>>> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> >>>> +{    
> >> ...  
> >>>> +	} else {
> >>>> +		struct pci_dev *pdev;
> >>>> +
> >>>> +		virtaddr = vma->vm_start;
> >>>> +		req_size = vma->vm_end - vma->vm_start;
> >>>> +
> >>>> +		pdev = to_pci_dev(parent->dev);
> >>>> +		index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);    
> >>>
> >>> Iterate through region_info[*].offset/size provided by vendor driver.
> >>>     
> >>
> >> Yes, makes sense.
> >>  
> >>>> +
> >>>> +int vfio_mpci_match(struct device *dev)
> >>>> +{
> >>>> +	if (dev_is_pci(dev->parent))    
> >>>
> >>> This is the wrong test, there's really no requirement that a pci mdev
> >>> device is hosted by a real pci device.      
> >>
> >> Ideally this module is for the mediated device whose parent is PCI
> >> device. And we are relying on kernel functions like
> >> pci_resource_start(), to_pci_dev() in this module, so better to check it
> >> while loading.  
> > 
> > IMO, we don't want to care what the parent device is, it's not ideal,
> > it's actually a limitation to impose that it is a PCI device.  I want to
> > be able to make purely virtual mediated devices.  I only see that you
> > use these functions in the mmio fault handling.  Is it useful to assume
> > that on mmio fault we map to the parent device PCI BAR regions?  Just
> > require that the vendor driver provides a fault mapping function or
> > SIGBUS if we get a fault and it doesn't.
> >   
> >>> Can't we check that the device
> >>> is on an mdev_pci_bus_type?
> >>>     
> >>
> >> I didn't get this part.
> >>
> >> Each mediated device is of mdev_bus_type. But VFIO module could be
> >> different based on parent device type and loaded at the same time. For
> >> example, there should be different modules for channel IO or any other
> >> type of devices and could be loaded at the same time. Then when mdev
> >> device is created based on check in match() function of each module, and
> >> proper driver would be linked for that mdev device.
> >>
> >> If this check is not based on parent device type, do you expect to set
> >> parent device type by vendor driver and accordingly load corresponding
> >> VFIO driver?  
> > 
> > mdev_pci_bus_type was an off the cuff response since the driver.bus
> > controls which devices a probe function will see.  If we have a unique
> > bus for a driver and create devices appropriately, we really don't
> > even need a match function.   
> 
> I still think that all types of mdev devices should have unique bus type
> so that VFIO IOMMU module could be used for any type of mediated device
> without any change. Otherwise we have to add checks for all supported
> bus types in vfio_iommu_type1_attach_group().

Good point, so perhaps the vendor driver reporting the type through
vfio_device_info.flags is the way to go.

> > That would still work, but what if you
> > made a get_device_info callback to the vendor driver rather than
> > creating that info in the mediated bus driver layer.  Then the probe
> > function here could simply check the flags to see if the device is
> > VFIO_DEVICE_FLAGS_PCI?
> >   
> 
> Right. get_device_info() would be a mandatory callback and it would be
> vendor driver's responsibility to return proper flag.

Yep, then we don't care what the parent device is, the flags will tell
us that the mediated device adheres to PCI and that's all we care about
for binding here.
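
A rough userspace mock of that probe-time check (the
VFIO_DEVICE_FLAGS_PCI value is copied from uapi vfio.h; the callback
shape is invented for illustration):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* as in include/uapi/linux/vfio.h */

struct device_info_sketch {
	uint32_t flags;
};

typedef int (*get_device_info_fn)(struct device_info_sketch *info);

/* Bind only when the vendor driver reports a PCI-style mediated device;
 * the parent device's actual bus type no longer matters. */
static int vfio_mpci_probe_sketch(get_device_info_fn get_info)
{
	struct device_info_sketch info = { 0 };

	if (!get_info || get_info(&info))
		return -EINVAL;
	if (!(info.flags & VFIO_DEVICE_FLAGS_PCI))
		return -ENODEV;	/* let some other mdev driver claim it */
	return 0;
}

static int pci_style_info(struct device_info_sketch *info)
{
	info->flags = VFIO_DEVICE_FLAGS_PCI;
	return 0;
}

static int other_style_info(struct device_info_sketch *info)
{
	info->flags = 0;	/* e.g. a channel-IO mediated device */
	return 0;
}
```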

> >>>> @@ -18,6 +18,7 @@
> >>>>  #include <linux/uaccess.h>
> >>>>  #include <linux/io.h>
> >>>>  #include <linux/vgaarb.h>
> >>>> +#include <linux/vfio.h>
> >>>>  
> >>>>  #include "vfio_pci_private.h"
> >>>>  
> >>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>>> index 0ecae0b1cd34..431b824b0d3e 100644
> >>>> --- a/include/linux/vfio.h
> >>>> +++ b/include/linux/vfio.h
> >>>> @@ -18,6 +18,13 @@
> >>>>  #include <linux/poll.h>
> >>>>  #include <uapi/linux/vfio.h>
> >>>>  
> >>>> +#define VFIO_PCI_OFFSET_SHIFT   40
> >>>> +
> >>>> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
> >>>> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> >>>> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> >>>> +
> >>>> +    
> >>>
> >>> Nak this, I'm not interested in making this any sort of ABI.
> >>>     
> >>
> >> These macros are used by drivers/vfio/pci/vfio_pci.c and
> >> drivers/vfio/mdev/vfio_mpci.c and to use those in both these modules,
> >> they should be moved to common place as you suggested in earlier
> >> reviews. I think this is better common place. Are there any other
> >> suggestion?  
> > 
> > They're only used in ways that I objected to above and you've agreed
> > to.  These define implementation details that must not become part of
> > the mediated vendor driver ABI.  A vendor driver is free to redefine
> > this the same if they want, but as we can see with how easily they slip
> > into code where they don't belong, the only way to make sure they don't
> > become ABI is to keep them in private headers.
> >    
> 
> Then I think, I can't use these macros in mdev modules, they are defined
> in drivers/vfio/pci/vfio_pci_private.h
> I have to define similar macros in drivers/vfio/mdev/mdev_private.h?
> 
> parent->ops->get_region_info() is called from vfio_mpci_open() that is
> before PCI config space is setup. Main expectation from
> get_region_info() was to get flags and size. At this point of time
> vendor driver also don't know about the base addresses of regions.
> 
>     case VFIO_DEVICE_GET_REGION_INFO:
> ...
> 
>         info.offset = vmdev->vfio_region_info[info.index].offset;
> 
> In that case, as suggested in previous reply, above is not going to work.
> I'll define such macros in drivers/vfio/mdev/mdev_private.h, set above
> offset according to these macros. Then on first access to any BAR
> region, i.e. after PCI config space is populated, call
> parent->ops->get_region_info() again so that
> vfio_region_info[index].offset for all regions are set by vendor driver.
> Then use these offsets to calculate 'pos' for
> read/write/validate_map_request(). Does this seems reasonable?

This doesn't make any sense to me, there should be absolutely no reason
for the mid-layer mediated device infrastructure to impose region
offsets.  vfio-pci is a leaf driver, like the mediated vendor driver.
Only the leaf drivers can define how they layout the offsets within the
device file descriptor.  Being a VFIO_PCI device only defines region
indexes to resources, not offsets (ie. region 0 is BAR0, region 1 is
BAR1,... region 7 is PCI config space).  If this mid-layer even needs
to know region offsets, then caching them on opening the vendor device
is certainly sufficient.  Remember we're talking about the offset into
the vfio device file descriptor, how that potentially maps onto a
physical MMIO space later doesn't matter here.  It seems like maybe
we're confusing those points.  Anyway, the more I hear about needing to
reproduce these INDEX/OFFSET translation macros in places they
shouldn't be used, the more confident I am in keeping them private.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [Qemu-devel] [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
@ 2016-08-11 16:24             ` Alex Williamson
  0 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-11 16:24 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Thu, 11 Aug 2016 21:29:35 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 8/11/2016 4:30 AM, Alex Williamson wrote:
> > On Thu, 11 Aug 2016 02:53:10 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 8/10/2016 12:30 AM, Alex Williamson wrote:  
> >>> On Thu, 4 Aug 2016 00:33:52 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>
> >> ...
> >>  
> >>>> +
> >>>> +		switch (info.index) {
> >>>> +		case VFIO_PCI_CONFIG_REGION_INDEX:
> >>>> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> >>>> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);    
> >>>
> >>> No, vmdev->vfio_region_info[info.index].offset
> >>>    
> >>
> >> Ok.
> >>  
> >>>> +			info.size = vmdev->vfio_region_info[info.index].size;
> >>>> +			if (!info.size) {
> >>>> +				info.flags = 0;
> >>>> +				break;
> >>>> +			}
> >>>> +
> >>>> +			info.flags = vmdev->vfio_region_info[info.index].flags;
> >>>> +			break;
> >>>> +		case VFIO_PCI_VGA_REGION_INDEX:
> >>>> +		case VFIO_PCI_ROM_REGION_INDEX:    
> >>>
> >>> Why?  Let the vendor driver decide.
> >>>     
> >>
> >> Ok.
> >>  
> >>>> +		switch (info.index) {
> >>>> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
> >>>> +		case VFIO_PCI_REQ_IRQ_INDEX:
> >>>> +			break;
> >>>> +			/* pass thru to return error */
> >>>> +		case VFIO_PCI_MSIX_IRQ_INDEX:    
> >>>
> >>> ???    
> >>
> >> Sorry, I missed to update this. Updating it.
> >>  
> >>>> +	case VFIO_DEVICE_SET_IRQS:
> >>>> +	{    
> >> ...  
> >>>> +
> >>>> +		if (parent->ops->set_irqs)
> >>>> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
> >>>> +						    hdr.start, hdr.count, data);
> >>>> +
> >>>> +		kfree(ptr);
> >>>> +		return ret;    
> >>>
> >>> Return success if no set_irqs callback?
> >>>    
> >>
> >> Ideally, vendor driver should provide this function. If vendor driver
> >> doesn't provide it, do we really need to fail here?  
> > 
> > Wouldn't you as a user expect to get an error if you try to call an
> > ioctl that has no backing rather than assume success and never receive
> > and interrupt?
> >    
> 
> If we really don't want to proceed if set_irqs() is not provided then
> its better to add it in mandatory list in mdev_register_device() in
> mdev_core.c and fail earlier, i.e. fail to register the device.

Is a device required to implement some form of interrupt to be useful?
What if there's a memory-only device that does not report INTx or
provide MSI or MSI-X capabilities?  It could still be PCI spec
compliant.  Really though it's just a matter of whether we're going to
require the mediated driver to provide a set_irqs() stub or let them
skip it and return an error ourselves.  Either is really fine with me,
but we can't return success for an ioctl that has no backing.
 
> >>>> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
> >>>> +			      size_t count, loff_t *ppos)
> >>>> +{
> >>>> +	struct vfio_mdev *vmdev = device_data;
> >>>> +	struct mdev_device *mdev = vmdev->mdev;
> >>>> +	struct parent_device *parent = mdev->parent;
> >>>> +	int ret = 0;
> >>>> +
> >>>> +	if (!count)
> >>>> +		return 0;
> >>>> +
> >>>> +	if (parent->ops->read) {
> >>>> +		char *ret_data, *ptr;
> >>>> +
> >>>> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);    
> >>>
> >>> Do we really need to support arbitrary lengths in one shot?  Seems like
> >>> we could just use a 4 or 8 byte variable on the stack and iterate until
> >>> done.
> >>>     
> >>
> >> We just want to pass the arguments to vendor driver as is here. Vendor
> >> driver could take care of that.  
> > 
> > But I think this is exploitable, it lets the user make the kernel
> > allocate an arbitrarily sized buffer.
> >    
> >>>> +
> >>>> +static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
> >>>> +			       size_t count, loff_t *ppos)
> >>>> +{
> >>>> +	struct vfio_mdev *vmdev = device_data;
> >>>> +	struct mdev_device *mdev = vmdev->mdev;
> >>>> +	struct parent_device *parent = mdev->parent;
> >>>> +	int ret = 0;
> >>>> +
> >>>> +	if (!count)
> >>>> +		return 0;
> >>>> +
> >>>> +	if (parent->ops->write) {
> >>>> +		char *usr_data, *ptr;
> >>>> +
> >>>> +		ptr = usr_data = memdup_user(buf, count);    
> >>>
> >>> Same here, how much do we care to let the user write in one pass and is
> >>> there any advantage to it?  When QEMU is our userspace we're only
> >>> likely to see 4-byte accesses anyway.    
> >>
> >> Same as above.
> >>  
> >>>> +
> >>>> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> >>>> +{    
> >> ...  
> >>>> +	} else {
> >>>> +		struct pci_dev *pdev;
> >>>> +
> >>>> +		virtaddr = vma->vm_start;
> >>>> +		req_size = vma->vm_end - vma->vm_start;
> >>>> +
> >>>> +		pdev = to_pci_dev(parent->dev);
> >>>> +		index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);    
> >>>
> >>> Iterate through region_info[*].offset/size provided by vendor driver.
> >>>     
> >>
> >> Yes, makes sense.
> >>  
> >>>> +
> >>>> +int vfio_mpci_match(struct device *dev)
> >>>> +{
> >>>> +	if (dev_is_pci(dev->parent))    
> >>>
> >>> This is the wrong test, there's really no requirement that a pci mdev
> >>> device is hosted by a real pci device.      
> >>
> >> Ideally this module is for the mediated device whose parent is PCI
> >> device. And we are relying on kernel functions like
> >> pci_resource_start(), to_pci_dev() in this module, so better to check it
> >> while loading.  
> > 
> > IMO, we don't want to care what the parent device is, it's not ideal,
> > it's actually a limitation to impose that it is a PCI device.  I want to
> > be able to make purely virtual mediated devices.  I only see that you
> > use these functions in the mmio fault handling.  Is it useful to assume
> > that on mmio fault we map to the parent device PCI BAR regions?  Just
> > require that the vendor driver provides a fault mapping function or
> > SIGBUS if we get a fault and it doesn't.
> >   
> >>> Can't we check that the device
> >>> is on an mdev_pci_bus_type?
> >>>     
> >>
> >> I didn't get this part.
> >>
> >> Each mediated device is of mdev_bus_type. But VFIO module could be
> >> different based on parent device type and loaded at the same time. For
> >> example, there should be different modules for channel IO or any other
> >> type of devices and could be loaded at the same time. Then when mdev
> >> device is created based on check in match() function of each module, and
> >> proper driver would be linked for that mdev device.
> >>
> >> If this check is not based on parent device type, do you expect to set
> >> parent device type by vendor driver and accordingly load corresponding
> >> VFIO driver?  
> > 
> > mdev_pci_bus_type was an off the cuff response since the driver.bus
> > controls which devices a probe function will see.  If we have a unique
> > bus for a driver and create devices appropriately, we really don't
> > even need a match function.   
> 
> I still think that all types of mdev devices should have unique bus type
> so that VFIO IOMMU module could be used for any type of mediated device
> without any change. Otherwise we have to add checks for all supported
> bus types in vfio_iommu_type1_attach_group().

Good point, so perhaps the vendor driver reporting the type through
vfio_device_info.flags is the way to go.

> > That would still work, but what if you
> > made a get_device_info callback to the vendor driver rather than
> > creating that info in the mediated bus driver layer.  Then the probe
> > function here could simply check the flags to see if the device is
> > VFIO_DEVICE_FLAGS_PCI?
> >   
> 
> Right. get_device_info() would be a mandatory callback and it would be
> vendor driver's responsibility to return proper flag.

Yep, then we don't care what the parent device is, the flags will tell
us that the mediated device adheres to PCI and that's all we care about
for binding here.

> >>>> @@ -18,6 +18,7 @@
> >>>>  #include <linux/uaccess.h>
> >>>>  #include <linux/io.h>
> >>>>  #include <linux/vgaarb.h>
> >>>> +#include <linux/vfio.h>
> >>>>  
> >>>>  #include "vfio_pci_private.h"
> >>>>  
> >>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>>> index 0ecae0b1cd34..431b824b0d3e 100644
> >>>> --- a/include/linux/vfio.h
> >>>> +++ b/include/linux/vfio.h
> >>>> @@ -18,6 +18,13 @@
> >>>>  #include <linux/poll.h>
> >>>>  #include <uapi/linux/vfio.h>
> >>>>  
> >>>> +#define VFIO_PCI_OFFSET_SHIFT   40
> >>>> +
> >>>> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
> >>>> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> >>>> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> >>>> +
> >>>> +    
> >>>
> >>> Nak this, I'm not interested in making this any sort of ABI.
> >>>     
> >>
> >> These macros are used by drivers/vfio/pci/vfio_pci.c and
> >> drivers/vfio/mdev/vfio_mpci.c and to use those in both these modules,
> >> they should be moved to common place as you suggested in earlier
> >> reviews. I think this is better common place. Are there any other
> >> suggestion?  
> > 
> > They're only used in ways that I objected to above and you've agreed
> > to.  These define implementation details that must not become part of
> > the mediated vendor driver ABI.  A vendor driver is free to redefine
> > this the same if they want, but as we can see with how easily they slip
> > into code where they don't belong, the only way to make sure they don't
> > become ABI is to keep them in private headers.
> >    
> 
> Then I think, I can't use these macros in mdev modules, they are defined
> in drivers/vfio/pci/vfio_pci_private.h
> I have to define similar macros in drivers/vfio/mdev/mdev_private.h?
> 
> parent->ops->get_region_info() is called from vfio_mpci_open() that is
> before PCI config space is setup. Main expectation from
> get_region_info() was to get flags and size. At this point of time
> vendor driver also don't know about the base addresses of regions.
> 
>     case VFIO_DEVICE_GET_REGION_INFO:
> ...
> 
>         info.offset = vmdev->vfio_region_info[info.index].offset;
> 
> In that case, as suggested in previous reply, above is not going to work.
> I'll define such macros in drivers/vfio/mdev/mdev_private.h, set above
> offset according to these macros. Then on first access to any BAR
> region, i.e. after PCI config space is populated, call
> parent->ops->get_region_info() again so that
> vfio_region_info[index].offset for all regions are set by vendor driver.
> Then use these offsets to calculate 'pos' for
> read/write/validate_map_request(). Does this seems reasonable?

This doesn't make any sense to me, there should be absolutely no reason
for the mid-layer mediated device infrastructure to impose region
offsets.  vfio-pci is a leaf driver, like the mediated vendor driver.
Only the leaf drivers can define how they layout the offsets within the
device file descriptor.  Being a VFIO_PCI device only defines region
indexes to resources, not offsets (ie. region 0 is BAR0, region 1 is
BAR1,... region 7 is PCI config space).  If this mid-layer even needs
to know region offsets, then caching them on opening the vendor device
is certainly sufficient.  Remember we're talking about the offset into
the vfio device file descriptor, how that potentially maps onto a
physical MMIO space later doesn't matter here.  It seems like maybe
we're confusing those points.  Anyway, the more I hear about needing to
reproduce these INDEX/OFFSET translation macros in places they
shouldn't be used, the more confident I am in keeping them private.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 3/4] vfio iommu: Add support for mediated devices
  2016-08-11 14:22       ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-11 16:28         ` Alex Williamson
  -1 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-11 16:28 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Thu, 11 Aug 2016 19:52:06 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Thanks Alex. I'll take care of suggested nits and rename structures and
> function.
> 
> On 8/10/2016 12:30 AM, Alex Williamson wrote:
> > On Thu, 4 Aug 2016 00:33:53 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >  
> ...
> 
> >>
> >> +/*
> >> + * Pin a set of guest PFNs and return their associated host PFNs for  
> mediated
> >> + * domain only.  
> >
> > Why only mediated domain?  What assumption is specific to a mediated
> > domain other than unnecessarily passing an mdev_device?
> >  
> >> + * @user_pfn [in]: array of user/guest PFNs
> >> + * @npage [in]: count of array elements
> >> + * @prot [in] : protection flags
> >> + * @phys_pfn[out] : array of host PFNs
> >> + */
> >> +long vfio_pin_pages(struct mdev_device *mdev, unsigned long *user_pfn,  
> >
> > Why use and mdev_device here?  We only reference the struct device to
> > get the drvdata.  (dev also not listed above in param description)
> >  
> 
> Ok.
> 
> >> +		    long npage, int prot, unsigned long *phys_pfn)
> >> +{
> >> +	struct vfio_device *device;
> >> +	struct vfio_container *container;
> >> +	struct vfio_iommu_driver *driver;
> >> +	ssize_t ret = -EINVAL;
> >> +
> >> +	if (!mdev || !user_pfn || !phys_pfn)
> >> +		return -EINVAL;
> >> +
> >> +	device = dev_get_drvdata(&mdev->dev);
> >> +
> >> +	if (!device || !device->group)
> >> +		return -EINVAL;
> >> +
> >> +	container = device->group->container;  
> >
> > This doesn't seem like a valid way to get a reference to the container
> > and in fact there is no reference at all.  I think you need to use
> > vfio_device_get_from_dev(), check and increment container_users around
> > the callback, abort on noiommu groups, and check for viability.
> >  
> 
> Thanks for pointing that out. I'll change it as suggested.
> 
> >
> >
> > I see how you're trying to only do accounting when there is only an
> > mdev (local) domain, but the devices attached to the normal iommu API
> > domain can go away at any point.  Where do we re-establish accounting
> > should the pinning from those devices be removed?  I don't see that as
> > being an optional support case since userspace can already do this.
> >  
> 
> I missed this case. So in that case, when
> vfio_iommu_type1_detach_group() for iommu group for that device is
> called and it is the last entry in iommu capable domain_list, it should
> re-iterate through pfn_list of mediated_domain and do the accounting,
> right? Then we also have to update accounting when iommu capable device
> is hotplugged while mediated_domain already exist.

Yes, so pages are going to get pinned once for the iommu api domain and
once for the mediated domain, and accounting needs to be updated for
those domains coming and going in any order.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread


* Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
  2016-08-11 16:24             ` [Qemu-devel] " Alex Williamson
@ 2016-08-11 17:46               ` Kirti Wankhede
  -1 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-11 17:46 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi



On 8/11/2016 9:54 PM, Alex Williamson wrote:
> On Thu, 11 Aug 2016 21:29:35 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 8/11/2016 4:30 AM, Alex Williamson wrote:
>>> On Thu, 11 Aug 2016 02:53:10 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 8/10/2016 12:30 AM, Alex Williamson wrote:  
>>>>> On Thu, 4 Aug 2016 00:33:52 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>     
>>>>
>>>> ...
>>>>>>  #include "vfio_pci_private.h"
>>>>>>  
>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>>>> index 0ecae0b1cd34..431b824b0d3e 100644
>>>>>> --- a/include/linux/vfio.h
>>>>>> +++ b/include/linux/vfio.h
>>>>>> @@ -18,6 +18,13 @@
>>>>>>  #include <linux/poll.h>
>>>>>>  #include <uapi/linux/vfio.h>
>>>>>>  
>>>>>> +#define VFIO_PCI_OFFSET_SHIFT   40
>>>>>> +
>>>>>> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
>>>>>> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
>>>>>> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
>>>>>> +
>>>>>> +    
>>>>>
>>>>> Nak this, I'm not interested in making this any sort of ABI.
>>>>>     
>>>>
>>>> These macros are used by drivers/vfio/pci/vfio_pci.c and
>>>> drivers/vfio/mdev/vfio_mpci.c and to use those in both these modules,
>>>> they should be moved to common place as you suggested in earlier
>>>> reviews. I think this is better common place. Are there any other
>>>> suggestion?  
>>>
>>> They're only used in ways that I objected to above and you've agreed
>>> to.  These define implementation details that must not become part of
>>> the mediated vendor driver ABI.  A vendor driver is free to redefine
>>> this the same if they want, but as we can see with how easily they slip
>>> into code where they don't belong, the only way to make sure they don't
>>> become ABI is to keep them in private headers.
>>>    
>>
>> Then I think, I can't use these macros in mdev modules, they are defined
>> in drivers/vfio/pci/vfio_pci_private.h
>> I have to define similar macros in drivers/vfio/mdev/mdev_private.h?
>>
>> parent->ops->get_region_info() is called from vfio_mpci_open() that is
>> before PCI config space is setup. Main expectation from
>> get_region_info() was to get flags and size. At this point of time
>> vendor driver also don't know about the base addresses of regions.
>>
>>     case VFIO_DEVICE_GET_REGION_INFO:
>> ...
>>
>>         info.offset = vmdev->vfio_region_info[info.index].offset;
>>
>> In that case, as suggested in previous reply, above is not going to work.
>> I'll define such macros in drivers/vfio/mdev/mdev_private.h, set above
>> offset according to these macros. Then on first access to any BAR
>> region, i.e. after PCI config space is populated, call
>> parent->ops->get_region_info() again so that
>> vfio_region_info[index].offset for all regions are set by vendor driver.
>> Then use these offsets to calculate 'pos' for
>> read/write/validate_map_request(). Does this seems reasonable?
> 
> This doesn't make any sense to me, there should be absolutely no reason
> for the mid-layer mediated device infrastructure to impose region
> offsets.  vfio-pci is a leaf driver, like the mediated vendor driver.
> Only the leaf drivers can define how they layout the offsets within the
> device file descriptor.  Being a VFIO_PCI device only defines region
> indexes to resources, not offsets (ie. region 0 is BAR0, region 1 is
> BAR1,... region 7 is PCI config space).  If this mid-layer even needs
> to know region offsets, then caching them on opening the vendor device
> is certainly sufficient.  Remember we're talking about the offset into
> the vfio device file descriptor, how that potentially maps onto a
> physical MMIO space later doesn't matter here.  It seems like maybe
> we're confusing those points.  Anyway, the more I hear about needing to
> reproduce these INDEX/OFFSET translation macros in places they
> shouldn't be used, the more confident I am in keeping them private.

If the vendor driver defines the offsets into the vfio device file
descriptor, it will be the vendor driver's responsibility to ensure that
the ranges it defines (offset to offset + size) do not overlap with
other regions' ranges. There will be no validation in vfio-mpci, right?

In the current implementation there is a provision that if the
validate_map_request() callback is not provided, the fault is mapped to
the physical device's region, and the start of the physical device's
BAR address is queried using pci_resource_start(). With the change you
are proposing, the index can no longer be extracted from the offset, so
if the vendor driver doesn't provide validate_map_request(), the fault
handler should return SIGBUS.
That imposes an indirect requirement: if the vendor driver sets
VFIO_REGION_INFO_FLAG_MMAP for any region, it must provide
validate_map_request().

Thanks,
Kirti.

> Thanks,
> 
> Alex
> 


^ permalink raw reply	[flat|nested] 100+ messages in thread


* Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
  2016-08-11 17:46               ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-11 18:43                 ` Alex Williamson
  -1 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-11 18:43 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Thu, 11 Aug 2016 23:16:06 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 8/11/2016 9:54 PM, Alex Williamson wrote:
> > On Thu, 11 Aug 2016 21:29:35 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 8/11/2016 4:30 AM, Alex Williamson wrote:  
> >>> On Thu, 11 Aug 2016 02:53:10 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 8/10/2016 12:30 AM, Alex Williamson wrote:    
> >>>>> On Thu, 4 Aug 2016 00:33:52 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>       
> >>>>
> >>>> ...  
> >>>>>>  #include "vfio_pci_private.h"
> >>>>>>  
> >>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>>>>> index 0ecae0b1cd34..431b824b0d3e 100644
> >>>>>> --- a/include/linux/vfio.h
> >>>>>> +++ b/include/linux/vfio.h
> >>>>>> @@ -18,6 +18,13 @@
> >>>>>>  #include <linux/poll.h>
> >>>>>>  #include <uapi/linux/vfio.h>
> >>>>>>  
> >>>>>> +#define VFIO_PCI_OFFSET_SHIFT   40
> >>>>>> +
> >>>>>> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
> >>>>>> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> >>>>>> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> >>>>>> +
> >>>>>> +      
> >>>>>
> >>>>> Nak this, I'm not interested in making this any sort of ABI.
> >>>>>       
> >>>>
> >>>> These macros are used by drivers/vfio/pci/vfio_pci.c and
> >>>> drivers/vfio/mdev/vfio_mpci.c and to use those in both these modules,
> >>>> they should be moved to common place as you suggested in earlier
> >>>> reviews. I think this is better common place. Are there any other
> >>>> suggestion?    
> >>>
* Re: [Qemu-devel] [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
@ 2016-08-11 18:43                 ` Alex Williamson
  0 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-11 18:43 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Thu, 11 Aug 2016 23:16:06 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 8/11/2016 9:54 PM, Alex Williamson wrote:
> > On Thu, 11 Aug 2016 21:29:35 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 8/11/2016 4:30 AM, Alex Williamson wrote:  
> >>> On Thu, 11 Aug 2016 02:53:10 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 8/10/2016 12:30 AM, Alex Williamson wrote:    
> >>>>> On Thu, 4 Aug 2016 00:33:52 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>       
> >>>>
> >>>> ...  
> >>>>>>  #include "vfio_pci_private.h"
> >>>>>>  
> >>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>>>>> index 0ecae0b1cd34..431b824b0d3e 100644
> >>>>>> --- a/include/linux/vfio.h
> >>>>>> +++ b/include/linux/vfio.h
> >>>>>> @@ -18,6 +18,13 @@
> >>>>>>  #include <linux/poll.h>
> >>>>>>  #include <uapi/linux/vfio.h>
> >>>>>>  
> >>>>>> +#define VFIO_PCI_OFFSET_SHIFT   40
> >>>>>> +
> >>>>>> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
> >>>>>> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> >>>>>> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> >>>>>> +
> >>>>>> +      
> >>>>>
> >>>>> Nak this, I'm not interested in making this any sort of ABI.
> >>>>>       
> >>>>
> >>>> These macros are used by drivers/vfio/pci/vfio_pci.c and
> >>>> drivers/vfio/mdev/vfio_mpci.c, and to use them in both these modules
> >>>> they should be moved to a common place, as you suggested in earlier
> >>>> reviews. I think this is a better common place. Are there any other
> >>>> suggestions?
> >>>
> >>> They're only used in ways that I objected to above and you've agreed
> >>> to.  These define implementation details that must not become part of
> >>> the mediated vendor driver ABI.  A vendor driver is free to redefine
> >>> the same thing if they want, but as we can see from how easily they slip
> >>> into code where they don't belong, the only way to make sure they don't
> >>> become ABI is to keep them in private headers.
> >>>      
> >>
> >> Then I think I can't use these macros in the mdev modules, since they
> >> are defined in drivers/vfio/pci/vfio_pci_private.h.
> >> Do I have to define similar macros in drivers/vfio/mdev/mdev_private.h?
> >>
> >> parent->ops->get_region_info() is called from vfio_mpci_open(), which
> >> runs before PCI config space is set up. The main expectation from
> >> get_region_info() was to get the flags and size. At that point the
> >> vendor driver also doesn't know the base addresses of the regions.
> >>
> >>     case VFIO_DEVICE_GET_REGION_INFO:
> >> ...
> >>
> >>         info.offset = vmdev->vfio_region_info[info.index].offset;
> >>
> >> In that case, as suggested in the previous reply, the above is not going
> >> to work. I'll define such macros in drivers/vfio/mdev/mdev_private.h and
> >> set the above offset according to those macros. Then on first access to
> >> any BAR region, i.e. after PCI config space is populated, call
> >> parent->ops->get_region_info() again so that
> >> vfio_region_info[index].offset for all regions is set by the vendor
> >> driver. Then use these offsets to calculate 'pos' for
> >> read/write/validate_map_request(). Does this seem reasonable?
> > 
> > This doesn't make any sense to me, there should be absolutely no reason
> > for the mid-layer mediated device infrastructure to impose region
> > offsets.  vfio-pci is a leaf driver, like the mediated vendor driver.
> > Only the leaf drivers can define how they lay out the offsets within the
> > device file descriptor.  Being a VFIO_PCI device only defines region
> > indexes to resources, not offsets (i.e. region 0 is BAR0, region 1 is
> > BAR1,... region 7 is PCI config space).  If this mid-layer even needs
> > to know region offsets, then caching them on opening the vendor device
> > is certainly sufficient.  Remember we're talking about the offset into
> > the vfio device file descriptor; how that potentially maps onto a
> > physical MMIO space later doesn't matter here.  It seems like maybe
> > we're confusing those points.  Anyway, the more I hear about needing to
> > reproduce these INDEX/OFFSET translation macros in places they
> > shouldn't be used, the more confident I am in keeping them private.  
> 
> If the vendor driver defines the offsets into the vfio device file
> descriptor, it will be the vendor driver's responsibility to ensure that
> the defined ranges (offset to offset + size) do not overlap with other
> regions' ranges. There will be no validation in vfio-mpci, right?

Right, this seems like a pretty basic requirement of the vendor driver
to offer region ranges that do not overlap, and there's plenty else
about the vendor driver's behavior that the mid-layer can't
validate...

> In the current implementation there is a provision that if the
> validate_map_request() callback is not provided, the region is mapped to
> the physical device's region, and the start of the physical device's BAR
> address is queried using pci_resource_start(). With the change you are
> proposing, the index can no longer be extracted from the offset, so if the
> vendor driver doesn't provide validate_map_request(), the fault handler
> returns SIGBUS.
> That imposes an indirect requirement that if a vendor driver sets
> VFIO_REGION_INFO_FLAG_MMAP for any region, it should provide
> validate_map_request().

TBH, I don't see how providing a default implementation of
validate_map_request() is useful.  How many mediated devices are going
to want to identity map resources from the parent?  Even if they do, it
seems we can only support a single mediated device per parent device
since each will map the same parent resource offset. Let's not even try
to define a default.  If we get a fault and the vendor driver hasn't
provided a handler, send a SIGBUS.  I expect we should also allow
vendor drivers to fill the mapping at mmap() time rather than expecting
this map on fault scheme.  Maybe the mid-level driver should not even be
interacting with mmap() and should let the vendor driver entirely
determine the handling.

For the most part these mid-level drivers, like mediated pci, should be
as thin as possible, and to some extent I wonder if we need them at
all.  We mostly want user interaction with the vfio device file
descriptor to pass directly to the vendor driver and we should only be
adding logic to the mid-level driver when it actually provides some
useful and generic simplification to the vendor driver.  Things like
this default fault handling scheme don't appear to be generic at all;
it's actually a very specific use case, I think.  For the most part
I think the mediated interface is just a shim to standardize the
lifecycle of a mediated device for management purposes,
integrate "fake/virtual" devices into the vfio infrastructure, and
provide common page tracking, pinning, and mapping services, but
the device interface itself should mostly just pass the
vfio device API straight through to the vendor driver.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
  2016-08-11 18:43                 ` [Qemu-devel] " Alex Williamson
@ 2016-08-12 17:57                   ` Kirti Wankhede
  -1 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-12 17:57 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi



On 8/12/2016 12:13 AM, Alex Williamson wrote:

> 
> TBH, I don't see how providing a default implementation of
> validate_map_request() is useful.  How many mediated devices are going
> to want to identity map resources from the parent?  Even if they do, it
> seems we can only support a single mediated device per parent device
> since each will map the same parent resource offset. Let's not even try
> to define a default.  If we get a fault and the vendor driver hasn't
> provided a handler, send a SIGBUS.  I expect we should also allow
> vendor drivers to fill the mapping at mmap() time rather than expecting
> this map on fault scheme.  Maybe the mid-level driver should not even be
> interacting with mmap() and should let the vendor driver entirely
> determine the handling.
>

Should we go ahead and pass the mmap() call through to the vendor driver,
letting the vendor driver decide what to do in mmap(): either
remap_pfn_range() in mmap(), or fault on access and handle the fault in
their driver? In that case we don't need to track mappings in the mdev
core; the vendor driver would do that on their own, right?



> For the most part these mid-level drivers, like mediated pci, should be
> as thin as possible, and to some extent I wonder if we need them at
> all.  We mostly want user interaction with the vfio device file
> descriptor to pass directly to the vendor driver and we should only be
> adding logic to the mid-level driver when it actually provides some
> useful and generic simplification to the vendor driver.  Things like
> this default fault handling scheme don't appear to be generic at all,
> it's actually a very unique use case I think.  For the most part
> I think the mediated interface is just a shim to standardize the
> lifecycle of a mediated device for management purposes,
> integrate "fake/virtual" devices into the vfio infrastructure,
> provide common page tracking, pinning and mapping services, but
> the device interface itself should mostly just pass through the
> vfio device API straight through to the vendor driver.  Thanks,
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-09 19:00     ` [Qemu-devel] " Alex Williamson
@ 2016-08-12 18:44       ` Kirti Wankhede
  -1 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-12 18:44 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi



On 8/10/2016 12:30 AM, Alex Williamson wrote:
> On Thu, 4 Aug 2016 00:33:51 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> This is used later by mdev_device_start() and mdev_device_stop() to get
> the parent_device so it can call the start and stop ops callbacks
> respectively.  That seems to imply that all of instances for a given
> uuid come from the same parent_device.  Where is that enforced?  I'm
> still having a hard time buying into the uuid+instance plan when it
> seems like each mdev_device should have an actual unique uuid.
> Userspace tools can figure out which uuids to start for a given user, I
> don't see much value in collecting them to instances within a uuid.
> 

Initially we started the discussion with the VM_UUID+instance suggestion,
where 'instance' was introduced to support multiple devices in a VM.
'mdev_create' creates a device and 'mdev_start' commits the resources of
all instances of similar devices assigned to a VM.

For example, to create 2 devices:
# echo "$UUID:0:params" > /sys/devices/../mdev_create
# echo "$UUID:1:params" > /sys/devices/../mdev_create

"$UUID-0" and "$UUID-1" devices are created.

Commit resources for above devices with single 'mdev_start':
# echo "$UUID" > /sys/class/mdev/mdev_start

Considering $UUID to be a unique UUID of a device, we don't need
'instance', so 'mdev_create' would look like:

# echo "$UUID1:params" > /sys/devices/../mdev_create
# echo "$UUID2:params" > /sys/devices/../mdev_create

where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
would be vendor specific parameters.

Device nodes would be created as "$UUID1" and "$UUID2".

Then 'mdev_start' would be:
# echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start

Similarly 'mdev_stop' and 'mdev_destroy' would be:

# echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop

and

# echo "$UUID1" > /sys/devices/../mdev_destroy
# echo "$UUID2" > /sys/devices/../mdev_destroy

Does this seem reasonable?

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-12 18:44       ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-12 21:16         ` Alex Williamson
  -1 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-12 21:16 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Sat, 13 Aug 2016 00:14:39 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 8/10/2016 12:30 AM, Alex Williamson wrote:
> > On Thu, 4 Aug 2016 00:33:51 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > 
> > This is used later by mdev_device_start() and mdev_device_stop() to get
> > the parent_device so it can call the start and stop ops callbacks
> > respectively.  That seems to imply that all of instances for a given
> > uuid come from the same parent_device.  Where is that enforced?  I'm
> > still having a hard time buying into the uuid+instance plan when it
> > seems like each mdev_device should have an actual unique uuid.
> > Userspace tools can figure out which uuids to start for a given user, I
> > don't see much value in collecting them to instances within a uuid.
> >   
> 
> Initially we started the discussion with the VM_UUID+instance suggestion,
> where 'instance' was introduced to support multiple devices in a VM.

The instance number was never required in order to support multiple
devices in a VM; IIRC this UUID+instance scheme was to appease NVIDIA
management tools, which wanted to re-use the VM UUID by creating vGPU
devices with that same UUID and therefore associate udev events with a
given VM.  Only then does an instance number become necessary, since the
UUID needs to be static for the vGPUs within a VM.  This has always felt
like a very dodgy solution when we should probably just be querying
libvirt to give us a device-to-VM association.

> 'mdev_create' creates a device and 'mdev_start' commits the resources of
> all instances of similar devices assigned to a VM.
> 
> For example, to create 2 devices:
> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> 
> "$UUID-0" and "$UUID-1" devices are created.
> 
> Commit resources for above devices with single 'mdev_start':
> # echo "$UUID" > /sys/class/mdev/mdev_start
> 
> Considering $UUID to be a unique UUID of a device, we don't need
> 'instance', so 'mdev_create' would look like:
> 
> # echo "$UUID1:params" > /sys/devices/../mdev_create
> # echo "$UUID2:params" > /sys/devices/../mdev_create
> 
> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> would be vendor specific parameters.
> 
> Device nodes would be created as "$UUID1" and "$UUID2".
> 
> Then 'mdev_start' would be:
> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> 
> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> 
> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop

I'm not sure a comma-separated list makes sense here; for both
simplicity in the kernel and more fine-grained error reporting, we
probably want to start/stop them individually.  Actually, why is it
that we can't use the mediated device being opened and released to
automatically signal the backend vendor driver to commit and release
resources?  I don't fully understand why userspace needs this interface.

> and
> 
> # echo "$UUID1" > /sys/devices/../mdev_destroy
> # echo "$UUID2" > /sys/devices/../mdev_destroy
> 
> Does this seem reasonable?

I've been hoping we could drop the instance numbers and create actual
unique UUIDs per mediated device for a while ;)  Thanks,

Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
  2016-08-12 17:57                   ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-12 21:25                     ` Alex Williamson
  -1 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-12 21:25 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Fri, 12 Aug 2016 23:27:01 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 8/12/2016 12:13 AM, Alex Williamson wrote:
> 
> > 
> > TBH, I don't see how providing a default implementation of
> > validate_map_request() is useful.  How many mediated devices are going
> > to want to identity map resources from the parent?  Even if they do, it
> > seems we can only support a single mediated device per parent device
> > since each will map the same parent resource offset. Let's not even try
> > to define a default.  If we get a fault and the vendor driver hasn't
> > provided a handler, send a SIGBUS.  I expect we should also allow
> > vendor drivers to fill the mapping at mmap() time rather than expecting
> > this map on fault scheme.  Maybe the mid-level driver should not even be
> > interacting with mmap() and should let the vendor driver entirely
> > determine the handling.
> >  
> 
> Should we go ahead and pass the mmap() call through to the vendor driver,
> letting the vendor driver decide what to do in mmap(): either
> remap_pfn_range() in mmap(), or fault on access and handle the fault in
> their driver? In that case we don't need to track mappings in the mdev
> core; the vendor driver would do that on their own, right?

This sounds right to me; I don't think we want to impose either model
on the vendor driver.  The vendor driver owns the vfio device file
descriptor and is responsible for managing it, should they expose mmap
support for regions on the file descriptor.  They either need to insert
mappings at the point where mmap() is called or set up fault handlers to
insert them on demand.  If we can provide helper functions so that each
vendor driver doesn't need to re-invent either of those, that would be
a bonus.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [Qemu-devel] [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
@ 2016-08-12 21:25                     ` Alex Williamson
  0 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-12 21:25 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Fri, 12 Aug 2016 23:27:01 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 8/12/2016 12:13 AM, Alex Williamson wrote:
> 
> > 
> > TBH, I don't see how providing a default implementation of
> > validate_map_request() is useful.  How many mediated devices are going
> > to want to identity map resources from the parent?  Even if they do, it
> > seems we can only support a single mediated device per parent device
> > since each will map the same parent resource offset. Let's not even try
> > to define a default.  If we get a fault and the vendor driver hasn't
> > provided a handler, send a SIGBUS.  I expect we should also allow
> > vendor drivers to fill the mapping at mmap() time rather than expecting
> > this map on fault scheme.  Maybe the mid-level driver should not even be
> > interacting with mmap() and should let the vendor driver entirely
> > determine the handling.
> >  
> 
> Should we go ahead with pass through mmap() call to vendor driver and
> let vendor driver decide what to do in mmap() call, either
> remap_pfn_range in mmap() or do fault on access and handle the fault in
> their driver. In that case we don't need to track mappings in mdev core.
> Let vendor driver do that on their own, right?

This sounds right to me, I don't think we want to impose either model
on the vendor driver.  The vendor driver owns the vfio device file
descriptor and is responsible for managing it should they expose mmap
support for regions on the file descriptor.  They either need to insert
mappings at the point where mmap() is called or setup fault handlers to
insert them on demand.  If we can provide helper functions so that each
vendor driver doesn't need to re-invent either of those, that would be
a bonus.  Thanks,

Alex
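[The two models described above — inserting mappings eagerly at mmap() time versus lazily from a fault handler — could be sketched roughly as below in a vendor driver. This is a pseudocode-level kernel C sketch against the ~4.7-era APIs discussed in this thread, and the lookup helpers (mdev_region_pfn(), mdev_pfn_for_fault()) are hypothetical names, not part of the proposed mdev core:

```c
/* Model 1: eager -- map the whole region when userspace calls mmap() */
static int my_vendor_mmap(struct vm_area_struct *vma)
{
	unsigned long pfn = mdev_region_pfn(vma);	/* hypothetical lookup */

	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	return remap_pfn_range(vma, vma->vm_start, pfn,
			       vma->vm_end - vma->vm_start, vma->vm_page_prot);
}

/* Model 2: lazy -- install PTEs on demand from a fault handler */
static int my_vendor_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	unsigned long pfn = mdev_pfn_for_fault(vma, vmf);	/* hypothetical */

	if (vm_insert_pfn(vma, (unsigned long)vmf->virtual_address, pfn))
		return VM_FAULT_SIGBUS;	/* no valid mapping -> SIGBUS, as above */
	return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct my_vendor_vm_ops = {
	.fault = my_vendor_fault,
};
```

In model 2 the vendor driver's mmap handler would only set vma->vm_ops = &my_vendor_vm_ops; in model 1 it does the remap directly and needs no fault handler.]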

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-12 21:16         ` [Qemu-devel] " Alex Williamson
@ 2016-08-13  0:37           ` Kirti Wankhede
  -1 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-13  0:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi



On 8/13/2016 2:46 AM, Alex Williamson wrote:
> On Sat, 13 Aug 2016 00:14:39 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 8/10/2016 12:30 AM, Alex Williamson wrote:
>>> On Thu, 4 Aug 2016 00:33:51 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>
>>> This is used later by mdev_device_start() and mdev_device_stop() to get
>>> the parent_device so it can call the start and stop ops callbacks
>>> respectively.  That seems to imply that all of instances for a given
>>> uuid come from the same parent_device.  Where is that enforced?  I'm
>>> still having a hard time buying into the uuid+instance plan when it
>>> seems like each mdev_device should have an actual unique uuid.
>>> Userspace tools can figure out which uuids to start for a given user, I
>>> don't see much value in collecting them to instances within a uuid.
>>>   
>>
>> Initially we started discussion with VM_UUID+instance suggestion, where
>> instance was introduced to support multiple devices in a VM.
> 
> The instance number was never required in order to support multiple
> devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> management tools which wanted to re-use the VM UUID by creating vGPU
> devices with that same UUID and therefore associate udev events to a
> given VM.  Only then does an instance number become necessary since the
> UUID needs to be static for a vGPUs within a VM.  This has always felt
> like a very dodgy solution when we should probably just be querying
> libvirt to give us a device to VM association.
> 
>> 'mdev_create' creates device and 'mdev_start' is to commit resources of
>> all instances of similar devices assigned to VM.
>>
>> For example, to create 2 devices:
>> # echo "$UUID:0:params" > /sys/devices/../mdev_create
>> # echo "$UUID:1:params" > /sys/devices/../mdev_create
>>
>> "$UUID-0" and "$UUID-1" devices are created.
>>
>> Commit resources for above devices with single 'mdev_start':
>> # echo "$UUID" > /sys/class/mdev/mdev_start
>>
>> Considering $UUID to be a unique UUID of a device, we don't need
>> 'instance', so 'mdev_create' would look like:
>>
>> # echo "$UUID1:params" > /sys/devices/../mdev_create
>> # echo "$UUID2:params" > /sys/devices/../mdev_create
>>
>> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
>> would be vendor specific parameters.
>>
>> Device nodes would be created as "$UUID1" and "$UUID2"
>>
>> Then 'mdev_start' would be:
>> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
>>
>> Similarly 'mdev_stop' and 'mdev_destroy' would be:
>>
>> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop
> 
> I'm not sure a comma separated list makes sense here, for both
> simplicity in the kernel and more fine grained error reporting, we
> probably want to start/stop them individually.  Actually, why is it
> that we can't use the mediated device being opened and released to
> automatically signal to the backend vendor driver to commit and release
> resources? I don't fully understand why userspace needs this interface.
> 

For the NVIDIA vGPU solution we need to know all devices assigned to a VM
in one shot, so we can commit the resources of all vGPUs assigned to that
VM along with some common resources.

For the start callback, I can pass the list of UUIDs as-is to the vendor
driver. Let the vendor driver decide whether to iterate over each device
and commit resources, or do it in one shot.

Thanks,
Kirti

>> and
>>
>> # echo "$UUID1" > /sys/devices/../mdev_destroy
>> # echo "$UUID2" > /sys/devices/../mdev_destroy
>>
>> Does this seems reasonable?
> 
> I've been hoping we could drop the instance numbers and create actual
> unique UUIDs per mediated device for a while ;)  Thanks,
> 
> Alex
> 
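[The "$UUID:params" write format proposed above implies a simple split at the first colon, with a fixed-width textual UUID in front. A minimal user-space sketch of such a parser — the function name and buffer handling are illustrative only, not the actual mdev core code:

```c
#include <stddef.h>
#include <string.h>

/* Split "UUID:params" into its two halves; returns 0 on success, -1 on
 * malformed input.  A textual UUID is exactly 36 characters
 * (8-4-4-4-12 hex groups plus four hyphens). */
int parse_mdev_create(const char *buf, char *uuid, size_t ulen,
		      char *params, size_t plen)
{
	const char *sep = strchr(buf, ':');
	size_t n;

	if (!sep)
		return -1;
	n = (size_t)(sep - buf);
	/* reject a wrong-length UUID or an output buffer too small */
	if (n != 36 || ulen <= n || strlen(sep + 1) >= plen)
		return -1;
	memcpy(uuid, buf, n);
	uuid[n] = '\0';
	strcpy(params, sep + 1);
	return 0;
}
```

The kernel side would additionally convert the UUID text to binary (uuid_le_to_bin() in this series) and pass the params string to the vendor driver unparsed.]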

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device
  2016-08-12 21:25                     ` [Qemu-devel] " Alex Williamson
@ 2016-08-13  0:42                       ` Kirti Wankhede
  -1 siblings, 0 replies; 100+ messages in thread
From: Kirti Wankhede @ 2016-08-13  0:42 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel, pbonzini, bjsdjshi



On 8/13/2016 2:55 AM, Alex Williamson wrote:
> On Fri, 12 Aug 2016 23:27:01 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 8/12/2016 12:13 AM, Alex Williamson wrote:
>>
>>>
>>> TBH, I don't see how providing a default implementation of
>>> validate_map_request() is useful.  How many mediated devices are going
>>> to want to identity map resources from the parent?  Even if they do, it
>>> seems we can only support a single mediated device per parent device
>>> since each will map the same parent resource offset. Let's not even try
>>> to define a default.  If we get a fault and the vendor driver hasn't
>>> provided a handler, send a SIGBUS.  I expect we should also allow
>>> vendor drivers to fill the mapping at mmap() time rather than expecting
>>> this map on fault scheme.  Maybe the mid-level driver should not even be
>>> interacting with mmap() and should let the vendor driver entirely
>>> determine the handling.
>>>  
>>
>> Should we go ahead with pass through mmap() call to vendor driver and
>> let vendor driver decide what to do in mmap() call, either
>> remap_pfn_range in mmap() or do fault on access and handle the fault in
>> their driver. In that case we don't need to track mappings in mdev core.
>> Let vendor driver do that on their own, right?
> 
> This sounds right to me, I don't think we want to impose either model
> on the vendor driver.  The vendor driver owns the vfio device file
> descriptor and is responsible for managing it should they expose mmap
> support for regions on the file descriptor.  They either need to insert
> mappings at the point where mmap() is called or setup fault handlers to
> insert them on demand.  If we can provide helper functions so that each
> vendor driver doesn't need to re-invent either of those, that would be
> a bonus.  Thanks,
> 

Since mmap() is going to be handled in the vendor driver, let the vendor
driver do its own tracking of mappings based on which way it decides to
go. No need to keep it in the mdev core module and try to handle all the
cases in one function.

Thanks,
Kirti


> Alex
> 

^ permalink raw reply	[flat|nested] 100+ messages in thread

* RE: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-05  6:13       ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-15  9:15         ` Tian, Kevin
  -1 siblings, 0 replies; 100+ messages in thread
From: Tian, Kevin @ 2016-08-15  9:15 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Song, Jike, bjsdjshi

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Friday, August 05, 2016 2:13 PM
> 
> On 8/4/2016 12:51 PM, Tian, Kevin wrote:
> >> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> >> Sent: Thursday, August 04, 2016 3:04 AM
> >>
> >>
> >> 2. Physical device driver interface
> >> This interface provides vendor driver the set APIs to manage physical
> >> device related work in their own driver. APIs are :
> >> - supported_config: provide supported configuration list by the vendor
> >> 		    driver
> >> - create: to allocate basic resources in vendor driver for a mediated
> >> 	  device.
> >> - destroy: to free resources in vendor driver when mediated device is
> >> 	   destroyed.
> >> - reset: to free and reallocate resources in vendor driver during reboot
> >
> > Currently I saw 'reset' callback only invoked from VFIO ioctl path. Do
> > you think whether it makes sense to expose a sysfs 'reset' node too,
> > similar to what people see under a PCI device node?
> >
> 
> All vendor drivers might not support reset of mdev from sysfs. But those
> who want to support can expose 'reset' node using 'mdev_attr_groups' of
> 'struct parent_ops'.
> 

Yes, this way it works. Just wondering whether it makes sense to expose a
reset sysfs node by default if a reset callback is provided by the vendor
driver. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 100+ messages in thread

* RE: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-13  0:37           ` [Qemu-devel] " Kirti Wankhede
@ 2016-08-15  9:38             ` Tian, Kevin
  -1 siblings, 0 replies; 100+ messages in thread
From: Tian, Kevin @ 2016-08-15  9:38 UTC (permalink / raw)
  To: Kirti Wankhede, Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Song, Jike, bjsdjshi

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Saturday, August 13, 2016 8:37 AM
> 
> 
> 
> On 8/13/2016 2:46 AM, Alex Williamson wrote:
> > On Sat, 13 Aug 2016 00:14:39 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >
> >> On 8/10/2016 12:30 AM, Alex Williamson wrote:
> >>> On Thu, 4 Aug 2016 00:33:51 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>
> >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> >>> the parent_device so it can call the start and stop ops callbacks
> >>> respectively.  That seems to imply that all of instances for a given
> >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> >>> still having a hard time buying into the uuid+instance plan when it
> >>> seems like each mdev_device should have an actual unique uuid.
> >>> Userspace tools can figure out which uuids to start for a given user, I
> >>> don't see much value in collecting them to instances within a uuid.
> >>>
> >>
> >> Initially we started discussion with VM_UUID+instance suggestion, where
> >> instance was introduced to support multiple devices in a VM.
> >
> > The instance number was never required in order to support multiple
> > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > management tools which wanted to re-use the VM UUID by creating vGPU
> > devices with that same UUID and therefore associate udev events to a
> > given VM.  Only then does an instance number become necessary since the
> > UUID needs to be static for a vGPUs within a VM.  This has always felt
> > like a very dodgy solution when we should probably just be querying
> > libvirt to give us a device to VM association.

Agree with Alex here. We'd better not assume that the UUID will be a
VM_UUID for mdev in the basic design; that binds it too tightly to the
NVIDIA management stack.

I'm OK with giving enough flexibility to the various upper-level management
stacks. E.g., instead of the $UUID+INDEX style, would $UUID+STRING provide
a better option, where either UUID or STRING could be optional? The upper
management stack can then choose its own policy to identify an mdev:

a) $UUID only, so each mdev is allocated a unique UUID
b) STRING only, which could be an index (0, 1, 2, ...) or any combination
(vgpu0, vgpu1, etc.)
c) $UUID+STRING, where UUID could be a VM UUID and STRING could be
a numeric index

> >
> >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> >> all instances of similar devices assigned to VM.
> >>
> >> For example, to create 2 devices:
> >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> >>
> >> "$UUID-0" and "$UUID-1" devices are created.
> >>
> >> Commit resources for above devices with single 'mdev_start':
> >> # echo "$UUID" > /sys/class/mdev/mdev_start
> >>
> >> Considering $UUID to be a unique UUID of a device, we don't need
> >> 'instance', so 'mdev_create' would look like:
> >>
> >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> >>
> >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> >> would be vendor specific parameters.
> >>
> >> Device nodes would be created as "$UUID1" and "$UUID2"
> >>
> >> Then 'mdev_start' would be:
> >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> >>
> >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> >>
> >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop
> >
> > I'm not sure a comma separated list makes sense here, for both
> > simplicity in the kernel and more fine grained error reporting, we
> > probably want to start/stop them individually.  Actually, why is it
> > that we can't use the mediated device being opened and released to
> > automatically signal to the backend vendor driver to commit and release
> > resources? I don't fully understand why userspace needs this interface.

There is a meaningful use for the start/stop interface, as required for
live migration support. Such an interface allows the vendor driver to
quiesce mdev activity on the source device before the mdev hardware state
is snapshotted, and then resume mdev activity on the destination device
after its state is recovered. Intel has implemented experimental live
migration support in KVMGT (soon to be released), based on the above two
interfaces (plus another two to get/set mdev state).

> >
> 
> For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> one shot to commit resources of all vGPUs assigned to a VM along with
> some common resources.

Kirti, can you elaborate on the background of the above one-shot commit
requirement? It's hard to understand such a requirement.

As I replied in another mail, I really hope start/stop becomes a per-mdev
attribute instead of a global one, e.g.:

echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start

In many scenarios the user-space client may only want to talk to the mdev
instance directly, without needing to contact its parent device. Again
taking live migration as an example, I don't think QEMU wants to know the
parent device of assigned mdev instances.

Thanks
Kevin
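[A per-mdev start/stop attribute as suggested above would amount to a device attribute on each mdev instance rather than a global class node. A pseudocode-level kernel sketch — the mdev_device layout, to_mdev_device() helper, and parent ops names are assumptions for illustration, not the actual v6 code:

```c
/* /sys/class/mdev/<uuid>/start: write "1" to commit resources, "0" to
 * release them, dispatching to the parent device's vendor callbacks. */
static ssize_t start_store(struct device *dev, struct device_attribute *attr,
			   const char *buf, size_t count)
{
	struct mdev_device *mdev = to_mdev_device(dev);	/* assumed helper */
	bool start;
	int ret;

	if (strtobool(buf, &start))
		return -EINVAL;

	ret = start ? mdev->parent->ops->start(mdev)
		    : mdev->parent->ops->stop(mdev);
	return ret ? ret : count;
}
static DEVICE_ATTR_WO(start);
```

This also gives the fine-grained per-device error reporting Alex asked for, since each write returns the result for exactly one mdev.]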

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-15  9:38             ` [Qemu-devel] " Tian, Kevin
@ 2016-08-15 15:59               ` Alex Williamson
  -1 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-15 15:59 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Song,
	Jike, bjsdjshi

On Mon, 15 Aug 2016 09:38:52 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > Sent: Saturday, August 13, 2016 8:37 AM
> > 
> > 
> > 
> > On 8/13/2016 2:46 AM, Alex Williamson wrote:  
> > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >  
> > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:  
> > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >>>
> > >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> > >>> the parent_device so it can call the start and stop ops callbacks
> > >>> respectively.  That seems to imply that all of instances for a given
> > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > >>> still having a hard time buying into the uuid+instance plan when it
> > >>> seems like each mdev_device should have an actual unique uuid.
> > >>> Userspace tools can figure out which uuids to start for a given user, I
> > >>> don't see much value in collecting them to instances within a uuid.
> > >>>  
> > >>
> > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > >> instance was introduced to support multiple devices in a VM.  
> > >
> > > The instance number was never required in order to support multiple
> > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > devices with that same UUID and therefore associate udev events to a
> > > given VM.  Only then does an instance number become necessary since the
> > > UUID needs to be static for a vGPUs within a VM.  This has always felt
> > > like a very dodgy solution when we should probably just be querying
> > > libvirt to give us a device to VM association.  
> 
> Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> for mdev in the basic design. It's bound to NVIDIA management stack too tightly.
> 
> I'm OK to give enough flexibility for various upper level management stacks,
> e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> option where either UUID or STRING could be optional? Upper management 
> stack can choose its own policy to identify a mdev:
> 
> a) $UUID only, so each mdev is allocated with a unique UUID
> b) STRING only, which could be an index (0, 1, 2, ...), or any combination 
> (vgpu0, vgpu1, etc.)
> c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> a numeric index
> 
> > >  
> > >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> > >> all instances of similar devices assigned to VM.
> > >>
> > >> For example, to create 2 devices:
> > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > >>
> > >> "$UUID-0" and "$UUID-1" devices are created.
> > >>
> > >> Commit resources for above devices with single 'mdev_start':
> > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > >>
> > >> Considering $UUID to be a unique UUID of a device, we don't need
> > >> 'instance', so 'mdev_create' would look like:
> > >>
> > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > >>
> > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > >> would be vendor specific parameters.
> > >>
> > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > >>
> > >> Then 'mdev_start' would be:
> > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > >>
> > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > >>
> > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop  
> > >
> > > I'm not sure a comma separated list makes sense here, for both
> > > simplicity in the kernel and more fine grained error reporting, we
> > > probably want to start/stop them individually.  Actually, why is it
> > > that we can't use the mediated device being opened and released to
> > > automatically signal to the backend vendor driver to commit and release
> > > resources? I don't fully understand why userspace needs this interface.  
> 
> There is a meaningful use of the start/stop interface, as required for live
> migration support. Such an interface allows the vendor driver to quiesce
> mdev activity on the source device before the mdev hardware state is
> snapshotted, and then resume mdev activity on the dest device after its
> state is recovered. Intel has implemented experimental live migration
> support in KVMGT (soon to be released), based on the above two interfaces
> (plus another two to get/set mdev state).

Ok, that's actually an interesting use case for start/stop.

> > 
> > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > one shot to commit resources of all vGPUs assigned to a VM along with
> > some common resources.  
> 
> Kirti, can you elaborate on the background of the above one-shot commit
> requirement? It's hard to understand such a requirement.

Agree, I know NVIDIA isn't planning to support hotplug initially, but
this seems like we're precluding hotplug from the design.  I don't
understand what's driving this one-shot requirement.

> As I replied in another mail, I really hope start/stop becomes a per-mdev
> attribute instead of a global one, e.g.:
> 
> echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> 
> In many scenarios the user space client may only want to talk to the mdev
> instance directly, w/o needing to contact its parent device. Still take
> live migration for example: I don't think Qemu wants to know the parent
> device of assigned mdev instances.

Yep, QEMU won't know the parent device, only libvirt level tools
managing the creation and destruction of the mdev device would know
that.  Perhaps in addition to migration uses we could even use
start/stop for basic power management, device D3 state in the guest
could translate to a stop command to remove that vGPU from scheduling
while still retaining most of the state and resource allocations.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread
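
The two lifecycle styles debated in the message above — one global node taking a comma-separated list of UUIDs versus a per-mdev start attribute — can be contrasted against a mock sysfs tree. Everything below is an illustrative sketch: the paths mimic the proposals but live under a temp directory, and the UUIDs are made up.

```shell
#!/bin/sh
set -eu

# Mock tree standing in for /sys/class/mdev (nothing here touches real sysfs).
SYS=$(mktemp -d)
mkdir -p "$SYS/class/mdev"
: > "$SYS/class/mdev/mdev_start"

UUID1=11111111-1111-1111-1111-111111111111
UUID2=22222222-2222-2222-2222-222222222222

# Style 1: one global node taking a list.  A failure while committing
# resources cannot easily be attributed to a specific device, which is the
# fine-grained error reporting concern raised above.
echo "$UUID1, $UUID2" > "$SYS/class/mdev/mdev_start"

# Style 2: a per-mdev 'start' attribute.  Each write succeeds or fails for
# exactly one device, so errors map directly onto a UUID.
for u in "$UUID1" "$UUID2"; do
    mkdir -p "$SYS/class/mdev/$u"
    echo 1 > "$SYS/class/mdev/$u/start"
done
```

With the per-device form a management tool can report exactly which mdev failed to commit; the list form collapses all devices into a single write.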

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-15  9:38             ` [Qemu-devel] " Tian, Kevin
@ 2016-08-15 19:59               ` Neo Jia
  -1 siblings, 0 replies; 100+ messages in thread
From: Neo Jia @ 2016-08-15 19:59 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, Alex Williamson, pbonzini, kraxel, qemu-devel,
	kvm, Song, Jike, bjsdjshi

On Mon, Aug 15, 2016 at 09:38:52AM +0000, Tian, Kevin wrote:
> > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > Sent: Saturday, August 13, 2016 8:37 AM
> > 
> > 
> > 
> > On 8/13/2016 2:46 AM, Alex Williamson wrote:
> > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >
> > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:
> > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >>>
> > >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> > >>> the parent_device so it can call the start and stop ops callbacks
> > >>> respectively.  That seems to imply that all instances for a given
> > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > >>> still having a hard time buying into the uuid+instance plan when it
> > >>> seems like each mdev_device should have an actual unique uuid.
> > >>> Userspace tools can figure out which uuids to start for a given user, I
> > >>> don't see much value in collecting them to instances within a uuid.
> > >>>
> > >>
> > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > >> instance was introduced to support multiple devices in a VM.
> > >
> > > The instance number was never required in order to support multiple
> > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > devices with that same UUID and therefore associate udev events to a
> > > given VM.  Only then does an instance number become necessary since the
> > > UUID needs to be static for the vGPUs within a VM.  This has always felt
> > > like a very dodgy solution when we should probably just be querying
> > > libvirt to give us a device to VM association.
> 
> Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> for mdev in the basic design. It's bound to NVIDIA management stack too tightly.
> 
> I'm OK to give enough flexibility for various upper level management stacks,
> e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> option where either UUID or STRING could be optional? Upper management 
> stack can choose its own policy to identify a mdev:
> 
> a) $UUID only, so each mdev is allocated with a unique UUID
> b) STRING only, which could be an index (0, 1, 2, ...), or any combination 
> (vgpu0, vgpu1, etc.)
> c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> a numeric index
> 
> > >
> > >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> > >> all instances of similar devices assigned to VM.
> > >>
> > >> For example, to create 2 devices:
> > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > >>
> > >> "$UUID-0" and "$UUID-1" devices are created.
> > >>
> > >> Commit resources for above devices with single 'mdev_start':
> > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > >>
> > >> Considering $UUID to be a unique UUID of a device, we don't need
> > >> 'instance', so 'mdev_create' would look like:
> > >>
> > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > >>
> > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > >> would be vendor specific parameters.
> > >>
> > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > >>
> > >> Then 'mdev_start' would be:
> > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > >>
> > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > >>
> > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop
> > >
> > > I'm not sure a comma separated list makes sense here, for both
> > > simplicity in the kernel and more fine grained error reporting, we
> > > probably want to start/stop them individually.  Actually, why is it
> > > that we can't use the mediated device being opened and released to
> > > automatically signal to the backend vendor driver to commit and release
> > > resources? I don't fully understand why userspace needs this interface.
> 
> There is a meaningful use of the start/stop interface, as required for live
> migration support. Such an interface allows the vendor driver to quiesce
> mdev activity on the source device before the mdev hardware state is
> snapshotted, and then resume mdev activity on the dest device after its
> state is recovered. Intel has implemented experimental live migration
> support in KVMGT (soon to be released), based on the above two interfaces
> (plus another two to get/set mdev state).
> 
> > >
> > 
> > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > one shot to commit resources of all vGPUs assigned to a VM along with
> > some common resources.
> 
> Kirti, can you elaborate on the background of the above one-shot commit
> requirement? It's hard to understand such a requirement.
> 
> As I replied in another mail, I really hope start/stop becomes a per-mdev
> attribute instead of a global one, e.g.:
> 
> echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> 
> In many scenarios the user space client may only want to talk to the mdev
> instance directly, w/o needing to contact its parent device. Still take
> live migration for example: I don't think Qemu wants to know the parent
> device of assigned mdev instances.

Hi Kevin,

Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
the parent device. You can just do

echo "mdev_UUID" > /sys/class/mdev/mdev_start

or 

echo "mdev_UUID" > /sys/class/mdev/mdev_stop

without knowing the parent device.

Thanks,
Neo

> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 100+ messages in thread
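
The flow Neo describes can be exercised with a similar mock tree; the point being demonstrated is only that both writes are keyed by the mdev UUID, so the caller never needs the parent device path (the paths and UUID below are illustrative):

```shell
#!/bin/sh
set -eu

# Mock /sys tree; a real system would write to the actual mdev class nodes.
SYS=$(mktemp -d)
mkdir -p "$SYS/class/mdev"
: > "$SYS/class/mdev/mdev_start"
: > "$SYS/class/mdev/mdev_stop"

MDEV_UUID=12345678-1234-1234-1234-123456789abc

# Start, then stop, by UUID alone; no parent device path is involved.
echo "$MDEV_UUID" > "$SYS/class/mdev/mdev_start"
echo "$MDEV_UUID" > "$SYS/class/mdev/mdev_stop"
```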

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-15 15:59               ` [Qemu-devel] " Alex Williamson
@ 2016-08-15 22:09                 ` Neo Jia
  -1 siblings, 0 replies; 100+ messages in thread
From: Neo Jia @ 2016-08-15 22:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm,
	Song, Jike, bjsdjshi

On Mon, Aug 15, 2016 at 09:59:26AM -0600, Alex Williamson wrote:
> On Mon, 15 Aug 2016 09:38:52 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > Sent: Saturday, August 13, 2016 8:37 AM
> > > 
> > > 
> > > 
> > > On 8/13/2016 2:46 AM, Alex Williamson wrote:  
> > > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >  
> > > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:  
> > > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >>>
> > > >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> > > >>> the parent_device so it can call the start and stop ops callbacks
> > > >>> respectively.  That seems to imply that all instances for a given
> > > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > > >>> still having a hard time buying into the uuid+instance plan when it
> > > >>> seems like each mdev_device should have an actual unique uuid.
> > > >>> Userspace tools can figure out which uuids to start for a given user, I
> > > >>> don't see much value in collecting them to instances within a uuid.
> > > >>>  
> > > >>
> > > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > > >> instance was introduced to support multiple devices in a VM.  
> > > >
> > > > The instance number was never required in order to support multiple
> > > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > > devices with that same UUID and therefore associate udev events to a
> > > > given VM.  Only then does an instance number become necessary since the
> > > > UUID needs to be static for the vGPUs within a VM.  This has always felt
> > > > like a very dodgy solution when we should probably just be querying
> > > > libvirt to give us a device to VM association.  
> > 
> > Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> > for mdev in the basic design. It's bound to NVIDIA management stack too tightly.
> > 
> > I'm OK to give enough flexibility for various upper level management stacks,
> > e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> > option where either UUID or STRING could be optional? Upper management 
> > stack can choose its own policy to identify a mdev:
> > 
> > a) $UUID only, so each mdev is allocated with a unique UUID
> > b) STRING only, which could be an index (0, 1, 2, ...), or any combination 
> > (vgpu0, vgpu1, etc.)
> > c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> > a numeric index
> > 
> > > >  
> > > >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> > > >> all instances of similar devices assigned to VM.
> > > >>
> > > >> For example, to create 2 devices:
> > > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > > >>
> > > >> "$UUID-0" and "$UUID-1" devices are created.
> > > >>
> > > >> Commit resources for above devices with single 'mdev_start':
> > > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > > >>
> > > >> Considering $UUID to be a unique UUID of a device, we don't need
> > > >> 'instance', so 'mdev_create' would look like:
> > > >>
> > > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > > >>
> > > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > > >> would be vendor specific parameters.
> > > >>
> > > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > > >>
> > > >> Then 'mdev_start' would be:
> > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > > >>
> > > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > > >>
> > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop  
> > > >
> > > > I'm not sure a comma separated list makes sense here, for both
> > > > simplicity in the kernel and more fine grained error reporting, we
> > > > probably want to start/stop them individually.  Actually, why is it
> > > > that we can't use the mediated device being opened and released to
> > > > automatically signal to the backend vendor driver to commit and release
> > > > resources? I don't fully understand why userspace needs this interface.  
> > 
> > There is a meaningful use of the start/stop interface, as required for live
> > migration support. Such an interface allows the vendor driver to quiesce
> > mdev activity on the source device before the mdev hardware state is
> > snapshotted, and then resume mdev activity on the dest device after its
> > state is recovered. Intel has implemented experimental live migration
> > support in KVMGT (soon to be released), based on the above two interfaces
> > (plus another two to get/set mdev state).
> 
> Ok, that's actually an interesting use case for start/stop.
> 
> > > 
> > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > some common resources.  
> > 
> > Kirti, can you elaborate on the background of the above one-shot commit
> > requirement? It's hard to understand such a requirement.
> 
> Agree, I know NVIDIA isn't planning to support hotplug initially, but
> this seems like we're precluding hotplug from the design.  I don't
> understand what's driving this one-shot requirement.

Hi Alex,

The requirement here is based on how our internal vGPU device model is
designed; with it we are able to pre-allocate the resources required for
multiple virtual devices within the same domain.

And I don't think this syntax will stop us from supporting hotplug at all.

For example, you can always create a virtual mdev and then do

echo "mdev_UUID" > /sys/class/mdev/mdev_start

then use QEMU monitor to add the device for hotplug.
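
The create → start → hot-add sequence sketched above can be written out end to end. The sysfs paths are mocked under a temp directory, the `params` string is a vendor-specific placeholder, and the monitor command in the comment is only one plausible form of the eventual QEMU hotplug step:

```shell
#!/bin/sh
set -eu

# Mock trees for a parent device and the mdev class (illustrative paths).
SYS=$(mktemp -d)
mkdir -p "$SYS/devices/parent" "$SYS/class/mdev"
: > "$SYS/devices/parent/mdev_create"
: > "$SYS/class/mdev/mdev_start"

MDEV_UUID=98fc951b-1234-4abc-8def-123456789abc

# 1) Create the mdev on its parent device with vendor-specific params.
echo "$MDEV_UUID:params" > "$SYS/devices/parent/mdev_create"

# 2) Commit its resources.
echo "$MDEV_UUID" > "$SYS/class/mdev/mdev_start"

# 3) Hot-add it to a running guest from the QEMU monitor, e.g.:
#      device_add vfio-pci,sysfsdev=/sys/class/mdev/$MDEV_UUID
#    (the exact device_add syntax for mdev was still being settled)
```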

> 
> > As I replied in another mail, I really hope start/stop becomes a per-mdev
> > attribute instead of a global one, e.g.:
> > 
> > echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > 
> > In many scenarios the user space client may only want to talk to the mdev
> > instance directly, w/o needing to contact its parent device. Still take
> > live migration for example: I don't think Qemu wants to know the parent
> > device of assigned mdev instances.
> 
> Yep, QEMU won't know the parent device, only libvirt level tools
> managing the creation and destruction of the mdev device would know
> that.  Perhaps in addition to migration uses we could even use
> start/stop for basic power management, device D3 state in the guest
> could translate to a stop command to remove that vGPU from scheduling
> while still retaining most of the state and resource allocations.

Just to recap what I replied to Kevin in his previous email: the current
mdev_start and mdev_stop don't require any knowledge of the parent device.

Thanks,
Neo

> Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver
@ 2016-08-15 22:09                 ` Neo Jia
  0 siblings, 0 replies; 100+ messages in thread
From: Neo Jia @ 2016-08-15 22:09 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm,
	Song, Jike, bjsdjshi

On Mon, Aug 15, 2016 at 09:59:26AM -0600, Alex Williamson wrote:
> On Mon, 15 Aug 2016 09:38:52 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > Sent: Saturday, August 13, 2016 8:37 AM
> > > 
> > > 
> > > 
> > > On 8/13/2016 2:46 AM, Alex Williamson wrote:  
> > > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >  
> > > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:  
> > > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >>>
> > > >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> > > >>> the parent_device so it can call the start and stop ops callbacks
> > > >>> respectively.  That seems to imply that all of instances for a given
> > > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > > >>> still having a hard time buying into the uuid+instance plan when it
> > > >>> seems like each mdev_device should have an actual unique uuid.
> > > >>> Userspace tools can figure out which uuids to start for a given user, I
> > > >>> don't see much value in collecting them to instances within a uuid.
> > > >>>  
> > > >>
> > > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > > >> instance was introduced to support multiple devices in a VM.  
> > > >
> > > > The instance number was never required in order to support multiple
> > > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > > devices with that same UUID and therefore associate udev events to a
> > > > given VM.  Only then does an instance number become necessary since the
> > > > UUID needs to be static for a vGPUs within a VM.  This has always felt
> > > > like a very dodgy solution when we should probably just be querying
> > > > libvirt to give us a device to VM association.  
> > 
> > Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> > for mdev in the basic design. It's bound to NVIDIA management stack too tightly.
> > 
> > I'm OK to give enough flexibility for various upper level management stacks,
> > e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> > option where either UUID or STRING could be optional? Upper management 
> > stack can choose its own policy to identify a mdev:
> > 
> > a) $UUID only, so each mdev is allocated with a unique UUID
> > b) STRING only, which could be an index (0, 1, 2, ...), or any combination 
> > (vgpu0, vgpu1, etc.)
> > c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> > a numeric index
> > 
> > > >  
> > > >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> > > >> all instances of similar devices assigned to VM.
> > > >>
> > > >> For example, to create 2 devices:
> > > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > > >>
> > > >> "$UUID-0" and "$UUID-1" devices are created.
> > > >>
> > > >> Commit resources for above devices with single 'mdev_start':
> > > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > > >>
> > > >> Considering $UUID to be a unique UUID of a device, we don't need
> > > >> 'instance', so 'mdev_create' would look like:
> > > >>
> > > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > > >>
> > > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > > >> would be vendor specific parameters.
> > > >>
> > > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > > >>
> > > >> Then 'mdev_start' would be:
> > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > > >>
> > > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > > >>
> > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop  
> > > >
> > > > I'm not sure a comma separated list makes sense here, for both
> > > > simplicity in the kernel and more fine grained error reporting, we
> > > > probably want to start/stop them individually.  Actually, why is it
> > > > that we can't use the mediated device being opened and released to
> > > > automatically signal to the backend vendor driver to commit and release
> > > > resources? I don't fully understand why userspace needs this interface.  
> > 
> > There is a meaningful use of start/stop interface, as required in live
> > migration support. Such an interface allows the vendor driver to quiesce
> > mdev activity on source device before mdev hardware state is snapshot,
> > and then resume mdev activity on dest device after its state is recovered.
> > Intel has implemented experimental live migration support in KVMGT (soon
> > to release), based on above two interfaces (plus another two to get/set
> > mdev state).
> 
> Ok, that's actually an interesting use case for start/stop.
> 
> > > 
> > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > some common resources.  
> > 
> > Kirti, can you elaborate on the background of the above one-shot commit
> > requirement? It's hard to understand such a requirement.
> 
> Agree, I know NVIDIA isn't planning to support hotplug initially, but
> this seems like we're precluding hotplug from the design.  I don't
> understand what's driving this one-shot requirement.

Hi Alex,

The requirement here is based on how our internal vGPU device model is
designed; with it we are able to pre-allocate the resources required for
multiple virtual devices within the same domain.

And I don't think this syntax will stop us from supporting hotplug at all.

For example, you can always create a virtual mdev and then do

echo "mdev_UUID" > /sys/class/mdev/mdev_start

then use QEMU monitor to add the device for hotplug.
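To make that hot-add flow concrete, here is a sketch that only prints the three steps (the sysfs paths follow the proposal in this thread, and the QEMU monitor "device_add ... sysfsdev=" syntax is a hypothetical placeholder, not a settled interface):

```shell
# Print the hot-add sequence for a single mdev.  Nothing here touches real
# sysfs; it only emits the commands an admin would run.
mdev_hotplug_steps() {
    uuid=$1
    printf 'echo "%s:params" > /sys/devices/../mdev_create\n' "$uuid"
    printf 'echo "%s" > /sys/class/mdev/mdev_start\n' "$uuid"
    # Hypothetical monitor syntax for attaching the started mdev to the VM:
    printf '(qemu) device_add vfio-pci,sysfsdev=/sys/bus/mdev/devices/%s\n' "$uuid"
}

mdev_hotplug_steps 12345678-1234-1234-1234-123456789abc
```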

> 
> > As I replied in another mail, I really hope start/stop becomes a per-mdev
> > attribute instead of global one, e.g.:
> > 
> > echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > 
> > In many scenarios the user-space client may only want to talk to the mdev
> > instance directly, w/o need to contact its parent device. Still take
> > live migration for example, I don't think Qemu wants to know parent
> > device of assigned mdev instances.
> 
> Yep, QEMU won't know the parent device, only libvirt level tools
> managing the creation and destruction of the mdev device would know
> that.  Perhaps in addition to migration uses we could even use
> start/stop for basic power management, device D3 state in the guest
> could translate to a stop command to remove that vGPU from scheduling
> while still retaining most of the state and resource allocations.

Just to recap what I replied to Kevin in his previous email: the current
mdev_start and mdev_stop don't require any knowledge of the parent device.

Thanks,
Neo

> Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-15 19:59               ` [Qemu-devel] " Neo Jia
  (?)
@ 2016-08-15 22:47               ` Alex Williamson
  2016-08-15 23:54                   ` [Qemu-devel] " Neo Jia
                                   ` (2 more replies)
  -1 siblings, 3 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-15 22:47 UTC (permalink / raw)
  To: Neo Jia
  Cc: Tian, Kevin, Song, Jike, kvm, qemu-devel, Kirti Wankhede, kraxel,
	pbonzini, bjsdjshi

On Mon, 15 Aug 2016 12:59:08 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Mon, Aug 15, 2016 at 09:38:52AM +0000, Tian, Kevin wrote:
> > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > Sent: Saturday, August 13, 2016 8:37 AM
> > > 
> > > 
> > > 
> > > On 8/13/2016 2:46 AM, Alex Williamson wrote:  
> > > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >  
> > > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:  
> > > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >>>
> > > >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> > > >>> the parent_device so it can call the start and stop ops callbacks
> > > >>> respectively.  That seems to imply that all instances for a given
> > > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > > >>> still having a hard time buying into the uuid+instance plan when it
> > > >>> seems like each mdev_device should have an actual unique uuid.
> > > >>> Userspace tools can figure out which uuids to start for a given user, I
> > > >>> don't see much value in collecting them to instances within a uuid.
> > > >>>  
> > > >>
> > > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > > >> instance was introduced to support multiple devices in a VM.  
> > > >
> > > > The instance number was never required in order to support multiple
> > > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > > devices with that same UUID and therefore associate udev events to a
> > > > given VM.  Only then does an instance number become necessary since the
> > > > UUID needs to be static for vGPUs within a VM.  This has always felt
> > > > like a very dodgy solution when we should probably just be querying
> > > > libvirt to give us a device to VM association.  
> > 
> > Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> > for mdev in the basic design. It's bound to NVIDIA management stack too tightly.
> > 
> > I'm OK to give enough flexibility for various upper level management stacks,
> > e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> > option where either UUID or STRING could be optional? Upper management 
> > stack can choose its own policy to identify a mdev:
> > 
> > a) $UUID only, so each mdev is allocated with a unique UUID
> > b) STRING only, which could be an index (0, 1, 2, ...), or any combination 
> > (vgpu0, vgpu1, etc.)
> > c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> > a numeric index
> >   
> > > >  
> > > >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> > > >> all instances of similar devices assigned to VM.
> > > >>
> > > >> For example, to create 2 devices:
> > > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > > >>
> > > >> "$UUID-0" and "$UUID-1" devices are created.
> > > >>
> > > >> Commit resources for above devices with single 'mdev_start':
> > > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > > >>
> > > >> Considering $UUID to be a unique UUID of a device, we don't need
> > > >> 'instance', so 'mdev_create' would look like:
> > > >>
> > > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > > >>
> > > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > > >> would be vendor specific parameters.
> > > >>
> > > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > > >>
> > > >> Then 'mdev_start' would be:
> > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > > >>
> > > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > > >>
> > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop  
> > > >
> > > > I'm not sure a comma separated list makes sense here, for both
> > > > simplicity in the kernel and more fine grained error reporting, we
> > > > probably want to start/stop them individually.  Actually, why is it
> > > > that we can't use the mediated device being opened and released to
> > > > automatically signal to the backend vendor driver to commit and release
> > > > resources? I don't fully understand why userspace needs this interface.  
> > 
> > There is a meaningful use of start/stop interface, as required in live
> > migration support. Such an interface allows the vendor driver to quiesce
> > mdev activity on source device before mdev hardware state is snapshot,
> > and then resume mdev activity on dest device after its state is recovered.
> > Intel has implemented experimental live migration support in KVMGT (soon
> > to release), based on above two interfaces (plus another two to get/set
> > mdev state).
> >   
> > > >  
> > > 
> > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > some common resources.  
> > 
> > Kirti, can you elaborate on the background of the above one-shot commit
> > requirement? It's hard to understand such a requirement.
> > 
> > As I replied in another mail, I really hope start/stop becomes a per-mdev
> > attribute instead of global one, e.g.:
> > 
> > echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > 
> > In many scenarios the user-space client may only want to talk to the mdev
> > instance directly, w/o need to contact its parent device. Still take
> > live migration for example, I don't think Qemu wants to know parent
> > device of assigned mdev instances.  
> 
> Hi Kevin,
> 
> Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
> the parent device. You can just do
> 
> echo "mdev_UUID" > /sys/class/mdev/mdev_start
> 
> or 
> 
> echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> 
> without knowing the parent device.

That doesn't give an individual user the ability to stop and start
their own devices, though: granting a user write permission on those
global files also grants them the ability to DoS other users by pumping
arbitrary UUIDs into them.  By placing start/stop per mdev, we get
mdev-level granularity in granting start/stop privileges.  Really,
though, do we want QEMU fumbling around in sysfs, or do we want an
interface through the vfio API to perform start/stop?  Thanks,

Alex
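A minimal sketch of the permission argument above, using a mock sysfs tree (the per-mdev "start" node is only the proposal under discussion, not an existing kernel interface, and all paths here are illustrative):

```shell
# Mock the two layouts to show the granularity difference.  With one global
# mdev_start node, any writer can start/stop anyone's mdev; a per-mdev start
# node can be owned by (and restricted to) the device's user.
root=$(mktemp -d)
uuid=12345678-1234-1234-1234-123456789abc

mkdir -p "$root/class/mdev/$uuid"
: > "$root/class/mdev/mdev_start"      # global: one file shared by all users
: > "$root/class/mdev/$uuid/start"     # per-mdev: one file per device

# The per-device node would be chowned to the VM's user and restricted:
chmod 0600 "$root/class/mdev/$uuid/start"

# Toggling it then affects only this mdev:
echo 1 > "$root/class/mdev/$uuid/start"
```

Write access on the global file necessarily allows writing arbitrary UUIDs; write access on the per-device file does not.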

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-15 22:09                 ` [Qemu-devel] " Neo Jia
@ 2016-08-15 22:52                   ` Alex Williamson
  -1 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-15 22:52 UTC (permalink / raw)
  To: Neo Jia
  Cc: Tian, Kevin, Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm,
	Song, Jike, bjsdjshi

On Mon, 15 Aug 2016 15:09:30 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Mon, Aug 15, 2016 at 09:59:26AM -0600, Alex Williamson wrote:
> > On Mon, 15 Aug 2016 09:38:52 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > > Sent: Saturday, August 13, 2016 8:37 AM
> > > > 
> > > > 
> > > > 
> > > > On 8/13/2016 2:46 AM, Alex Williamson wrote:    
> > > > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >    
> > > > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:    
> > > > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >>>
> > > > >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> > > > >>> the parent_device so it can call the start and stop ops callbacks
> > > > >>> respectively.  That seems to imply that all instances for a given
> > > > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > > > >>> still having a hard time buying into the uuid+instance plan when it
> > > > >>> seems like each mdev_device should have an actual unique uuid.
> > > > >>> Userspace tools can figure out which uuids to start for a given user, I
> > > > >>> don't see much value in collecting them to instances within a uuid.
> > > > >>>    
> > > > >>
> > > > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > > > >> instance was introduced to support multiple devices in a VM.    
> > > > >
> > > > > The instance number was never required in order to support multiple
> > > > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > > > devices with that same UUID and therefore associate udev events to a
> > > > > given VM.  Only then does an instance number become necessary since the
> > > > > UUID needs to be static for vGPUs within a VM.  This has always felt
> > > > > like a very dodgy solution when we should probably just be querying
> > > > > libvirt to give us a device to VM association.    
> > > 
> > > Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> > > for mdev in the basic design. It's bound to NVIDIA management stack too tightly.
> > > 
> > > I'm OK to give enough flexibility for various upper level management stacks,
> > > e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> > > option where either UUID or STRING could be optional? Upper management 
> > > stack can choose its own policy to identify a mdev:
> > > 
> > > a) $UUID only, so each mdev is allocated with a unique UUID
> > > b) STRING only, which could be an index (0, 1, 2, ...), or any combination 
> > > (vgpu0, vgpu1, etc.)
> > > c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> > > a numeric index
> > >   
> > > > >    
> > > > >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> > > > >> all instances of similar devices assigned to VM.
> > > > >>
> > > > >> For example, to create 2 devices:
> > > > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > > > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > > > >>
> > > > >> "$UUID-0" and "$UUID-1" devices are created.
> > > > >>
> > > > >> Commit resources for above devices with single 'mdev_start':
> > > > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > > > >>
> > > > >> Considering $UUID to be a unique UUID of a device, we don't need
> > > > >> 'instance', so 'mdev_create' would look like:
> > > > >>
> > > > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > > > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > > > >>
> > > > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > > > >> would be vendor specific parameters.
> > > > >>
> > > > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > > > >>
> > > > >> Then 'mdev_start' would be:
> > > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > > > >>
> > > > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > > > >>
> > > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop    
> > > > >
> > > > > I'm not sure a comma separated list makes sense here, for both
> > > > > simplicity in the kernel and more fine grained error reporting, we
> > > > > probably want to start/stop them individually.  Actually, why is it
> > > > > that we can't use the mediated device being opened and released to
> > > > > automatically signal to the backend vendor driver to commit and release
> > > > > resources? I don't fully understand why userspace needs this interface.    
> > > 
> > > There is a meaningful use of start/stop interface, as required in live
> > > migration support. Such an interface allows the vendor driver to quiesce
> > > mdev activity on source device before mdev hardware state is snapshot,
> > > and then resume mdev activity on dest device after its state is recovered.
> > > Intel has implemented experimental live migration support in KVMGT (soon
> > > to release), based on above two interfaces (plus another two to get/set
> > > mdev state).  
> > 
> > Ok, that's actually an interesting use case for start/stop.
> >   
> > > > 
> > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > some common resources.    
> > > 
> > > Kirti, can you elaborate on the background of the above one-shot commit
> > > requirement? It's hard to understand such a requirement.
> > 
> > Agree, I know NVIDIA isn't planning to support hotplug initially, but
> > this seems like we're precluding hotplug from the design.  I don't
> > understand what's driving this one-shot requirement.  
> 
> Hi Alex,
> 
> > The requirement here is based on how our internal vGPU device model is
> > designed; with it we are able to pre-allocate the resources required for
> > multiple virtual devices within the same domain.
> 
> And I don't think this syntax will stop us from supporting hotplug at all.
> 
> For example, you can always create a virtual mdev and then do
> 
> echo "mdev_UUID" > /sys/class/mdev/mdev_start
> 
> then use QEMU monitor to add the device for hotplug.

Hi Neo,

I'm still not understanding the advantage you get from the "one-shot"
approach, then, if we can always add more mdevs by starting them later.
Are the hotplug mdevs somehow less capable than the initial set of
mdevs added in a single shot?  If the initial set is allocated
from the "same domain", does that give them some sort of hardware
locality/resource benefit?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-15 22:52                   ` [Qemu-devel] " Alex Williamson
@ 2016-08-15 23:23                     ` Neo Jia
  -1 siblings, 0 replies; 100+ messages in thread
From: Neo Jia @ 2016-08-15 23:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm,
	Song, Jike, bjsdjshi

On Mon, Aug 15, 2016 at 04:52:39PM -0600, Alex Williamson wrote:
> On Mon, 15 Aug 2016 15:09:30 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > On Mon, Aug 15, 2016 at 09:59:26AM -0600, Alex Williamson wrote:
> > > On Mon, 15 Aug 2016 09:38:52 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >   
> > > > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > > > Sent: Saturday, August 13, 2016 8:37 AM
> > > > > 
> > > > > 
> > > > > 
> > > > > On 8/13/2016 2:46 AM, Alex Williamson wrote:    
> > > > > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > > >    
> > > > > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:    
> > > > > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > > > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > > >>>
> > > > > >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> > > > > >>> the parent_device so it can call the start and stop ops callbacks
> > > > > >>> respectively.  That seems to imply that all instances for a given
> > > > > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > > > > >>> still having a hard time buying into the uuid+instance plan when it
> > > > > >>> seems like each mdev_device should have an actual unique uuid.
> > > > > >>> Userspace tools can figure out which uuids to start for a given user, I
> > > > > >>> don't see much value in collecting them to instances within a uuid.
> > > > > >>>    
> > > > > >>
> > > > > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > > > > >> instance was introduced to support multiple devices in a VM.    
> > > > > >
> > > > > > The instance number was never required in order to support multiple
> > > > > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > > > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > > > > devices with that same UUID and therefore associate udev events to a
> > > > > > given VM.  Only then does an instance number become necessary since the
> > > > > > UUID needs to be static for vGPUs within a VM.  This has always felt
> > > > > > like a very dodgy solution when we should probably just be querying
> > > > > > libvirt to give us a device to VM association.    
> > > > 
> > > > Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> > > > for mdev in the basic design. It's bound to NVIDIA management stack too tightly.
> > > > 
> > > > I'm OK to give enough flexibility for various upper level management stacks,
> > > > e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> > > > option where either UUID or STRING could be optional? Upper management 
> > > > stack can choose its own policy to identify a mdev:
> > > > 
> > > > a) $UUID only, so each mdev is allocated with a unique UUID
> > > > b) STRING only, which could be an index (0, 1, 2, ...), or any combination 
> > > > (vgpu0, vgpu1, etc.)
> > > > c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> > > > a numeric index
> > > >   
> > > > > >    
> > > > > >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> > > > > >> all instances of similar devices assigned to VM.
> > > > > >>
> > > > > >> For example, to create 2 devices:
> > > > > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > > > > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > > > > >>
> > > > > >> "$UUID-0" and "$UUID-1" devices are created.
> > > > > >>
> > > > > >> Commit resources for above devices with single 'mdev_start':
> > > > > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > > > > >>
> > > > > >> Considering $UUID to be a unique UUID of a device, we don't need
> > > > > >> 'instance', so 'mdev_create' would look like:
> > > > > >>
> > > > > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > > > > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > > > > >>
> > > > > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > > > > >> would be vendor specific parameters.
> > > > > >>
> > > > > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > > > > >>
> > > > > >> Then 'mdev_start' would be:
> > > > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > > > > >>
> > > > > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > > > > >>
> > > > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop    
> > > > > >
> > > > > > I'm not sure a comma separated list makes sense here, for both
> > > > > > simplicity in the kernel and more fine grained error reporting, we
> > > > > > probably want to start/stop them individually.  Actually, why is it
> > > > > > that we can't use the mediated device being opened and released to
> > > > > > automatically signal to the backend vendor driver to commit and release
> > > > > > resources? I don't fully understand why userspace needs this interface.    
> > > > 
> > > > There is a meaningful use of start/stop interface, as required in live
> > > > migration support. Such an interface allows the vendor driver to quiesce
> > > > mdev activity on source device before mdev hardware state is snapshot,
> > > > and then resume mdev activity on dest device after its state is recovered.
> > > > Intel has implemented experimental live migration support in KVMGT (soon
> > > > to release), based on above two interfaces (plus another two to get/set
> > > > mdev state).  
> > > 
> > > Ok, that's actually an interesting use case for start/stop.
> > >   
> > > > > 
> > > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > > some common resources.    
> > > > 
> > > > Kirti, can you elaborate on the background of the above one-shot commit
> > > > requirement? It's hard to understand such a requirement.
> > > 
> > > Agree, I know NVIDIA isn't planning to support hotplug initially, but
> > > this seems like we're precluding hotplug from the design.  I don't
> > > understand what's driving this one-shot requirement.  
> > 
> > Hi Alex,
> > 
> > The requirement here is based on how our internal vGPU device model designed and
> > with this we are able to pre-allocate resources required for multiple virtual
> > devices within same domain.
> > 
> > And I don't think this syntax will stop us from supporting hotplug at all.
> > 
> > For example, you can always create a virtual mdev and then do
> > 
> > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > 
> > then use QEMU monitor to add the device for hotplug.
> 
> Hi Neo,
> 
> I'm still not understanding the advantage you get from the "one-shot"
> approach then if we can always add more mdevs by starting them later.
> Are the hotplug mdevs somehow less capable than the initial set of
> mdevs added in a single shot?  If the initial set is allocated
> from the "same domain", does that give them some sort of hardware
> locality/resource benefit?  Thanks,

Hi Alex,

At least we will not be able to guarantee certain special hardware resources
for hotplugged devices.

Also, from our point of view, we have a dedicated internal SW entity that
manages all virtual devices for each "domain/virtual machine", and that SW
entity is created at virtual device start time.

This is why we need to do it in one shot to support the
multiple-virtual-devices-per-VM case.

Thanks,
Neo

> 
> Alex
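
For reference, the full lifecycle being discussed here — a one-shot commit at
VM start plus an individually started hotplug device — can be sketched as
shell against a mock sysfs root. The node names and write formats are taken
from the proposal in this thread, not from a merged kernel interface:

```shell
#!/bin/sh
# Mock sysfs root so the sketch runs without the mdev module; on a real
# system these nodes would live under /sys and be backed by the driver.
SYS=$(mktemp -d)
mkdir -p "$SYS/devices/parent" "$SYS/class/mdev"

UUID1=11111111-1111-1111-1111-111111111111
UUID2=22222222-2222-2222-2222-222222222222

# One-shot path: create both mdevs, then commit their resources together
# so the vendor driver can reserve shared/special hardware up front.
echo "$UUID1:params" >  "$SYS/devices/parent/mdev_create"
echo "$UUID2:params" >> "$SYS/devices/parent/mdev_create"
echo "$UUID1, $UUID2" > "$SYS/class/mdev/mdev_start"

# Hotplug path: create and start one more mdev later, then attach it to
# the running guest via the QEMU monitor (device_add).
UUID3=33333333-3333-3333-3333-333333333333
echo "$UUID3:params" >> "$SYS/devices/parent/mdev_create"
echo "$UUID3"        >> "$SYS/class/mdev/mdev_start"
```

The hotplugged device follows the same start path; the only difference is
that its resources are committed outside the initial one-shot group, which
is exactly the guarantee Neo says cannot be extended to it.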

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver
@ 2016-08-15 23:23                     ` Neo Jia
  0 siblings, 0 replies; 100+ messages in thread
From: Neo Jia @ 2016-08-15 23:23 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm,
	Song, Jike, bjsdjshi

On Mon, Aug 15, 2016 at 04:52:39PM -0600, Alex Williamson wrote:
> On Mon, 15 Aug 2016 15:09:30 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > On Mon, Aug 15, 2016 at 09:59:26AM -0600, Alex Williamson wrote:
> > > On Mon, 15 Aug 2016 09:38:52 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >   
> > > > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > > > Sent: Saturday, August 13, 2016 8:37 AM
> > > > > 
> > > > > 
> > > > > 
> > > > > On 8/13/2016 2:46 AM, Alex Williamson wrote:    
> > > > > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > > >    
> > > > > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:    
> > > > > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > > > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > > >>>
> > > > > >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> > > > > >>> the parent_device so it can call the start and stop ops callbacks
> > > > > >>> respectively.  That seems to imply that all of instances for a given
> > > > > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > > > > >>> still having a hard time buying into the uuid+instance plan when it
> > > > > >>> seems like each mdev_device should have an actual unique uuid.
> > > > > >>> Userspace tools can figure out which uuids to start for a given user, I
> > > > > >>> don't see much value in collecting them to instances within a uuid.
> > > > > >>>    
> > > > > >>
> > > > > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > > > > >> instance was introduced to support multiple devices in a VM.    
> > > > > >
> > > > > > The instance number was never required in order to support multiple
> > > > > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > > > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > > > > devices with that same UUID and therefore associate udev events to a
> > > > > > given VM.  Only then does an instance number become necessary since the
> > > > > > UUID needs to be static for a vGPUs within a VM.  This has always felt
> > > > > > like a very dodgy solution when we should probably just be querying
> > > > > > libvirt to give us a device to VM association.    
> > > > 
> > > > Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> > > > for mdev in the basic design. It's bound to NVIDIA management stack too tightly.
> > > > 
> > > > I'm OK to give enough flexibility for various upper level management stacks,
> > > > e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> > > > option where either UUID or STRING could be optional? Upper management 
> > > > stack can choose its own policy to identify a mdev:
> > > > 
> > > > a) $UUID only, so each mdev is allocated with a unique UUID
> > > > b) STRING only, which could be an index (0, 1, 2, ...), or any combination 
> > > > (vgpu0, vgpu1, etc.)
> > > > c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> > > > a numeric index
> > > >   
> > > > > >    
> > > > > >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> > > > > >> all instances of similar devices assigned to VM.
> > > > > >>
> > > > > >> For example, to create 2 devices:
> > > > > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > > > > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > > > > >>
> > > > > >> "$UUID-0" and "$UUID-1" devices are created.
> > > > > >>
> > > > > >> Commit resources for above devices with single 'mdev_start':
> > > > > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > > > > >>
> > > > > >> Considering $UUID to be a unique UUID of a device, we don't need
> > > > > >> 'instance', so 'mdev_create' would look like:
> > > > > >>
> > > > > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > > > > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > > > > >>
> > > > > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > > > > >> would be vendor specific parameters.
> > > > > >>
> > > > > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > > > > >>
> > > > > >> Then 'mdev_start' would be:
> > > > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > > > > >>
> > > > > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > > > > >>
> > > > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop    
> > > > > >
> > > > > > I'm not sure a comma separated list makes sense here, for both
> > > > > > simplicity in the kernel and more fine grained error reporting, we
> > > > > > probably want to start/stop them individually.  Actually, why is it
> > > > > > that we can't use the mediated device being opened and released to
> > > > > > automatically signal to the backend vendor driver to commit and release
> > > > > > resources? I don't fully understand why userspace needs this interface.    
> > > > 
> > > > There is a meaningful use of start/stop interface, as required in live
> > > > migration support. Such interface allows vendor driver to quiesce
> > > > mdev activity on source device before mdev hardware state is snapshot,
> > > > and then resume mdev activity on dest device after its state is recovered.
> > > > Intel has implemented experimental live migration support in KVMGT (soon
> > > > to release), based on above two interfaces (plus another two to get/set
> > > > mdev state).  
> > > 
> > > Ok, that's actually an interesting use case for start/stop.
> > >   
> > > > > 
> > > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > > some common resources.    
> > > > 
> > > > Kirti, can you elaborate the background about above one-shot commit
> > > > requirement? It's hard to understand such a requirement.   
> > > 
> > > Agree, I know NVIDIA isn't planning to support hotplug initially, but
> > > this seems like we're precluding hotplug from the design.  I don't
> > > understand what's driving this one-shot requirement.  
> > 
> > Hi Alex,
> > 
> > The requirement here is based on how our internal vGPU device model designed and
> > with this we are able to pre-allocate resources required for multiple virtual
> > devices within same domain.
> > 
> > And I don't think this syntax will stop us from supporting hotplug at all.
> > 
> > For example, you can always create a virtual mdev and then do
> > 
> > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > 
> > then use QEMU monitor to add the device for hotplug.
> 
> Hi Neo,
> 
> I'm still not understanding the advantage you get from the "one-shot"
> approach then if we can always add more mdevs by starting them later.
> Are the hotplug mdevs somehow less capable than the initial set of
> mdevs added in a single shot?  If the initial set is allocated
> from the "same domain", does that give them some sort of hardware
> locality/resource benefit?  Thanks,

Hi Alex,

At least we will not be able to guarantee certain special hardware resources
for hotplugged devices.

Also, from our point of view, we have a dedicated internal SW entity that
manages all virtual devices for each "domain/virtual machine", and that SW
entity is created at virtual device start time.

This is why we need to do it in one shot to support the
multiple-virtual-devices-per-VM case.

Thanks,
Neo

> 
> Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-15 22:47               ` Alex Williamson
@ 2016-08-15 23:54                   ` Neo Jia
  2016-08-16  0:18                   ` [Qemu-devel] " Tian, Kevin
  2016-08-16 20:30                 ` Neo Jia
  2 siblings, 0 replies; 100+ messages in thread
From: Neo Jia @ 2016-08-15 23:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, kvm, Song, Jike, qemu-devel, Kirti Wankhede, kraxel,
	pbonzini, bjsdjshi

On Mon, Aug 15, 2016 at 04:47:41PM -0600, Alex Williamson wrote:
> On Mon, 15 Aug 2016 12:59:08 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > On Mon, Aug 15, 2016 at 09:38:52AM +0000, Tian, Kevin wrote:
> > > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > > Sent: Saturday, August 13, 2016 8:37 AM
> > > > 
> > > > 
> > > > 
> > > > On 8/13/2016 2:46 AM, Alex Williamson wrote:  
> > > > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >  
> > > > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:  
> > > > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >>>
> > > > >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> > > > >>> the parent_device so it can call the start and stop ops callbacks
> > > > >>> respectively.  That seems to imply that all of instances for a given
> > > > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > > > >>> still having a hard time buying into the uuid+instance plan when it
> > > > >>> seems like each mdev_device should have an actual unique uuid.
> > > > >>> Userspace tools can figure out which uuids to start for a given user, I
> > > > >>> don't see much value in collecting them to instances within a uuid.
> > > > >>>  
> > > > >>
> > > > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > > > >> instance was introduced to support multiple devices in a VM.  
> > > > >
> > > > > The instance number was never required in order to support multiple
> > > > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > > > devices with that same UUID and therefore associate udev events to a
> > > > > given VM.  Only then does an instance number become necessary since the
> > > > > UUID needs to be static for a vGPUs within a VM.  This has always felt
> > > > > like a very dodgy solution when we should probably just be querying
> > > > > libvirt to give us a device to VM association.  
> > > 
> > > Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> > > for mdev in the basic design. It's bound to NVIDIA management stack too tightly.
> > > 
> > > I'm OK to give enough flexibility for various upper level management stacks,
> > > e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> > > option where either UUID or STRING could be optional? Upper management 
> > > stack can choose its own policy to identify a mdev:
> > > 
> > > a) $UUID only, so each mdev is allocated with a unique UUID
> > > b) STRING only, which could be an index (0, 1, 2, ...), or any combination 
> > > (vgpu0, vgpu1, etc.)
> > > c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> > > a numeric index
> > >   
> > > > >  
> > > > >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> > > > >> all instances of similar devices assigned to VM.
> > > > >>
> > > > >> For example, to create 2 devices:
> > > > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > > > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > > > >>
> > > > >> "$UUID-0" and "$UUID-1" devices are created.
> > > > >>
> > > > >> Commit resources for above devices with single 'mdev_start':
> > > > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > > > >>
> > > > >> Considering $UUID to be a unique UUID of a device, we don't need
> > > > >> 'instance', so 'mdev_create' would look like:
> > > > >>
> > > > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > > > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > > > >>
> > > > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > > > >> would be vendor specific parameters.
> > > > >>
> > > > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > > > >>
> > > > >> Then 'mdev_start' would be:
> > > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > > > >>
> > > > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > > > >>
> > > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop  
> > > > >
> > > > > I'm not sure a comma separated list makes sense here, for both
> > > > > simplicity in the kernel and more fine grained error reporting, we
> > > > > probably want to start/stop them individually.  Actually, why is it
> > > > > that we can't use the mediated device being opened and released to
> > > > > automatically signal to the backend vendor driver to commit and release
> > > > > resources? I don't fully understand why userspace needs this interface.  
> > > 
> > > There is a meaningful use of start/stop interface, as required in live
> > > migration support. Such interface allows vendor driver to quiesce
> > > mdev activity on source device before mdev hardware state is snapshot,
> > > and then resume mdev activity on dest device after its state is recovered.
> > > Intel has implemented experimental live migration support in KVMGT (soon
> > > to release), based on above two interfaces (plus another two to get/set
> > > mdev state).
> > >   
> > > > >  
> > > > 
> > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > some common resources.  
> > > 
> > > Kirti, can you elaborate the background about above one-shot commit
> > > requirement? It's hard to understand such a requirement. 
> > > 
> > > As I replied in another mail, I really hope start/stop become a per-mdev
> > > attribute instead of global one, e.g.:
> > > 
> > > echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > > 
> > > In many scenario the user space client may only want to talk to mdev
> > > instance directly, w/o need to contact its parent device. Still take
> > > live migration for example, I don't think Qemu wants to know parent
> > > device of assigned mdev instances.  
> > 
> > Hi Kevin,
> > 
> > Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
> > parent device. you can just do 
> > 
> > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > 
> > or 
> > 
> > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > 
> > without knowing the parent device.
> 
> That doesn't give an individual user the ability to stop and start
> their devices though, because in order for a user to have write
> permissions there, they get permission to DoS other users by pumping
> arbitrary UUIDs into those files.  By placing start/stop per mdev, we
> have mdev level granularity of granting start/stop privileges.  Really
> though, do we want QEMU fumbling around through sysfs or do we want an
> interface through the vfio API to perform start/stop?  Thanks,

Hi Alex,

With the current sysfs proposal, I don't think QEMU needs to do anything to
manage the virtual device lifecycle. That is the job of the upper-layer
management stack, which is responsible for creating the virtual device and
getting it ready for consumers like QEMU.

In terms of the VFIO API, I assume we would basically loop through all
virtual devices inside QEMU's vfio/pci.c, for example, and call start or
stop individually, right?

If that is the case, it doesn't change much from the current design for
handling the "one-shot" start requirement, and we would be adding another
dependency on the VFIO API.

So I think we should probably just focus on sysfs and make sure such an
interface works for us.

Thanks,
Neo

> 
> Alex
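
The two layouts under debate in this message can be put side by side with a
mock root; the paths and the per-mdev `start` attribute are assumptions taken
from this thread, not a merged interface:

```shell
#!/bin/sh
# Mock sysfs root; runs anywhere, no mdev module required.
SYS=$(mktemp -d)
UUID=12345678-1234-1234-1234-123456789abc
mkdir -p "$SYS/class/mdev/$UUID"

# (a) Global files: one shared entry point. Granting a user write access
# here lets them write *any* UUID, which is Alex's DoS concern.
echo "$UUID" > "$SYS/class/mdev/mdev_start"

# (b) Per-mdev attribute: each device carries its own start file, so
# plain file ownership/permissions delimit who may start or stop it.
echo 1 > "$SYS/class/mdev/$UUID/start"
chmod 600 "$SYS/class/mdev/$UUID/start"
```

With layout (b), delegating start/stop of one device to one user is an
ordinary `chown`, with no risk to other users' devices.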

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver
@ 2016-08-15 23:54                   ` Neo Jia
  0 siblings, 0 replies; 100+ messages in thread
From: Neo Jia @ 2016-08-15 23:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Song, Jike, kvm, qemu-devel, Kirti Wankhede, kraxel,
	pbonzini, bjsdjshi

On Mon, Aug 15, 2016 at 04:47:41PM -0600, Alex Williamson wrote:
> On Mon, 15 Aug 2016 12:59:08 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > On Mon, Aug 15, 2016 at 09:38:52AM +0000, Tian, Kevin wrote:
> > > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > > Sent: Saturday, August 13, 2016 8:37 AM
> > > > 
> > > > 
> > > > 
> > > > On 8/13/2016 2:46 AM, Alex Williamson wrote:  
> > > > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >  
> > > > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:  
> > > > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >>>
> > > > >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> > > > >>> the parent_device so it can call the start and stop ops callbacks
> > > > >>> respectively.  That seems to imply that all of instances for a given
> > > > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > > > >>> still having a hard time buying into the uuid+instance plan when it
> > > > >>> seems like each mdev_device should have an actual unique uuid.
> > > > >>> Userspace tools can figure out which uuids to start for a given user, I
> > > > >>> don't see much value in collecting them to instances within a uuid.
> > > > >>>  
> > > > >>
> > > > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > > > >> instance was introduced to support multiple devices in a VM.  
> > > > >
> > > > > The instance number was never required in order to support multiple
> > > > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > > > devices with that same UUID and therefore associate udev events to a
> > > > > given VM.  Only then does an instance number become necessary since the
> > > > > UUID needs to be static for a vGPUs within a VM.  This has always felt
> > > > > like a very dodgy solution when we should probably just be querying
> > > > > libvirt to give us a device to VM association.  
> > > 
> > > Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> > > for mdev in the basic design. It's bound to NVIDIA management stack too tightly.
> > > 
> > > I'm OK to give enough flexibility for various upper level management stacks,
> > > e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> > > option where either UUID or STRING could be optional? Upper management 
> > > stack can choose its own policy to identify a mdev:
> > > 
> > > a) $UUID only, so each mdev is allocated with a unique UUID
> > > b) STRING only, which could be an index (0, 1, 2, ...), or any combination 
> > > (vgpu0, vgpu1, etc.)
> > > c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> > > a numeric index
> > >   
> > > > >  
> > > > >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> > > > >> all instances of similar devices assigned to VM.
> > > > >>
> > > > >> For example, to create 2 devices:
> > > > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > > > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > > > >>
> > > > >> "$UUID-0" and "$UUID-1" devices are created.
> > > > >>
> > > > >> Commit resources for above devices with single 'mdev_start':
> > > > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > > > >>
> > > > >> Considering $UUID to be a unique UUID of a device, we don't need
> > > > >> 'instance', so 'mdev_create' would look like:
> > > > >>
> > > > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > > > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > > > >>
> > > > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > > > >> would be vendor specific parameters.
> > > > >>
> > > > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > > > >>
> > > > >> Then 'mdev_start' would be:
> > > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > > > >>
> > > > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > > > >>
> > > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop  
> > > > >
> > > > > I'm not sure a comma separated list makes sense here, for both
> > > > > simplicity in the kernel and more fine grained error reporting, we
> > > > > probably want to start/stop them individually.  Actually, why is it
> > > > > that we can't use the mediated device being opened and released to
> > > > > automatically signal to the backend vendor driver to commit and release
> > > > > resources? I don't fully understand why userspace needs this interface.  
> > > 
> > > There is a meaningful use of start/stop interface, as required in live
> > > migration support. Such interface allows vendor driver to quiesce
> > > mdev activity on source device before mdev hardware state is snapshot,
> > > and then resume mdev activity on dest device after its state is recovered.
> > > Intel has implemented experimental live migration support in KVMGT (soon
> > > to release), based on above two interfaces (plus another two to get/set
> > > mdev state).
> > >   
> > > > >  
> > > > 
> > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > some common resources.  
> > > 
> > > Kirti, can you elaborate the background about above one-shot commit
> > > requirement? It's hard to understand such a requirement. 
> > > 
> > > As I replied in another mail, I really hope start/stop become a per-mdev
> > > attribute instead of global one, e.g.:
> > > 
> > > echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > > 
> > > In many scenario the user space client may only want to talk to mdev
> > > instance directly, w/o need to contact its parent device. Still take
> > > live migration for example, I don't think Qemu wants to know parent
> > > device of assigned mdev instances.  
> > 
> > Hi Kevin,
> > 
> > Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
> > parent device. you can just do 
> > 
> > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > 
> > or 
> > 
> > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > 
> > without knowing the parent device.
> 
> That doesn't give an individual user the ability to stop and start
> their devices though, because in order for a user to have write
> permissions there, they get permission to DoS other users by pumping
> arbitrary UUIDs into those files.  By placing start/stop per mdev, we
> have mdev level granularity of granting start/stop privileges.  Really
> though, do we want QEMU fumbling around through sysfs or do we want an
> interface through the vfio API to perform start/stop?  Thanks,

Hi Alex,

With the current sysfs proposal, I don't think QEMU needs to do anything to
manage the virtual device lifecycle. That is the job of the upper-layer
management stack, which is responsible for creating the virtual device and
getting it ready for consumers like QEMU.

In terms of the VFIO API, I assume we would basically loop through all
virtual devices inside QEMU's vfio/pci.c, for example, and call start or
stop individually, right?

If that is the case, it doesn't change much from the current design for
handling the "one-shot" start requirement, and we would be adding another
dependency on the VFIO API.

So I think we should probably just focus on sysfs and make sure such an
interface works for us.

Thanks,
Neo

> 
> Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-15 22:47               ` Alex Williamson
@ 2016-08-16  0:18                   ` Tian, Kevin
  2016-08-16  0:18                   ` [Qemu-devel] " Tian, Kevin
  2016-08-16 20:30                 ` Neo Jia
  2 siblings, 0 replies; 100+ messages in thread
From: Tian, Kevin @ 2016-08-16  0:18 UTC (permalink / raw)
  To: Alex Williamson, Neo Jia
  Cc: Song, Jike, kvm, qemu-devel, Kirti Wankhede, kraxel, pbonzini, bjsdjshi

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Tuesday, August 16, 2016 6:48 AM
> 
> On Mon, 15 Aug 2016 12:59:08 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > On Mon, Aug 15, 2016 at 09:38:52AM +0000, Tian, Kevin wrote:
> > > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > > Sent: Saturday, August 13, 2016 8:37 AM
> > > >
> > > >
> > > >
> > > > On 8/13/2016 2:46 AM, Alex Williamson wrote:
> > > > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >
> > > > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:
> > > > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > >>>
> > > > >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> > > > >>> the parent_device so it can call the start and stop ops callbacks
> > > > >>> respectively.  That seems to imply that all of instances for a given
> > > > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > > > >>> still having a hard time buying into the uuid+instance plan when it
> > > > >>> seems like each mdev_device should have an actual unique uuid.
> > > > >>> Userspace tools can figure out which uuids to start for a given user, I
> > > > >>> don't see much value in collecting them to instances within a uuid.
> > > > >>>
> > > > >>
> > > > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > > > >> instance was introduced to support multiple devices in a VM.
> > > > >
> > > > > The instance number was never required in order to support multiple
> > > > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > > > devices with that same UUID and therefore associate udev events to a
> > > > > given VM.  Only then does an instance number become necessary since the
> > > > > UUID needs to be static for a vGPUs within a VM.  This has always felt
> > > > > like a very dodgy solution when we should probably just be querying
> > > > > libvirt to give us a device to VM association.
> > >
> > > Agree with Alex here. We'd better not assume that UUID will be a VM_UUID
> > > for mdev in the basic design. It's bound to NVIDIA management stack too tightly.
> > >
> > > I'm OK to give enough flexibility for various upper level management stacks,
> > > e.g. instead of $UUID+INDEX style, would $UUID+STRING provide a better
> > > option where either UUID or STRING could be optional? Upper management
> > > stack can choose its own policy to identify a mdev:
> > >
> > > a) $UUID only, so each mdev is allocated with a unique UUID
> > > b) STRING only, which could be an index (0, 1, 2, ...), or any combination
> > > (vgpu0, vgpu1, etc.)
> > > c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> > > a numeric index
> > >
> > > > >
> > > > >> 'mdev_create' creates device and 'mdev_start' is to commit resources of
> > > > >> all instances of similar devices assigned to VM.
> > > > >>
> > > > >> For example, to create 2 devices:
> > > > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > > > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > > > >>
> > > > >> "$UUID-0" and "$UUID-1" devices are created.
> > > > >>
> > > > >> Commit resources for above devices with single 'mdev_start':
> > > > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > > > >>
> > > > >> Considering $UUID to be a unique UUID of a device, we don't need
> > > > >> 'instance', so 'mdev_create' would look like:
> > > > >>
> > > > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > > > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > > > >>
> > > > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > > > >> would be vendor specific parameters.
> > > > >>
> > > > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > > > >>
> > > > >> Then 'mdev_start' would be:
> > > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > > > >>
> > > > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > > > >>
> > > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop
> > > > >
> > > > > I'm not sure a comma separated list makes sense here, for both
> > > > > simplicity in the kernel and more fine grained error reporting, we
> > > > > probably want to start/stop them individually.  Actually, why is it
> > > > > that we can't use the mediated device being opened and released to
> > > > > automatically signal to the backend vendor driver to commit and release
> > > > > resources? I don't fully understand why userspace needs this interface.
> > >
> > > There is a meaningful use of start/stop interface, as required in live
> > > migration support. Such interface allows vendor driver to quiesce
> > > mdev activity on source device before mdev hardware state is snapshot,
> > > and then resume mdev activity on dest device after its state is recovered.
> > > Intel has implemented experimental live migration support in KVMGT (soon
> > > to release), based on above two interfaces (plus another two to get/set
> > > mdev state).
> > >
> > > > >
> > > >
> > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > some common resources.
> > >
> > > Kirti, can you elaborate the background about above one-shot commit
> > > requirement? It's hard to understand such a requirement.
> > >
> > > As I replied in another mail, I really hope start/stop become a per-mdev
> > > attribute instead of global one, e.g.:
> > >
> > > echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > >
> > > In many scenario the user space client may only want to talk to mdev
> > > instance directly, w/o need to contact its parent device. Still take
> > > live migration for example, I don't think Qemu wants to know parent
> > > device of assigned mdev instances.
> >
> > Hi Kevin,
> >
> > Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
> > parent device. you can just do
> >
> > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> >
> > or
> >
> > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> >
> > without knowing the parent device.
> 
> That doesn't give an individual user the ability to stop and start
> their devices though, because granting a user write permission there
> also grants them the ability to DoS other users by pumping arbitrary
> UUIDs into those files.  By placing start/stop per mdev, we get
> mdev-level granularity when granting start/stop privileges.  Really

Agree.

> though, do we want QEMU fumbling around through sysfs or do we want an
> interface through the vfio API to perform start/stop?  Thanks,
> 

Either way looks OK. If we think of mdev as a vfio feature, the vfio API
seems a good fit to carry those mdev attributes. On the other hand, if we
think of mdev as a standalone component (vfio being just one kernel
implementation), using sysfs might be more generic (e.g. in case mdev is
split from vfio in the future), similar to other physical devices.
Another limitation of the vfio API is that a vendor driver cannot deliver
additional capabilities without updating the kernel vfio driver...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 100+ messages in thread

* RE: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-15 19:59               ` [Qemu-devel] " Neo Jia
@ 2016-08-16  0:30                 ` Tian, Kevin
  -1 siblings, 0 replies; 100+ messages in thread
From: Tian, Kevin @ 2016-08-16  0:30 UTC (permalink / raw)
  To: Neo Jia
  Cc: Kirti Wankhede, Alex Williamson, pbonzini, kraxel, qemu-devel,
	kvm, Song, Jike, bjsdjshi

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Tuesday, August 16, 2016 3:59 AM
> 
> On Mon, Aug 15, 2016 at 09:38:52AM +0000, Tian, Kevin wrote:
> > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > Sent: Saturday, August 13, 2016 8:37 AM
> > >
> > >
> > >
> > > On 8/13/2016 2:46 AM, Alex Williamson wrote:
> > > > On Sat, 13 Aug 2016 00:14:39 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >
> > > >> On 8/10/2016 12:30 AM, Alex Williamson wrote:
> > > >>> On Thu, 4 Aug 2016 00:33:51 +0530
> > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >>>
> > > >>> This is used later by mdev_device_start() and mdev_device_stop() to get
> > > >>> the parent_device so it can call the start and stop ops callbacks
> > > >>> respectively.  That seems to imply that all instances for a given
> > > >>> uuid come from the same parent_device.  Where is that enforced?  I'm
> > > >>> still having a hard time buying into the uuid+instance plan when it
> > > >>> seems like each mdev_device should have an actual unique uuid.
> > > >>> Userspace tools can figure out which uuids to start for a given user, I
> > > >>> don't see much value in collecting them to instances within a uuid.
> > > >>>
> > > >>
> > > >> Initially we started discussion with VM_UUID+instance suggestion, where
> > > >> instance was introduced to support multiple devices in a VM.
> > > >
> > > > The instance number was never required in order to support multiple
> > > > devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> > > > management tools which wanted to re-use the VM UUID by creating vGPU
> > > > devices with that same UUID and therefore associate udev events to a
> > > > given VM.  Only then does an instance number become necessary since the
> > > > UUID needs to be static for the vGPUs within a VM.  This has always felt
> > > > like a very dodgy solution when we should probably just be querying
> > > > libvirt to give us a device to VM association.
> >
> > Agree with Alex here. We'd better not assume that the UUID will be a VM_UUID
> > for mdev in the basic design; that binds it too tightly to the NVIDIA management stack.
> >
> > I'm OK with giving enough flexibility to various upper level management stacks,
> > e.g. instead of the $UUID+INDEX style, would $UUID+STRING provide a better
> > option, where either UUID or STRING could be optional? The upper management
> > stack can then choose its own policy to identify an mdev:
> >
> > a) $UUID only, so each mdev is allocated with a unique UUID
> > b) STRING only, which could be an index (0, 1, 2, ...), or any combination
> > (vgpu0, vgpu1, etc.)
> > c) $UUID+STRING, where UUID could be a VM UUID, and STRING could be
> > a numeric index
> >
> > > >
> > > >> 'mdev_create' creates a device and 'mdev_start' commits resources of
> > > >> all instances of similar devices assigned to a VM.
> > > >>
> > > >> For example, to create 2 devices:
> > > >> # echo "$UUID:0:params" > /sys/devices/../mdev_create
> > > >> # echo "$UUID:1:params" > /sys/devices/../mdev_create
> > > >>
> > > >> "$UUID-0" and "$UUID-1" devices are created.
> > > >>
> > > >> Commit resources for above devices with single 'mdev_start':
> > > >> # echo "$UUID" > /sys/class/mdev/mdev_start
> > > >>
> > > >> Considering $UUID to be a unique UUID of a device, we don't need
> > > >> 'instance', so 'mdev_create' would look like:
> > > >>
> > > >> # echo "$UUID1:params" > /sys/devices/../mdev_create
> > > >> # echo "$UUID2:params" > /sys/devices/../mdev_create
> > > >>
> > > >> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
> > > >> would be vendor specific parameters.
> > > >>
> > > >> Device nodes would be created as "$UUID1" and "$UUID2"
> > > >>
> > > >> Then 'mdev_start' would be:
> > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
> > > >>
> > > >> Similarly 'mdev_stop' and 'mdev_destroy' would be:
> > > >>
> > > >> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop
> > > >
> > > > I'm not sure a comma separated list makes sense here, for both
> > > > simplicity in the kernel and more fine grained error reporting, we
> > > > probably want to start/stop them individually.  Actually, why is it
> > > > that we can't use the mediated device being opened and released to
> > > > automatically signal to the backend vendor driver to commit and release
> > > > resources? I don't fully understand why userspace needs this interface.
> >
> > There is a meaningful use for a start/stop interface, as required for live
> > migration support. Such an interface allows the vendor driver to quiesce
> > mdev activity on the source device before the mdev hardware state is
> > snapshotted, and then resume mdev activity on the dest device after its
> > state is recovered.
> > Intel has implemented experimental live migration support in KVMGT (soon
> > to release), based on above two interfaces (plus another two to get/set
> > mdev state).
> >
> > > >
> > >
> > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > some common resources.
> >
> > Kirti, can you elaborate the background about above one-shot commit
> > requirement? It's hard to understand such a requirement.
> >
> > As I replied in another mail, I really hope start/stop becomes a per-mdev
> > attribute instead of a global one, e.g.:
> >
> > echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> >
> > In many scenarios the user space client may only want to talk to an mdev
> > instance directly, without needing to contact its parent device. Again
> > taking live migration as an example, I don't think Qemu wants to know the
> > parent device of assigned mdev instances.
> 
> Hi Kevin,
> 
> Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
> the parent device. You can just do
> 
> echo "mdev_UUID" > /sys/class/mdev/mdev_start
> 
> or
> 
> echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> 
> without knowing the parent device.
> 

You can look at some existing sysfs examples, e.g.:

echo "0/1" > /sys/bus/cpu/devices/cpu1/online

You may also ask why we don't use a global style:

echo "cpu1" > /sys/bus/cpu/devices/cpu_online
echo "cpu1" > /sys/bus/cpu/devices/cpu_offline

There are many similar examples...
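For illustration, the same per-device pattern applied to mdev would look like the sketch below; the paths are hypothetical, mirroring the cpu example and the per-mdev proposal above, not an existing kernel interface:

```shell
# Hypothetical per-mdev attribute, analogous to /sys/bus/cpu/devices/cpu1/online.
# Write permission can be granted on exactly one device's node, so a user
# can only start/stop the mdevs they own:
chown someuser /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start

echo 1 > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start  # start this mdev
echo 0 > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start  # stop this mdev
```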

Thanks
Kevin

^ permalink raw reply	[flat|nested] 100+ messages in thread

* RE: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-15 23:23                     ` [Qemu-devel] " Neo Jia
@ 2016-08-16  0:49                       ` Tian, Kevin
  -1 siblings, 0 replies; 100+ messages in thread
From: Tian, Kevin @ 2016-08-16  0:49 UTC (permalink / raw)
  To: Neo Jia, Alex Williamson
  Cc: Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm, Song, Jike, bjsdjshi

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Tuesday, August 16, 2016 7:24 AM
> 
> > > >
> > > > > >
> > > > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > > > some common resources.
> > > > >
> > > > > Kirti, can you elaborate the background about above one-shot commit
> > > > > requirement? It's hard to understand such a requirement.
> > > >
> > > > Agree, I know NVIDIA isn't planning to support hotplug initially, but
> > > > this seems like we're precluding hotplug from the design.  I don't
> > > > understand what's driving this one-shot requirement.
> > >
> > > Hi Alex,
> > >
> > > The requirement here is based on how our internal vGPU device model is designed,
> > > and with this we are able to pre-allocate resources required for multiple virtual
> > > devices within the same domain.
> > >
> > > And I don't think this syntax will stop us from supporting hotplug at all.
> > >
> > > For example, you can always create a virtual mdev and then do
> > >
> > > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > >
> > > then use QEMU monitor to add the device for hotplug.
> >
> > Hi Neo,
> >
> > I'm still not understanding the advantage you get from the "one-shot"
> > approach then if we can always add more mdevs by starting them later.
> > Are the hotplug mdevs somehow less capable than the initial set of
> > mdevs added in a single shot?  If the initial set is allocated
> > from the "same domain", does that give them some sort of hardware
> > locality/resource benefit?  Thanks,
> 
> Hi Alex,
> 
> At least we will not be able to guarantee certain special hardware resources for
> the hotplug devices.
> 
> So from our point of view, we also have a dedicated internal SW entity to manage all
> virtual devices for each "domain/virtual machine", and that SW entity will be created
> at virtual device start time.
> 
> This is why we need to do this in one-shot to support multiple virtual device
> per VM case.
> 

Is pre-allocation of the special hardware resources done once for all mdev instances?
Can it be done one-by-one, as long as each mdev is started before the VM is launched?

If such a one-shot requirement really is necessary, it would be cleaner to me to
introduce an mdev group concept, so mdev instances with one-shot start
requirements can be put under an mdev group. Then you can do a one-shot start
by:

echo "0/1" > /sys/class/mdev/group/0/start
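A hypothetical end-to-end sketch of that group interface (every path and attribute below follows the proposal above and is illustrative only, not an existing kernel ABI):

```shell
# create two mdevs on the parent device, as earlier in the thread
echo "$UUID1:params" > /sys/devices/../mdev_create
echo "$UUID2:params" > /sys/devices/../mdev_create

# place both instances under hypothetical mdev group 0
echo "$UUID1" > /sys/class/mdev/group/0/add
echo "$UUID2" > /sys/class/mdev/group/0/add

# one-shot start: resources for all group members are committed together
echo 1 > /sys/class/mdev/group/0/start

# one-shot stop of the whole group
echo 0 > /sys/class/mdev/group/0/start
```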

Thanks
Kevin

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-16  0:30                 ` [Qemu-devel] " Tian, Kevin
@ 2016-08-16  3:45                   ` Neo Jia
  -1 siblings, 0 replies; 100+ messages in thread
From: Neo Jia @ 2016-08-16  3:45 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, Alex Williamson, pbonzini, kraxel, qemu-devel,
	kvm, Song, Jike, bjsdjshi

On Tue, Aug 16, 2016 at 12:30:25AM +0000, Tian, Kevin wrote:
> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Tuesday, August 16, 2016 3:59 AM

> > > >
> > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > some common resources.
> > >
> > > Kirti, can you elaborate the background about above one-shot commit
> > > requirement? It's hard to understand such a requirement.
> > >
> > > As I relied in another mail, I really hope start/stop become a per-mdev
> > > attribute instead of global one, e.g.:
> > >
> > > echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > >
> > > In many scenario the user space client may only want to talk to mdev
> > > instance directly, w/o need to contact its parent device. Still take
> > > live migration for example, I don't think Qemu wants to know parent
> > > device of assigned mdev instances.
> > 
> > Hi Kevin,
> > 
> > Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
> > the parent device. You can just do
> > 
> > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > 
> > or
> > 
> > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > 
> > without knowing the parent device.
> > 
> 
> You can look at some existing sysfs example, e.g.:
> 
> echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> 
> You may also argue why not using a global style:
> 
> echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> 
> There are many similar examples...

Hi Kevin,

My response above addresses your question about using the global sysfs entry,
since you didn't want the global path because

"I don't think Qemu wants to know parent device of assigned mdev instances.".

So I just want to confirm with you (in case you missed it) that:

    /sys/class/mdev/mdev_start | mdev_stop 

doesn't require the knowledge of parent device.

Thanks,
Neo

> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 100+ messages in thread

* RE: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-16  3:45                   ` [Qemu-devel] " Neo Jia
@ 2016-08-16  3:50                     ` Tian, Kevin
  -1 siblings, 0 replies; 100+ messages in thread
From: Tian, Kevin @ 2016-08-16  3:50 UTC (permalink / raw)
  To: Neo Jia
  Cc: Kirti Wankhede, Alex Williamson, pbonzini, kraxel, qemu-devel,
	kvm, Song, Jike, bjsdjshi

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Tuesday, August 16, 2016 11:46 AM
> 
> On Tue, Aug 16, 2016 at 12:30:25AM +0000, Tian, Kevin wrote:
> > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > Sent: Tuesday, August 16, 2016 3:59 AM
> 
> > > > >
> > > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > > some common resources.
> > > >
> > > > Kirti, can you elaborate the background about above one-shot commit
> > > > requirement? It's hard to understand such a requirement.
> > > >
> > > > As I relied in another mail, I really hope start/stop become a per-mdev
> > > > attribute instead of global one, e.g.:
> > > >
> > > > echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > > >
> > > > In many scenario the user space client may only want to talk to mdev
> > > > instance directly, w/o need to contact its parent device. Still take
> > > > live migration for example, I don't think Qemu wants to know parent
> > > > device of assigned mdev instances.
> > >
> > > Hi Kevin,
> > >
> > > Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
> > > parent device. you can just do
> > >
> > > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > >
> > > or
> > >
> > > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > >
> > > without knowing the parent device.
> > >
> >
> > You can look at some existing sysfs example, e.g.:
> >
> > echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> >
> > You may also argue why not using a global style:
> >
> > echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> > echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> >
> > There are many similar examples...
> 
> Hi Kevin,
> 
> My response above is to your question about using the global sysfs entry as you
> don't want to have the global path because
> 
> "I don't think Qemu wants to know parent device of assigned mdev instances.".
> 
> So I just want to confirm with you that (in case you miss):
> 
>     /sys/class/mdev/mdev_start | mdev_stop
> 
> doesn't require the knowledge of parent device.
> 

Qemu is just one example. Your explanation of the parent device makes sense
there, but it's still not good for Qemu to poke at /sys/class/mdev directly.
Qemu is passed the actual sysfs path of the assigned mdev instance, so any
mdev attributes touched by Qemu should be put under that node (e.g. start/stop
for the live migration usage I explained earlier).

Thanks
Kevin

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-16  3:50                     ` [Qemu-devel] " Tian, Kevin
@ 2016-08-16  4:16                       ` Neo Jia
  -1 siblings, 0 replies; 100+ messages in thread
From: Neo Jia @ 2016-08-16  4:16 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, Alex Williamson, pbonzini, kraxel, qemu-devel,
	kvm, Song, Jike, bjsdjshi

On Tue, Aug 16, 2016 at 03:50:44AM +0000, Tian, Kevin wrote:
> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Tuesday, August 16, 2016 11:46 AM
> > 
> > On Tue, Aug 16, 2016 at 12:30:25AM +0000, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > Sent: Tuesday, August 16, 2016 3:59 AM
> > 
> > > > > >
> > > > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > > > some common resources.
> > > > >
> > > > > Kirti, can you elaborate the background about above one-shot commit
> > > > > requirement? It's hard to understand such a requirement.
> > > > >
> > > > > As I relied in another mail, I really hope start/stop become a per-mdev
> > > > > attribute instead of global one, e.g.:
> > > > >
> > > > > echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > > > >
> > > > > In many scenario the user space client may only want to talk to mdev
> > > > > instance directly, w/o need to contact its parent device. Still take
> > > > > live migration for example, I don't think Qemu wants to know parent
> > > > > device of assigned mdev instances.
> > > >
> > > > Hi Kevin,
> > > >
> > > > Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
> > > > parent device. you can just do
> > > >
> > > > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > > >
> > > > or
> > > >
> > > > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > > >
> > > > without knowing the parent device.
> > > >
> > >
> > > You can look at some existing sysfs example, e.g.:
> > >
> > > echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> > >
> > > You may also argue why not using a global style:
> > >
> > > echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> > > echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> > >
> > > There are many similar examples...
> > 
> > Hi Kevin,
> > 
> > My response above is to your question about using the global sysfs entry as you
> > don't want to have the global path because
> > 
> > "I don't think Qemu wants to know parent device of assigned mdev instances.".
> > 
> > So I just want to confirm with you that (in case you miss):
> > 
> >     /sys/class/mdev/mdev_start | mdev_stop
> > 
> > doesn't require the knowledge of parent device.
> > 
> 
> Qemu is just one example, where your explanation of parent device
> makes sense but still it's not good for Qemu to populate /sys/class/mdev
> directly. Qemu is passed with the actual sysfs path of assigned mdev
> instance, so any mdev attributes touched by Qemu should be put under 
> that node (e.g. start/stop for live migration usage as I explained earlier).

Exactly, QEMU is passed the actual sysfs path.

So QEMU doesn't touch /sys/class/mdev/mdev_start or /sys/class/mdev/mdev_stop
at all.

QEMU will take the sysfs path as input:

 -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0,id=vgpu0

As you are saying, for live migration QEMU needs to access "start" and "stop".
Could you please share more details, such as how QEMU accesses the "start" and
"stop" sysfs files, and when and what triggers that?

Thanks,
Neo

> 

> Thanks
> Kevin

* RE: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-16  4:16                       ` [Qemu-devel] " Neo Jia
@ 2016-08-16  4:52                         ` Tian, Kevin
  -1 siblings, 0 replies; 100+ messages in thread
From: Tian, Kevin @ 2016-08-16  4:52 UTC (permalink / raw)
  To: Neo Jia
  Cc: Kirti Wankhede, Alex Williamson, pbonzini, kraxel, qemu-devel,
	kvm, Song, Jike, bjsdjshi

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Tuesday, August 16, 2016 12:17 PM
> 
> On Tue, Aug 16, 2016 at 03:50:44AM +0000, Tian, Kevin wrote:
> > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > Sent: Tuesday, August 16, 2016 11:46 AM
> > >
> > > On Tue, Aug 16, 2016 at 12:30:25AM +0000, Tian, Kevin wrote:
> > > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > > Sent: Tuesday, August 16, 2016 3:59 AM
> > >
> > > > > > >
> > > > > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > > > > some common resources.
> > > > > >
> > > > > > Kirti, can you elaborate the background about above one-shot commit
> > > > > > requirement? It's hard to understand such a requirement.
> > > > > >
> > > > > > As I relied in another mail, I really hope start/stop become a per-mdev
> > > > > > attribute instead of global one, e.g.:
> > > > > >
> > > > > > echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > > > > >
> > > > > > In many scenario the user space client may only want to talk to mdev
> > > > > > instance directly, w/o need to contact its parent device. Still take
> > > > > > live migration for example, I don't think Qemu wants to know parent
> > > > > > device of assigned mdev instances.
> > > > >
> > > > > Hi Kevin,
> > > > >
> > > > > Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
> > > > > parent device. you can just do
> > > > >
> > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > > > >
> > > > > or
> > > > >
> > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > > > >
> > > > > without knowing the parent device.
> > > > >
> > > >
> > > > You can look at some existing sysfs example, e.g.:
> > > >
> > > > echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> > > >
> > > > You may also argue why not using a global style:
> > > >
> > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> > > >
> > > > There are many similar examples...
> > >
> > > Hi Kevin,
> > >
> > > My response above is to your question about using the global sysfs entry as you
> > > don't want to have the global path because
> > >
> > > "I don't think Qemu wants to know parent device of assigned mdev instances.".
> > >
> > > So I just want to confirm with you that (in case you miss):
> > >
> > >     /sys/class/mdev/mdev_start | mdev_stop
> > >
> > > doesn't require the knowledge of parent device.
> > >
> >
> > Qemu is just one example, where your explanation of parent device
> > makes sense but still it's not good for Qemu to populate /sys/class/mdev
> > directly. Qemu is passed with the actual sysfs path of assigned mdev
> > instance, so any mdev attributes touched by Qemu should be put under
> > that node (e.g. start/stop for live migration usage as I explained earlier).
> 
> Exactly, qemu is passed with the actual sysfs path.
> 
> So, QEMU doesn't touch the file /sys/class/mdev/mdev_start | mdev_stop at all.
> 
> QEMU will take the sysfs path as input:
> 
>  -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0,id=vgpu0

There is no need to pass "id=vgpu0" here. If necessary you can expose the id
as an attribute under the sysfs mdev node:

/sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/id

> 
> As you are saying in live migration, QEMU needs to access "start" and "stop".  Could you
> please share more details, such as how QEMU access the "start" and "stop" sysfs,
> when and what triggers that?
> 

A conceptual flow:

1. Quiesce mdev activity on the parent device (e.g. stop scheduling, wait for
in-flight DMA to complete, etc.):

echo "0" > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/start

2. Save mdev state:

cat /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/state > xxx

3. xxx becomes part of the final VM image and is copied to a new machine

4. Allocate/prepare an mdev on the new machine for this VM

5. Restore mdev state:

cat xxx > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/state
(might be a different path name)

6. Start the mdev on the new parent device:

echo "1" > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/start
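For illustration, the six steps above could be wrapped in a small helper
script. This is only a sketch of the proposed interface: the per-mdev "start"
and "state" attributes and the device path are taken from the example commands
above and are not an existing kernel ABI.

```shell
#!/bin/sh
# Hypothetical helper for the proposed per-mdev save/restore flow.
# Assumes "start" and "state" attributes under the mdev device node.
MDEV=/sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0

save_mdev() {
    echo 0 > "$MDEV/start"        # steps 1: quiesce activity on the parent
    cat "$MDEV/state" > "$1"      # step 2: save device state into the image
}

restore_mdev() {
    cat "$1" > "$MDEV/state"      # step 5: restore state on the new machine
    echo 1 > "$MDEV/start"        # step 6: resume the mdev
}
```

On the source host the management layer would run `save_mdev img` before
copying the VM image; on the destination, `restore_mdev img` after the mdev is
allocated (step 4).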

Thanks
Kevin

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-16  4:52                         ` [Qemu-devel] " Tian, Kevin
@ 2016-08-16  5:43                           ` Neo Jia
  -1 siblings, 0 replies; 100+ messages in thread
From: Neo Jia @ 2016-08-16  5:43 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Song, Jike, Alex Williamson, kvm, qemu-devel, Kirti Wankhede,
	kraxel, pbonzini, bjsdjshi

On Tue, Aug 16, 2016 at 04:52:30AM +0000, Tian, Kevin wrote:
> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Tuesday, August 16, 2016 12:17 PM
> > 
> > On Tue, Aug 16, 2016 at 03:50:44AM +0000, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > Sent: Tuesday, August 16, 2016 11:46 AM
> > > >
> > > > On Tue, Aug 16, 2016 at 12:30:25AM +0000, Tian, Kevin wrote:
> > > > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > > > Sent: Tuesday, August 16, 2016 3:59 AM
> > > >
> > > > > > > >
> > > > > > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > > > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > > > > > some common resources.
> > > > > > >
> > > > > > > Kirti, can you elaborate the background about above one-shot commit
> > > > > > > requirement? It's hard to understand such a requirement.
> > > > > > >
> > > > > > > As I relied in another mail, I really hope start/stop become a per-mdev
> > > > > > > attribute instead of global one, e.g.:
> > > > > > >
> > > > > > > echo "0/1" >
> > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > > > > > >
> > > > > > > In many scenario the user space client may only want to talk to mdev
> > > > > > > instance directly, w/o need to contact its parent device. Still take
> > > > > > > live migration for example, I don't think Qemu wants to know parent
> > > > > > > device of assigned mdev instances.
> > > > > >
> > > > > > Hi Kevin,
> > > > > >
> > > > > > Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
> > > > > > parent device. you can just do
> > > > > >
> > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > > > > >
> > > > > > or
> > > > > >
> > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > > > > >
> > > > > > without knowing the parent device.
> > > > > >
> > > > >
> > > > > You can look at some existing sysfs example, e.g.:
> > > > >
> > > > > echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> > > > >
> > > > > You may also argue why not using a global style:
> > > > >
> > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> > > > >
> > > > > There are many similar examples...
> > > >
> > > > Hi Kevin,
> > > >
> > > > My response above is to your question about using the global sysfs entry as you
> > > > don't want to have the global path because
> > > >
> > > > "I don't think Qemu wants to know parent device of assigned mdev instances.".
> > > >
> > > > So I just want to confirm with you that (in case you miss):
> > > >
> > > >     /sys/class/mdev/mdev_start | mdev_stop
> > > >
> > > > doesn't require the knowledge of parent device.
> > > >
> > >
> > > Qemu is just one example, where your explanation of parent device
> > > makes sense but still it's not good for Qemu to populate /sys/class/mdev
> > > directly. Qemu is passed with the actual sysfs path of assigned mdev
> > > instance, so any mdev attributes touched by Qemu should be put under
> > > that node (e.g. start/stop for live migration usage as I explained earlier).
> > 
> > Exactly, qemu is passed with the actual sysfs path.
> > 
> > So, QEMU doesn't touch the file /sys/class/mdev/mdev_start | mdev_stop at all.
> > 
> > QEMU will take the sysfs path as input:
> > 
> >  -device
> > vfio-pci,sysfsdev=/sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0,i
> > d=vgpu0
> 
> no need of passing "id=vgpu0" here. If necessary you can put id as an attribute 
> under sysfs mdev node:
> 
> /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/id

I think we have moved away from the device index following Alex's comment, so
the device path will be:

 /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818

> 
> > 
> > As you are saying in live migration, QEMU needs to access "start" and "stop".  Could you
> > please share more details, such as how QEMU access the "start" and "stop" sysfs,
> > when and what triggers that?
> > 
> 
> A conceptual flow as below:
> 
> 1. Quiescent mdev activity on the parent device (e.g. stop scheduling, wait for
> in-flight DMA completed, etc.)
> 
> echo "0" > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/start
> 
> 2. Save mdev state:
> 
> cat /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/state > xxx
> 
> 3. xxx will be part of the final VM image and copied to a new machine
> 
> 4. Allocate/prepare mdev on the new machine for this VM
> 
> 5. Restore mdev state:
> 
> cat xxx > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/state
> (might be a different path name)
> 
> 6. start mdev on the new parent device:
> 
> echo "1" > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/start

Thanks for the sequence. Based on the live migration flow above, the accesses
to "start/stop" come from another user-space program, not the QEMU process.

(Just to be clear, I am not saying that I won't consider your suggestion of
moving the "start/stop" files from the global location to the per-mdev node,
but I do want to point out that keeping them global shouldn't impact your live
migration sequence above.)

Thanks,
Neo

> 
> Thanks
> Kevin

* Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver
@ 2016-08-16  5:43                           ` Neo Jia
  0 siblings, 0 replies; 100+ messages in thread
From: Neo Jia @ 2016-08-16  5:43 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, Alex Williamson, pbonzini, kraxel, qemu-devel,
	kvm, Song, Jike, bjsdjshi

On Tue, Aug 16, 2016 at 04:52:30AM +0000, Tian, Kevin wrote:
> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Tuesday, August 16, 2016 12:17 PM
> > 
> > On Tue, Aug 16, 2016 at 03:50:44AM +0000, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > Sent: Tuesday, August 16, 2016 11:46 AM
> > > >
> > > > On Tue, Aug 16, 2016 at 12:30:25AM +0000, Tian, Kevin wrote:
> > > > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > > > Sent: Tuesday, August 16, 2016 3:59 AM
> > > >
> > > > > > > >
> > > > > > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > > > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > > > > > some common resources.
> > > > > > >
> > > > > > > Kirti, can you elaborate the background about above one-shot commit
> > > > > > > requirement? It's hard to understand such a requirement.
> > > > > > >
> > > > > > > As I relied in another mail, I really hope start/stop become a per-mdev
> > > > > > > attribute instead of global one, e.g.:
> > > > > > >
> > > > > > > echo "0/1" >
> > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > > > > > >
> > > > > > > In many scenario the user space client may only want to talk to mdev
> > > > > > > instance directly, w/o need to contact its parent device. Still take
> > > > > > > live migration for example, I don't think Qemu wants to know parent
> > > > > > > device of assigned mdev instances.
> > > > > >
> > > > > > Hi Kevin,
> > > > > >
> > > > > > Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
> > > > > > parent device. you can just do
> > > > > >
> > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > > > > >
> > > > > > or
> > > > > >
> > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > > > > >
> > > > > > without knowing the parent device.
> > > > > >
> > > > >
> > > > > You can look at some existing sysfs example, e.g.:
> > > > >
> > > > > echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> > > > >
> > > > > You may also argue why not using a global style:
> > > > >
> > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> > > > >
> > > > > There are many similar examples...
> > > >
> > > > Hi Kevin,
> > > >
> > > > My response above is to your question about using the global sysfs entry as you
> > > > don't want to have the global path because
> > > >
> > > > "I don't think Qemu wants to know parent device of assigned mdev instances.".
> > > >
> > > > So I just want to confirm with you that (in case you miss):
> > > >
> > > >     /sys/class/mdev/mdev_start | mdev_stop
> > > >
> > > > doesn't require the knowledge of parent device.
> > > >
> > >
> > > Qemu is just one example, where your explanation of parent device
> > > makes sense but still it's not good for Qemu to populate /sys/class/mdev
> > > directly. Qemu is passed with the actual sysfs path of assigned mdev
> > > instance, so any mdev attributes touched by Qemu should be put under
> > > that node (e.g. start/stop for live migration usage as I explained earlier).
> > 
> > Exactly, qemu is passed with the actual sysfs path.
> > 
> > So, QEMU doesn't touch the file /sys/class/mdev/mdev_start | mdev_stop at all.
> > 
> > QEMU will take the sysfs path as input:
> > 
> >  -device
> > vfio-pci,sysfsdev=/sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0,i
> > d=vgpu0
> 
> no need of passing "id=vgpu0" here. If necessary you can put id as an attribute 
> under sysfs mdev node:
> 
> /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/id

I think we have moved away from the device index based on Alex's comment, so the
device path will be:

 /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818

> 
> > 
> > As you are saying in live migration, QEMU needs to access "start" and "stop".  Could you
> > please share more details, such as how QEMU access the "start" and "stop" sysfs,
> > when and what triggers that?
> > 
> 
> A conceptual flow as below:
> 
> 1. Quiesce mdev activity on the parent device (e.g. stop scheduling, wait for
> in-flight DMA to complete, etc.)
> 
> echo "0" > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/start
> 
> 2. Save mdev state:
> 
> cat /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/state > xxx
> 
> 3. xxx will be part of the final VM image and copied to a new machine
> 
> 4. Allocate/prepare mdev on the new machine for this VM
> 
> 5. Restore mdev state:
> 
> cat xxx > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/state
> (might be a different path name)
> 
> 6. start mdev on the new parent device:
> 
> echo "1" > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/start
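
The six quoted steps can be collected into two shell helpers. This is only a sketch: the per-mdev "start" and "state" attributes are proposals under discussion in this thread, not a merged interface, so the sysfs root is parameterized and can point at a scratch directory while experimenting:

```shell
# Root of the mdev sysfs tree; override to exercise the logic against a
# scratch directory (the "start"/"state" attributes are hypothetical).
MDEV_SYSFS=${MDEV_SYSFS:-/sys/bus/mdev/devices}

# Steps 1-2: quiesce the mdev, then save its state into an image file.
save_mdev() {                          # $1 = mdev UUID, $2 = state file
    echo 0 > "$MDEV_SYSFS/$1/start" &&
    cat "$MDEV_SYSFS/$1/state" > "$2"
}

# Steps 5-6: restore the saved state on the new machine, then start it.
restore_mdev() {                       # $1 = mdev UUID, $2 = state file
    cat "$2" > "$MDEV_SYSFS/$1/state" &&
    echo 1 > "$MDEV_SYSFS/$1/start"
}
```

Step 3 (shipping the state file with the VM image) and step 4 (creating the mdev on the destination) happen outside these helpers.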

Thanks for the sequence. Based on the live migration flow above, the accesses to
"start/stop" come from another user space program, not the QEMU process.

(Just to be clear, I am not saying that I won't consider your suggestion of moving
the "start/stop" files from the global location to the per-mdev node, but I do want
to point out that keeping them global shouldn't impact your live migration sequence
above.)

Thanks,
Neo

> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 100+ messages in thread

* RE: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-16  5:43                           ` [Qemu-devel] " Neo Jia
@ 2016-08-16  5:58                             ` Tian, Kevin
  -1 siblings, 0 replies; 100+ messages in thread
From: Tian, Kevin @ 2016-08-16  5:58 UTC (permalink / raw)
  To: Neo Jia
  Cc: Kirti Wankhede, Alex Williamson, pbonzini, kraxel, qemu-devel,
	kvm, Song, Jike, bjsdjshi

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Tuesday, August 16, 2016 1:44 PM
> 
> On Tue, Aug 16, 2016 at 04:52:30AM +0000, Tian, Kevin wrote:
> > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > Sent: Tuesday, August 16, 2016 12:17 PM
> > >
> > > On Tue, Aug 16, 2016 at 03:50:44AM +0000, Tian, Kevin wrote:
> > > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > > Sent: Tuesday, August 16, 2016 11:46 AM
> > > > >
> > > > > On Tue, Aug 16, 2016 at 12:30:25AM +0000, Tian, Kevin wrote:
> > > > > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > > > > Sent: Tuesday, August 16, 2016 3:59 AM
> > > > >
> > > > > > > > >
> > > > > > > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM
> in
> > > > > > > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > > > > > > some common resources.
> > > > > > > >
> > > > > > > > Kirti, can you elaborate the background about above one-shot commit
> > > > > > > > requirement? It's hard to understand such a requirement.
> > > > > > > >
> > > > > > > > As I replied in another mail, I really hope start/stop become a per-mdev
> > > > > > > > attribute instead of global one, e.g.:
> > > > > > > >
> > > > > > > > echo "0/1" >
> > > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > > > > > > >
> > > > > > > > In many scenarios the user space client may only want to talk to mdev
> > > > > > > > instance directly, w/o need to contact its parent device. Still take
> > > > > > > > live migration for example, I don't think Qemu wants to know parent
> > > > > > > > device of assigned mdev instances.
> > > > > > >
> > > > > > > Hi Kevin,
> > > > > > >
> > > > > > > Having a global /sys/class/mdev/mdev_start doesn't require anybody to
> know
> > > > > > > parent device. you can just do
> > > > > > >
> > > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > > > > > >
> > > > > > > or
> > > > > > >
> > > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > > > > > >
> > > > > > > without knowing the parent device.
> > > > > > >
> > > > > >
> > > > > > You can look at some existing sysfs example, e.g.:
> > > > > >
> > > > > > echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> > > > > >
> > > > > > You may also argue why not using a global style:
> > > > > >
> > > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> > > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> > > > > >
> > > > > > There are many similar examples...
> > > > >
> > > > > Hi Kevin,
> > > > >
> > > > > My response above is to your question about using the global sysfs entry as you
> > > > > don't want to have the global path because
> > > > >
> > > > > "I don't think Qemu wants to know parent device of assigned mdev instances.".
> > > > >
> > > > > So I just want to confirm with you that (in case you miss):
> > > > >
> > > > >     /sys/class/mdev/mdev_start | mdev_stop
> > > > >
> > > > > doesn't require the knowledge of parent device.
> > > > >
> > > >
> > > > Qemu is just one example, where your explanation of parent device
> > > > makes sense but still it's not good for Qemu to populate /sys/class/mdev
> > > > directly. Qemu is passed with the actual sysfs path of assigned mdev
> > > > instance, so any mdev attributes touched by Qemu should be put under
> > > > that node (e.g. start/stop for live migration usage as I explained earlier).
> > >
> > > Exactly, qemu is passed with the actual sysfs path.
> > >
> > > So, QEMU doesn't touch the file /sys/class/mdev/mdev_start | mdev_stop at all.
> > >
> > > QEMU will take the sysfs path as input:
> > >
> > >  -device
> > >
> vfio-pci,sysfsdev=/sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0,i
> > > d=vgpu0
> >
> > no need of passing "id=vgpu0" here. If necessary you can put id as an attribute
> > under sysfs mdev node:
> >
> > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/id
> 
> I think we have moved away from the device index based on Alex's comment, so the
> device path will be:
> 
>  /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818

Pass /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818
as the parameter, and then Qemu can access 'id' under that path. You
don't need to pass a separate 'id' field. That's my point.
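
Kevin's point, sketched: once Qemu holds the sysfsdev path, any per-mdev attribute is just a relative read under it (the "id" attribute here is purely hypothetical, used only to illustrate the lookup):

```shell
# Read a per-mdev attribute relative to the sysfsdev path Qemu was given.
# The attribute name ("id" in the discussion above) is hypothetical.
mdev_attr() {                 # $1 = sysfsdev path, $2 = attribute name
    cat "$1/$2"
}

# e.g.: mdev_attr /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818 id
```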


> 
> >
> > >
> > > As you are saying in live migration, QEMU needs to access "start" and "stop".  Could
> you
> > > please share more details, such as how QEMU access the "start" and "stop" sysfs,
> > > when and what triggers that?
> > >
> >
> > A conceptual flow as below:
> >
> > 1. Quiesce mdev activity on the parent device (e.g. stop scheduling, wait for
> > in-flight DMA to complete, etc.)
> >
> > echo "0" > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/start
> >
> > 2. Save mdev state:
> >
> > cat /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/state > xxx
> >
> > 3. xxx will be part of the final VM image and copied to a new machine
> >
> > 4. Allocate/prepare mdev on the new machine for this VM
> >
> > 5. Restore mdev state:
> >
> > cat xxx > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/state
> > (might be a different path name)
> >
> > 6. start mdev on the new parent device:
> >
> > echo "1" > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/start
> 
> Thanks for the sequence. Based on the live migration flow above, the accesses to
> "start/stop" come from another user space program, not the QEMU process.
> 
> (Just to be clear, I am not saying that I won't consider your suggestion of moving
> the "start/stop" files from the global location to the per-mdev node, but I do want
> to point out that keeping them global shouldn't impact your live migration sequence
> above.)
> 

Come on... I just used the above bash commands to show the steps. Qemu itself
definitely needs to open the sysfs file descriptor and then read/write the fd... If
we used the VFIO API as the example, it might be more obvious. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-16  5:58                             ` [Qemu-devel] " Tian, Kevin
@ 2016-08-16  6:13                               ` Neo Jia
  -1 siblings, 0 replies; 100+ messages in thread
From: Neo Jia @ 2016-08-16  6:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, Alex Williamson, pbonzini, kraxel, qemu-devel,
	kvm, Song, Jike, bjsdjshi

On Tue, Aug 16, 2016 at 05:58:54AM +0000, Tian, Kevin wrote:
> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Tuesday, August 16, 2016 1:44 PM
> > 
> > On Tue, Aug 16, 2016 at 04:52:30AM +0000, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > Sent: Tuesday, August 16, 2016 12:17 PM
> > > >
> > > > On Tue, Aug 16, 2016 at 03:50:44AM +0000, Tian, Kevin wrote:
> > > > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > > > Sent: Tuesday, August 16, 2016 11:46 AM
> > > > > >
> > > > > > On Tue, Aug 16, 2016 at 12:30:25AM +0000, Tian, Kevin wrote:
> > > > > > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > > > > > Sent: Tuesday, August 16, 2016 3:59 AM
> > > > > >
> > > > > > > > > >
> > > > > > > > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM
> > in
> > > > > > > > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > > > > > > > some common resources.
> > > > > > > > >
> > > > > > > > > Kirti, can you elaborate the background about above one-shot commit
> > > > > > > > > requirement? It's hard to understand such a requirement.
> > > > > > > > >
> > > > > > > > > As I replied in another mail, I really hope start/stop become a per-mdev
> > > > > > > > > attribute instead of global one, e.g.:
> > > > > > > > >
> > > > > > > > > echo "0/1" >
> > > > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> > > > > > > > >
> > > > > > > > > In many scenarios the user space client may only want to talk to mdev
> > > > > > > > > instance directly, w/o need to contact its parent device. Still take
> > > > > > > > > live migration for example, I don't think Qemu wants to know parent
> > > > > > > > > device of assigned mdev instances.
> > > > > > > >
> > > > > > > > Hi Kevin,
> > > > > > > >
> > > > > > > > Having a global /sys/class/mdev/mdev_start doesn't require anybody to
> > know
> > > > > > > > parent device. you can just do
> > > > > > > >
> > > > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > > > > > > >
> > > > > > > > or
> > > > > > > >
> > > > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > > > > > > >
> > > > > > > > without knowing the parent device.
> > > > > > > >
> > > > > > >
> > > > > > > You can look at some existing sysfs example, e.g.:
> > > > > > >
> > > > > > > echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> > > > > > >
> > > > > > > You may also argue why not using a global style:
> > > > > > >
> > > > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> > > > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> > > > > > >
> > > > > > > There are many similar examples...
> > > > > >
> > > > > > Hi Kevin,
> > > > > >
> > > > > > My response above is to your question about using the global sysfs entry as you
> > > > > > don't want to have the global path because
> > > > > >
> > > > > > "I don't think Qemu wants to know parent device of assigned mdev instances.".
> > > > > >
> > > > > > So I just want to confirm with you that (in case you miss):
> > > > > >
> > > > > >     /sys/class/mdev/mdev_start | mdev_stop
> > > > > >
> > > > > > doesn't require the knowledge of parent device.
> > > > > >
> > > > >
> > > > > Qemu is just one example, where your explanation of parent device
> > > > > makes sense but still it's not good for Qemu to populate /sys/class/mdev
> > > > > directly. Qemu is passed with the actual sysfs path of assigned mdev
> > > > > instance, so any mdev attributes touched by Qemu should be put under
> > > > > that node (e.g. start/stop for live migration usage as I explained earlier).
> > > >
> > > > Exactly, qemu is passed with the actual sysfs path.
> > > >
> > > > So, QEMU doesn't touch the file /sys/class/mdev/mdev_start | mdev_stop at all.
> > > >
> > > > QEMU will take the sysfs path as input:
> > > >
> > > >  -device
> > > >
> > vfio-pci,sysfsdev=/sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0,i
> > > > d=vgpu0
> > >
> > > no need of passing "id=vgpu0" here. If necessary you can put id as an attribute
> > > under sysfs mdev node:
> > >
> > > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/id
> > 
> > I think we have moved away from the device index based on Alex's comment, so the
> > device path will be:
> > 
> >  /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818
> 
> pass /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818
> as parameter, and then Qemu can access 'id' under that path. You
> don't need to pass a separate 'id' field. That's my point.
> 
> 
> > 
> > >
> > > >
> > > > As you are saying in live migration, QEMU needs to access "start" and "stop".  Could
> > you
> > > > please share more details, such as how QEMU access the "start" and "stop" sysfs,
> > > > when and what triggers that?
> > > >
> > >
> > > A conceptual flow as below:
> > >
> > > 1. Quiesce mdev activity on the parent device (e.g. stop scheduling, wait for
> > > in-flight DMA to complete, etc.)
> > >
> > > echo "0" > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/start
> > >
> > > 2. Save mdev state:
> > >
> > > cat /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/state > xxx
> > >
> > > 3. xxx will be part of the final VM image and copied to a new machine
> > >
> > > 4. Allocate/prepare mdev on the new machine for this VM
> > >
> > > 5. Restore mdev state:
> > >
> > > cat xxx > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/state
> > > (might be a different path name)
> > >
> > > 6. start mdev on the new parent device:
> > >
> > > echo "1" > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/start
> > 
> > Thanks for the sequence. Based on the live migration flow above, the accesses to
> > "start/stop" come from another user space program, not the QEMU process.
> > 
> > (Just to be clear, I am not saying that I won't consider your suggestion of moving
> > the "start/stop" files from the global location to the per-mdev node, but I do want
> > to point out that keeping them global shouldn't impact your live migration sequence
> > above.)
> > 
> 
> Come on... I just used the above bash commands to show the steps. Qemu itself
> definitely needs to open the sysfs file descriptor and then read/write the fd... If
> we used the VFIO API as the example, it might be more obvious. :-)

Hi Kevin,

I am not just picking on this example.

The only fd that QEMU will access is the VFIO device fd (apart from the
container/IOMMU type1 setup), and there are no changes required for QEMU.

Thanks,
Neo

> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-16  4:52                         ` [Qemu-devel] " Tian, Kevin
  (?)
  (?)
@ 2016-08-16 12:49                         ` Alex Williamson
  -1 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-16 12:49 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Neo Jia, Song, Jike, kvm, qemu-devel, Kirti Wankhede, kraxel,
	pbonzini, bjsdjshi

On Tue, 16 Aug 2016 04:52:30 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Tuesday, August 16, 2016 12:17 PM
> > 
> > On Tue, Aug 16, 2016 at 03:50:44AM +0000, Tian, Kevin wrote:  
> > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > Sent: Tuesday, August 16, 2016 11:46 AM
> > > >
> > > > On Tue, Aug 16, 2016 at 12:30:25AM +0000, Tian, Kevin wrote:  
> > > > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > > > Sent: Tuesday, August 16, 2016 3:59 AM  
> > > >  
> > > > > > > >
> > > > > > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM in
> > > > > > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > > > > > some common resources.  
> > > > > > >
> > > > > > > Kirti, can you elaborate the background about above one-shot commit
> > > > > > > requirement? It's hard to understand such a requirement.
> > > > > > >
> > > > > > > As I replied in another mail, I really hope start/stop become a per-mdev
> > > > > > > attribute instead of global one, e.g.:
> > > > > > >
> > > > > > > echo "0/1" >  
> > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start  
> > > > > > >
> > > > > > > In many scenarios the user space client may only want to talk to mdev
> > > > > > > instance directly, w/o need to contact its parent device. Still take
> > > > > > > live migration for example, I don't think Qemu wants to know parent
> > > > > > > device of assigned mdev instances.  
> > > > > >
> > > > > > Hi Kevin,
> > > > > >
> > > > > > Having a global /sys/class/mdev/mdev_start doesn't require anybody to know
> > > > > > parent device. you can just do
> > > > > >
> > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > > > > >
> > > > > > or
> > > > > >
> > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > > > > >
> > > > > > without knowing the parent device.
> > > > > >  
> > > > >
> > > > > You can look at some existing sysfs example, e.g.:
> > > > >
> > > > > echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> > > > >
> > > > > You may also argue why not using a global style:
> > > > >
> > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> > > > >
> > > > > There are many similar examples...  
> > > >
> > > > Hi Kevin,
> > > >
> > > > My response above is to your question about using the global sysfs entry as you
> > > > don't want to have the global path because
> > > >
> > > > "I don't think Qemu wants to know parent device of assigned mdev instances.".
> > > >
> > > > So I just want to confirm with you that (in case you miss):
> > > >
> > > >     /sys/class/mdev/mdev_start | mdev_stop
> > > >
> > > > doesn't require the knowledge of parent device.
> > > >  
> > >
> > > Qemu is just one example, where your explanation of parent device
> > > makes sense but still it's not good for Qemu to populate /sys/class/mdev
> > > directly. Qemu is passed with the actual sysfs path of assigned mdev
> > > instance, so any mdev attributes touched by Qemu should be put under
> > > that node (e.g. start/stop for live migration usage as I explained earlier).  
> > 
> > Exactly, qemu is passed with the actual sysfs path.
> > 
> > So, QEMU doesn't touch the file /sys/class/mdev/mdev_start | mdev_stop at all.
> > 
> > QEMU will take the sysfs path as input:
> > 
> >  -device
> > vfio-pci,sysfsdev=/sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0,i
> > d=vgpu0  
> 
> no need of passing "id=vgpu0" here. If necessary you can put id as an attribute 
> under sysfs mdev node:
> 
> /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/id

QEMU needs an id parameter for devices, libvirt gives devices arbitrary
names, typically hostdev# for assigned devices.  This id is used to
reference the device for hmp/qmp commands.  This is not something the
mdev infrastructure should define.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-15 22:47               ` Alex Williamson
  2016-08-15 23:54                   ` [Qemu-devel] " Neo Jia
  2016-08-16  0:18                   ` [Qemu-devel] " Tian, Kevin
@ 2016-08-16 20:30                 ` Neo Jia
  2016-08-16 20:51                   ` Alex Williamson
  2 siblings, 1 reply; 100+ messages in thread
From: Neo Jia @ 2016-08-16 20:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Song, Jike, kvm, qemu-devel, Kirti Wankhede, kraxel,
	pbonzini, bjsdjshi

On Mon, Aug 15, 2016 at 04:47:41PM -0600, Alex Williamson wrote:
> On Mon, 15 Aug 2016 12:59:08 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > > > >
> > > > > I'm not sure a comma separated list makes sense here, for both
> > > > > simplicity in the kernel and more fine grained error reporting, we
> > > > > probably want to start/stop them individually.  Actually, why is it
> > > > > that we can't use the mediated device being opened and released to
> > > > > automatically signal to the backend vendor driver to commit and release
> > > > > resources? I don't fully understand why userspace needs this interface.  
> > > 
> 
> That doesn't give an individual user the ability to stop and start
> their devices though, because in order for a user to have write
> permissions there, they get permission to DoS other users by pumping
> arbitrary UUIDs into those files.  By placing start/stop per mdev, we
> have mdev level granularity of granting start/stop privileges.  Really
> though, do we want QEMU fumbling around through sysfs or do we want an
> interface through the vfio API to perform start/stop?  Thanks,

Hi Alex,

I think those two suggestions make sense, so we will move the "start/stop"
under mdev sysfs. 

This will be incorporated in our next v7 patch and by doing that, it will make
the locking scheme easier.

Thanks,
Neo

> 
> Alex

* Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-16 20:30                 ` Neo Jia
@ 2016-08-16 20:51                   ` Alex Williamson
  2016-08-16 21:17                     ` Neo Jia
  0 siblings, 1 reply; 100+ messages in thread
From: Alex Williamson @ 2016-08-16 20:51 UTC (permalink / raw)
  To: Neo Jia
  Cc: Tian, Kevin, Song, Jike, kvm, qemu-devel, Kirti Wankhede, kraxel,
	pbonzini, bjsdjshi

On Tue, 16 Aug 2016 13:30:06 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Mon, Aug 15, 2016 at 04:47:41PM -0600, Alex Williamson wrote:
> > On Mon, 15 Aug 2016 12:59:08 -0700
> > Neo Jia <cjia@nvidia.com> wrote:
> >   
> > > > > >
> > > > > > I'm not sure a comma separated list makes sense here, for both
> > > > > > simplicity in the kernel and more fine grained error reporting, we
> > > > > > probably want to start/stop them individually.  Actually, why is it
> > > > > > that we can't use the mediated device being opened and released to
> > > > > > automatically signal to the backend vendor driver to commit and release
> > > > > > resources? I don't fully understand why userspace needs this interface.    
> > > >   
> > 
> > That doesn't give an individual user the ability to stop and start
> > their devices though, because in order for a user to have write
> > permissions there, they get permission to DoS other users by pumping
> > arbitrary UUIDs into those files.  By placing start/stop per mdev, we
> > have mdev level granularity of granting start/stop privileges.  Really
> > though, do we want QEMU fumbling around through sysfs or do we want an
> > interface through the vfio API to perform start/stop?  Thanks,  
> 
> Hi Alex,
> 
> I think those two suggestions make sense, so we will move the "start/stop"
> under mdev sysfs. 
> 
> This will be incorporated in our next v7 patch and by doing that, it will make
> the locking scheme easier.

Thanks Neo.  Also note that the semantics change when we move to per
device control.  It would be redundant to 'echo $UUID' into a start
file which only controls a single device.  So that means we probably
just want an 'echo 1'.  But if we can 'echo 1' then we can also 'echo
0', so we can reduce this to a single sysfs file.  Sysfs already has a
common interface for starting and stopping devices, the "online" file.
So I think we should probably move in that direction.  Additionally, an
"online" file should support a _show() function, so if we have an Intel
vGPU that perhaps does not need start/stop support, online could report
"1" after create to show that it's already online, possibly even
generate an error trying to change the online state.  Thanks,

Alex

* Re: [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-16  6:13                               ` [Qemu-devel] " Neo Jia
@ 2016-08-16 21:03                                 ` Alex Williamson
  -1 siblings, 0 replies; 100+ messages in thread
From: Alex Williamson @ 2016-08-16 21:03 UTC (permalink / raw)
  To: Neo Jia
  Cc: Tian, Kevin, Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm,
	Song, Jike, bjsdjshi

On Mon, 15 Aug 2016 23:13:20 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Tue, Aug 16, 2016 at 05:58:54AM +0000, Tian, Kevin wrote:
> > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > Sent: Tuesday, August 16, 2016 1:44 PM
> > > 
> > > On Tue, Aug 16, 2016 at 04:52:30AM +0000, Tian, Kevin wrote:  
> > > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > > Sent: Tuesday, August 16, 2016 12:17 PM
> > > > >
> > > > > On Tue, Aug 16, 2016 at 03:50:44AM +0000, Tian, Kevin wrote:  
> > > > > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > > > > Sent: Tuesday, August 16, 2016 11:46 AM
> > > > > > >
> > > > > > > On Tue, Aug 16, 2016 at 12:30:25AM +0000, Tian, Kevin wrote:  
> > > > > > > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > > > > > > Sent: Tuesday, August 16, 2016 3:59 AM  
> > > > > > >  
> > > > > > > > > > >
> > > > > > > > > > > For NVIDIA vGPU solution we need to know all devices assigned to a VM  
> > > in  
> > > > > > > > > > > one shot to commit resources of all vGPUs assigned to a VM along with
> > > > > > > > > > > some common resources.  
> > > > > > > > > >
> > > > > > > > > > Kirti, can you elaborate the background about above one-shot commit
> > > > > > > > > > requirement? It's hard to understand such a requirement.
> > > > > > > > > >
> > > > > > > > > > As I replied in another mail, I really hope start/stop becomes a per-mdev
> > > > > > > > > > attribute instead of global one, e.g.:
> > > > > > > > > >
> > > > > > > > > > echo "0/1" >  
> > > > > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start  
> > > > > > > > > >
> > > > > > > > > > In many scenario the user space client may only want to talk to mdev
> > > > > > > > > > instance directly, w/o need to contact its parent device. Still take
> > > > > > > > > > live migration for example, I don't think Qemu wants to know parent
> > > > > > > > > > device of assigned mdev instances.  
> > > > > > > > >
> > > > > > > > > Hi Kevin,
> > > > > > > > >
> > > > > > > > > Having a global /sys/class/mdev/mdev_start doesn't require anybody to  
> > > know  
> > > > > > > > > parent device. you can just do
> > > > > > > > >
> > > > > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_start
> > > > > > > > >
> > > > > > > > > or
> > > > > > > > >
> > > > > > > > > echo "mdev_UUID" > /sys/class/mdev/mdev_stop
> > > > > > > > >
> > > > > > > > > without knowing the parent device.
> > > > > > > > >  
> > > > > > > >
> > > > > > > > You can look at some existing sysfs example, e.g.:
> > > > > > > >
> > > > > > > > echo "0/1" > /sys/bus/cpu/devices/cpu1/online
> > > > > > > >
> > > > > > > > You may also argue why not using a global style:
> > > > > > > >
> > > > > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_online
> > > > > > > > echo "cpu1" > /sys/bus/cpu/devices/cpu_offline
> > > > > > > >
> > > > > > > > There are many similar examples...  
> > > > > > >
> > > > > > > Hi Kevin,
> > > > > > >
> > > > > > > My response above is to your question about using the global sysfs entry as you
> > > > > > > don't want to have the global path because
> > > > > > >
> > > > > > > "I don't think Qemu wants to know parent device of assigned mdev instances.".
> > > > > > >
> > > > > > > So I just want to confirm with you that (in case you miss):
> > > > > > >
> > > > > > >     /sys/class/mdev/mdev_start | mdev_stop
> > > > > > >
> > > > > > > doesn't require the knowledge of parent device.
> > > > > > >  
> > > > > >
> > > > > > Qemu is just one example, where your explanation of parent device
> > > > > > makes sense but still it's not good for Qemu to populate /sys/class/mdev
> > > > > > directly. Qemu is passed with the actual sysfs path of assigned mdev
> > > > > > instance, so any mdev attributes touched by Qemu should be put under
> > > > > > that node (e.g. start/stop for live migration usage as I explained earlier).  
> > > > >
> > > > > Exactly, qemu is passed with the actual sysfs path.
> > > > >
> > > > > So, QEMU doesn't touch the file /sys/class/mdev/mdev_start | mdev_stop at all.
> > > > >
> > > > > QEMU will take the sysfs path as input:
> > > > >
> > > > >  -device
> > > > >  
> > > vfio-pci,sysfsdev=/sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0,i  
> > > > > d=vgpu0  
> > > >
> > > > no need of passing "id=vgpu0" here. If necessary you can put id as an attribute
> > > > under sysfs mdev node:
> > > >
> > > > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/id  
> > > 
> > > I think we have moved away from the device index based on Alex's comment, so the
> > > device path will be:
> > > 
> > >  /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818  
> > 
> > pass /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818
> > as parameter, and then Qemu can access 'id' under that path. You
> > don't need to pass a separate 'id' field. That's my point.

As noted in previous reply, id is a QEMU device parameter, it's not a
property of the mdev device.  I'd also caution against adding arbitrary
sysfs files with the expectation that QEMU will be able to manipulate
them.  I believe one of the benefits of vfio vs legacy KVM device
assignment is that vfio is self contained through the vfio API.  Each
time we expect QEMU to interact with the system via sysfs, that's a new
file that libvirt needs to add access to and a new attack vector where
we need to worry about security.  I still think it's the right idea
from a sysfs perspective to move to per device files, or an "online"
file to replace both, but let's be a little more strategic how we
expect QEMU to interact with the device.

> > > > > As you are saying in live migration, QEMU needs to access "start" and "stop".  Could  
> > > you  
> > > > > please share more details, such as how QEMU access the "start" and "stop" sysfs,
> > > > > when and what triggers that?
> > > > >  
> > > >
> > > > A conceptual flow as below:
> > > >
> > > > 1. Quiesce mdev activity on the parent device (e.g. stop scheduling, wait for
> > > > in-flight DMA to complete, etc.)
> > > >
> > > > echo "0" > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/start
> > > >
> > > > 2. Save mdev state:
> > > >
> > > > cat /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/state > xxx
> > > >
> > > > 3. xxx will be part of the final VM image and copied to a new machine
> > > >
> > > > 4. Allocate/prepare mdev on the new machine for this VM
> > > >
> > > > 5. Restore mdev state:
> > > >
> > > > cat xxx > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/state
> > > > (might be a different path name)
> > > >
> > > > 6. start mdev on the new parent device:
> > > >
> > > > echo "1" > /sys/bus/mdev/devices/c0b26072-dd1b-4340-84fe-bf338c510818-0/start  

For example, another solution to this would be a device specific vfio
region dedicated to starting and stopping the device and manipulating
state.  A simplistic API might be that the first dword of this region
is reserved for control of the device, the ability to pause and resume
the device, and reads and writes to the remainder of the region collect
and restore device state, respectively.  We'd need to spend more time
thinking about such an API, but something like this could be added
between the kernel and QEMU as a feature that doesn't change the
existing API.  Thanks,

Alex

* Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver
  2016-08-16 20:51                   ` Alex Williamson
@ 2016-08-16 21:17                     ` Neo Jia
  0 siblings, 0 replies; 100+ messages in thread
From: Neo Jia @ 2016-08-16 21:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Song, Jike, kvm, qemu-devel, Kirti Wankhede, kraxel,
	pbonzini, bjsdjshi

On Tue, Aug 16, 2016 at 02:51:03PM -0600, Alex Williamson wrote:
> On Tue, 16 Aug 2016 13:30:06 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > On Mon, Aug 15, 2016 at 04:47:41PM -0600, Alex Williamson wrote:
> > > On Mon, 15 Aug 2016 12:59:08 -0700
> > > Neo Jia <cjia@nvidia.com> wrote:
> > >   
> > > > > > >
> > > > > > > I'm not sure a comma separated list makes sense here, for both
> > > > > > > simplicity in the kernel and more fine grained error reporting, we
> > > > > > > probably want to start/stop them individually.  Actually, why is it
> > > > > > > that we can't use the mediated device being opened and released to
> > > > > > > automatically signal to the backend vendor driver to commit and release
> > > > > > > resources? I don't fully understand why userspace needs this interface.    
> > > > >   
> > > 
> > > That doesn't give an individual user the ability to stop and start
> > > their devices though, because in order for a user to have write
> > > permissions there, they get permission to DoS other users by pumping
> > > arbitrary UUIDs into those files.  By placing start/stop per mdev, we
> > > have mdev level granularity of granting start/stop privileges.  Really
> > > though, do we want QEMU fumbling around through sysfs or do we want an
> > > interface through the vfio API to perform start/stop?  Thanks,  
> > 
> > Hi Alex,
> > 
> > I think those two suggestions make sense, so we will move the "start/stop"
> > under mdev sysfs. 
> > 
> > This will be incorporated in our next v7 patch and by doing that, it will make
> > the locking scheme easier.
> 
> Thanks Neo.  Also note that the semantics change when we move to per
> device control.  It would be redundant to 'echo $UUID' into a start
> file which only controls a single device.  So that means we probably
> just want an 'echo 1'.  But if we can 'echo 1' then we can also 'echo
> 0', so we can reduce this to a single sysfs file.  Sysfs already has a
> common interface for starting and stopping devices, the "online" file.
> So I think we should probably move in that direction.  Additionally, an
> "online" file should support a _show() function, so if we have an Intel
> vGPU that perhaps does not need start/stop support, online could report
> "1" after create to show that it's already online, possibly even
> generate an error trying to change the online state.  Thanks,

Agreed. We will adopt a similar syntax and support the _show() function.

Thanks,
Neo

> 
> Alex

* Re: [Qemu-devel] [PATCH v6 4/4] docs: Add Documentation for Mediated devices
  2016-08-03 19:03   ` [Qemu-devel] " Kirti Wankhede
  (?)
  (?)
@ 2016-08-24 22:36   ` Daniel P. Berrange
  -1 siblings, 0 replies; 100+ messages in thread
From: Daniel P. Berrange @ 2016-08-24 22:36 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, jike.song, kvm,
	kevin.tian, qemu-devel, bjsdjshi

On Thu, Aug 04, 2016 at 12:33:54AM +0530, Kirti Wankhede wrote:
> diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt
> new file mode 100644
> index 000000000000..029152670141
> --- /dev/null
> +++ b/Documentation/vfio-mediated-device.txt
> @@ -0,0 +1,235 @@
> +Mediated device management interface via sysfs
> +-------------------------------------------------------------------------------
> +This is the interface that allows user space software, like libvirt, to query
> +and configure mediated device in a HW agnostic fashion. This management
> +interface provide flexibility to underlying physical device's driver to support
> +mediated device hotplug, multiple mediated devices per virtual machine, multiple
> +mediated devices from different physical devices, etc.

A key point from the libvirt POV is that we want to be able to use the
sysfs interfaces without having to write vendor specific custom code for
each vendor's hardware.

> +Under per-physical device sysfs:
> +--------------------------------
> +
> +* mdev_supported_types: (read only)
> +    List the current supported mediated device types and its details.

This really ought to describe the data format that is to be reported,
as from libvirt POV we don't want to see every vendor's driver reporting
arbitrarily different information here.

> +* mdev_create: (write only)
> +	Create a mediated device on target physical device.
> +	Input syntax: <UUID:idx:params>
> +	where,
> +		UUID: mediated device's UUID
> +		idx: mediated device index inside a VM
> +		params: extra parameters required by driver

There's no specification of what 'params' is - it just looks like an
arbitrary vendor-specific blob, which is not something that's
particularly pleasant to use. How would a userspace application
discover what parameters exist, and whether they are required or
optional? And how do we get standardization of those parameters across
different vendors' vGPU drivers, so we don't have each vendor doing
something different?

> +	Example:
> +	# echo "12345678-1234-1234-1234-123456789abc:0:0" >
> +				 /sys/bus/pci/devices/0000\:05\:00.0/mdev_create
> +
> +* mdev_destroy: (write only)
> +	Destroy a mediated device on a target physical device.
> +	Input syntax: <UUID:idx>
> +	where,
> +		UUID: mediated device's UUID
> +		idx: mediated device index inside a VM
> +	Example:
> +	# echo "12345678-1234-1234-1234-123456789abc:0" >
> +			       /sys/bus/pci/devices/0000\:05\:00.0/mdev_destroy

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

end of thread, other threads:[~2016-08-24 22:36 UTC | newest]

Thread overview: 100+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-03 19:03 [PATCH v6 0/4] Add Mediated device support Kirti Wankhede
2016-08-03 19:03 ` [Qemu-devel] " Kirti Wankhede
2016-08-03 19:03 ` [PATCH v6 1/4] vfio: Mediated device Core driver Kirti Wankhede
2016-08-03 19:03   ` [Qemu-devel] " Kirti Wankhede
2016-08-04  7:21   ` Tian, Kevin
2016-08-04  7:21     ` [Qemu-devel] " Tian, Kevin
2016-08-05  6:13     ` Kirti Wankhede
2016-08-05  6:13       ` [Qemu-devel] " Kirti Wankhede
2016-08-15  9:15       ` Tian, Kevin
2016-08-15  9:15         ` [Qemu-devel] " Tian, Kevin
2016-08-09 19:00   ` Alex Williamson
2016-08-09 19:00     ` [Qemu-devel] " Alex Williamson
2016-08-12 18:44     ` Kirti Wankhede
2016-08-12 18:44       ` [Qemu-devel] " Kirti Wankhede
2016-08-12 21:16       ` Alex Williamson
2016-08-12 21:16         ` [Qemu-devel] " Alex Williamson
2016-08-13  0:37         ` Kirti Wankhede
2016-08-13  0:37           ` [Qemu-devel] " Kirti Wankhede
2016-08-15  9:38           ` Tian, Kevin
2016-08-15  9:38             ` [Qemu-devel] " Tian, Kevin
2016-08-15 15:59             ` Alex Williamson
2016-08-15 15:59               ` [Qemu-devel] " Alex Williamson
2016-08-15 22:09               ` Neo Jia
2016-08-15 22:09                 ` [Qemu-devel] " Neo Jia
2016-08-15 22:52                 ` Alex Williamson
2016-08-15 22:52                   ` [Qemu-devel] " Alex Williamson
2016-08-15 23:23                   ` Neo Jia
2016-08-15 23:23                     ` [Qemu-devel] " Neo Jia
2016-08-16  0:49                     ` Tian, Kevin
2016-08-16  0:49                       ` [Qemu-devel] " Tian, Kevin
2016-08-15 19:59             ` Neo Jia
2016-08-15 19:59               ` [Qemu-devel] " Neo Jia
2016-08-15 22:47               ` Alex Williamson
2016-08-15 23:54                 ` Neo Jia
2016-08-15 23:54                   ` [Qemu-devel] " Neo Jia
2016-08-16  0:18                 ` Tian, Kevin
2016-08-16  0:18                   ` [Qemu-devel] " Tian, Kevin
2016-08-16 20:30                 ` Neo Jia
2016-08-16 20:51                   ` Alex Williamson
2016-08-16 21:17                     ` Neo Jia
2016-08-16  0:30               ` Tian, Kevin
2016-08-16  0:30                 ` [Qemu-devel] " Tian, Kevin
2016-08-16  3:45                 ` Neo Jia
2016-08-16  3:45                   ` [Qemu-devel] " Neo Jia
2016-08-16  3:50                   ` Tian, Kevin
2016-08-16  3:50                     ` [Qemu-devel] " Tian, Kevin
2016-08-16  4:16                     ` Neo Jia
2016-08-16  4:16                       ` [Qemu-devel] " Neo Jia
2016-08-16  4:52                       ` Tian, Kevin
2016-08-16  4:52                         ` [Qemu-devel] " Tian, Kevin
2016-08-16  5:43                         ` Neo Jia
2016-08-16  5:43                           ` [Qemu-devel] " Neo Jia
2016-08-16  5:58                           ` Tian, Kevin
2016-08-16  5:58                             ` [Qemu-devel] " Tian, Kevin
2016-08-16  6:13                             ` Neo Jia
2016-08-16  6:13                               ` [Qemu-devel] " Neo Jia
2016-08-16 21:03                               ` Alex Williamson
2016-08-16 21:03                                 ` [Qemu-devel] " Alex Williamson
2016-08-16 12:49                         ` Alex Williamson
2016-08-03 19:03 ` [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device Kirti Wankhede
2016-08-03 19:03   ` [Qemu-devel] " Kirti Wankhede
2016-08-03 21:03   ` kbuild test robot
2016-08-03 21:03     ` [Qemu-devel] " kbuild test robot
2016-08-04  0:19   ` kbuild test robot
2016-08-04  0:19     ` [Qemu-devel] " kbuild test robot
2016-08-09 19:00   ` Alex Williamson
2016-08-09 19:00     ` [Qemu-devel] " Alex Williamson
2016-08-10 21:23     ` Kirti Wankhede
2016-08-10 21:23       ` [Qemu-devel] " Kirti Wankhede
2016-08-10 23:00       ` Alex Williamson
2016-08-10 23:00         ` [Qemu-devel] " Alex Williamson
2016-08-11 15:59         ` Kirti Wankhede
2016-08-11 15:59           ` [Qemu-devel] " Kirti Wankhede
2016-08-11 16:24           ` Alex Williamson
2016-08-11 16:24             ` [Qemu-devel] " Alex Williamson
2016-08-11 17:46             ` Kirti Wankhede
2016-08-11 17:46               ` [Qemu-devel] " Kirti Wankhede
2016-08-11 18:43               ` Alex Williamson
2016-08-11 18:43                 ` [Qemu-devel] " Alex Williamson
2016-08-12 17:57                 ` Kirti Wankhede
2016-08-12 17:57                   ` [Qemu-devel] " Kirti Wankhede
2016-08-12 21:25                   ` Alex Williamson
2016-08-12 21:25                     ` [Qemu-devel] " Alex Williamson
2016-08-13  0:42                     ` Kirti Wankhede
2016-08-13  0:42                       ` [Qemu-devel] " Kirti Wankhede
2016-08-03 19:03 ` [PATCH v6 3/4] vfio iommu: Add support for mediated devices Kirti Wankhede
2016-08-03 19:03   ` [Qemu-devel] " Kirti Wankhede
2016-08-09 19:00   ` Alex Williamson
2016-08-09 19:00     ` [Qemu-devel] " Alex Williamson
2016-08-11 14:22     ` Kirti Wankhede
2016-08-11 14:22       ` [Qemu-devel] " Kirti Wankhede
2016-08-11 16:28       ` Alex Williamson
2016-08-11 16:28         ` [Qemu-devel] " Alex Williamson
2016-08-03 19:03 ` [PATCH v6 4/4] docs: Add Documentation for Mediated devices Kirti Wankhede
2016-08-03 19:03   ` [Qemu-devel] " Kirti Wankhede
2016-08-04  7:31   ` Tian, Kevin
2016-08-04  7:31     ` [Qemu-devel] " Tian, Kevin
2016-08-05  7:45     ` Kirti Wankhede
2016-08-05  7:45       ` [Qemu-devel] " Kirti Wankhede
2016-08-24 22:36   ` Daniel P. Berrange
