* [RFC PATCH v5 0/3] Add Mediated device support
@ 2016-06-20 16:31 ` Kirti Wankhede
  0 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-20 16:31 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: shuai.ruan, jike.song, kvm, kevin.tian, qemu-devel,
	Kirti Wankhede, zhiyuan.lv, bjsdjshi

This series adds Mediated device support to the v4.6 Linux host kernel. The
purpose of this series is to provide a common interface for mediated device
management that can be used by different devices. This series introduces an
Mdev core module that creates and manages mediated devices, a VFIO-based
driver for mediated PCI devices created by the Mdev core module, and updates
to the VFIO type1 IOMMU module to support mediated devices.

What's new in v5?
- Improved mdev_put_device() and mdev_get_device() for mediated devices, and
  locking for per-mdev_device registration callbacks.

What's left to do?
- Issues with the mmap region fault handler: the EPT is not correctly populated
  with the information provided by remap_pfn_range() inside the fault handler.

- An mmap invalidation mechanism will be added once the above issue is resolved.

Tested:
- Single vGPU VM
- Multiple vGPU VMs on the same GPU


Thanks,
Kirti


Kirti Wankhede (3):
  Mediated device Core driver
  VFIO driver for mediated PCI device
  VFIO Type1 IOMMU: Add support for mediated devices

 drivers/vfio/Kconfig                |   1 +
 drivers/vfio/Makefile               |   1 +
 drivers/vfio/mdev/Kconfig           |  18 +
 drivers/vfio/mdev/Makefile          |   6 +
 drivers/vfio/mdev/mdev_core.c       | 595 ++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_driver.c     | 138 ++++++++
 drivers/vfio/mdev/mdev_private.h    |  33 ++
 drivers/vfio/mdev/mdev_sysfs.c      | 300 +++++++++++++++++
 drivers/vfio/mdev/vfio_mpci.c       | 654 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
 drivers/vfio/vfio_iommu_type1.c     | 444 ++++++++++++++++++++++--
 include/linux/mdev.h                | 232 +++++++++++++
 include/linux/vfio.h                |  13 +
 14 files changed, 2404 insertions(+), 38 deletions(-)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c
 create mode 100644 include/linux/mdev.h

-- 
2.7.0



* [PATCH 1/3] Mediated device Core driver
  2016-06-20 16:31 ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-20 16:31   ` Kirti Wankhede
  -1 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-20 16:31 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	bjsdjshi, Kirti Wankhede

Design for the Mediated Device Driver:
The main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different
devices.

This module provides a generic interface to create a device, add it to the
mediated bus, add the device to an IOMMU group, and then add it to a VFIO
group.

Below is the high-level block diagram, with NVIDIA, Intel and IBM devices
as examples, since these are the devices that are going to actively use
this module as of now.

 +---------------+
 |               |
 | +-----------+ |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |           | |                         |              |
 | |  mdev     | +------------------------>+              |<-> VFIO user
 | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
 | |  driver   | |                         |              |
 | |           | |                         +--------------+
 | |           | |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |           | |                         |              |
 | |           | +------------------------>+              |<-> VFIO user
 | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
 |               |                         |              |
 |  MDEV CORE    |                         +--------------+
 |   MODULE      |
 |   mdev.ko     |
 | +-----------+ |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         |  nvidia.ko   |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--------------+
 | | interface | |<------------------------+              |
 | |           | |                         |  i915.ko     |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | |           | |
 | |           | |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | +-----------+ |
 +---------------+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when a new device is created
  * @remove: called when the device is removed
  * @match: called when a new device or driver is added for this bus.
	    Return 1 if the given device can be handled by the given
	    driver and zero otherwise.
  * @driver: device driver structure
  *
  **/
struct mdev_driver {
	const char *name;
	int  (*probe)  (struct device *dev);
	void (*remove) (struct device *dev);
	int  (*match)  (struct device *dev);
	struct device_driver driver;
};

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

A mediated device's driver should use this interface to register with the
core driver. With this, the mediated device driver for such devices is
responsible for adding the mediated device to the VFIO group.

2. Physical device driver interface
This interface provides vendor drivers a set of APIs to manage physical
device related work in their own drivers. The APIs are:
- supported_config: provide the list of configurations supported by the
		    vendor driver.
- create: to allocate basic resources in the vendor driver for a mediated
	  device.
- destroy: to free resources in the vendor driver when the mediated device
	   is destroyed.
- start: to initiate the mediated device initialization process from the
	 vendor driver when the VM boots, before QEMU starts.
- shutdown: to tear down mediated device resources during VM teardown.
- read: read emulation callback.
- write: write emulation callback.
- set_irqs: send the interrupt configuration information that QEMU sets.
- get_region_info: to provide the region size and its flags for the
		   mediated device.
- validate_map_request: to validate a remap pfn request.

This registration interface should be used by vendor drivers to register
each physical device with the mdev core driver.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I73a5084574270b14541c529461ea2f03c292d510
---
 drivers/vfio/Kconfig             |   1 +
 drivers/vfio/Makefile            |   1 +
 drivers/vfio/mdev/Kconfig        |  11 +
 drivers/vfio/mdev/Makefile       |   5 +
 drivers/vfio/mdev/mdev_core.c    | 595 +++++++++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_driver.c  | 138 +++++++++
 drivers/vfio/mdev/mdev_private.h |  33 +++
 drivers/vfio/mdev/mdev_sysfs.c   | 300 ++++++++++++++++++++
 include/linux/mdev.h             | 232 +++++++++++++++
 9 files changed, 1316 insertions(+)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 include/linux/mdev.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index da6e2ce77495..23eced02aaf6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
 
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
+source "drivers/vfio/mdev/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f63fea..7c70753e54ab 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
+obj-$(CONFIG_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
new file mode 100644
index 000000000000..951e2bb06a3f
--- /dev/null
+++ b/drivers/vfio/mdev/Kconfig
@@ -0,0 +1,11 @@
+
+config MDEV
+	tristate "Mediated device driver framework"
+	depends on VFIO
+	default n
+	help
+	  MDEV provides a framework to virtualize devices without SR-IOV
+	  capability. See Documentation/mdev.txt for more details.
+
+	  If you don't know what to do here, say N.
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
new file mode 100644
index 000000000000..2c6d11f7bc24
--- /dev/null
+++ b/drivers/vfio/mdev/Makefile
@@ -0,0 +1,5 @@
+
+mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
+
+obj-$(CONFIG_MDEV) += mdev.o
+
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
new file mode 100644
index 000000000000..3c45ed2ae1e9
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -0,0 +1,595 @@
+/*
+ * Mediated device Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION		"0.1"
+#define DRIVER_AUTHOR		"NVIDIA Corporation"
+#define DRIVER_DESC		"Mediated device Core Driver"
+
+#define MDEV_CLASS_NAME		"mdev"
+
+static struct devices_list {
+	struct list_head    dev_list;
+	struct mutex        list_lock;
+} parent_devices;
+
+static int mdev_add_attribute_group(struct device *dev,
+				    const struct attribute_group **groups)
+{
+	return sysfs_create_groups(&dev->kobj, groups);
+}
+
+static void mdev_remove_attribute_group(struct device *dev,
+					const struct attribute_group **groups)
+{
+	sysfs_remove_groups(&dev->kobj, groups);
+}
+
+static struct mdev_device *find_mdev_device(struct parent_device *parent,
+					    uuid_le uuid, int instance)
+{
+	struct mdev_device *mdev = NULL, *p;
+
+	list_for_each_entry(p, &parent->mdev_list, next) {
+		if ((uuid_le_cmp(p->uuid, uuid) == 0) &&
+		    (p->instance == instance)) {
+			mdev = p;
+			break;
+		}
+	}
+	return mdev;
+}
+
+/* Should be called holding parent_devices.list_lock */
+static struct parent_device *find_parent_device(struct device *dev)
+{
+	struct parent_device *parent = NULL, *p;
+
+	WARN_ON(!mutex_is_locked(&parent_devices.list_lock));
+	list_for_each_entry(p, &parent_devices.dev_list, next) {
+		if (p->dev == dev) {
+			parent = p;
+			break;
+		}
+	}
+	return parent;
+}
+
+static void mdev_release_parent(struct kref *kref)
+{
+	struct parent_device *parent = container_of(kref, struct parent_device,
+						    ref);
+	kfree(parent);
+}
+
+static
+inline struct parent_device *mdev_get_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_get(&parent->ref);
+
+	return parent;
+}
+
+static inline void mdev_put_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_put(&parent->ref, mdev_release_parent);
+}
+
+static struct parent_device *mdev_get_parent_by_dev(struct device *dev)
+{
+	struct parent_device *parent = NULL, *p;
+
+	mutex_lock(&parent_devices.list_lock);
+	list_for_each_entry(p, &parent_devices.dev_list, next) {
+		if (p->dev == dev) {
+			parent = mdev_get_parent(p);
+			break;
+		}
+	}
+	mutex_unlock(&parent_devices.list_lock);
+	return parent;
+}
+
+static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret;
+
+	mutex_lock(&parent->ops_lock);
+	if (parent->ops->create) {
+		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
+					mdev->instance, mdev_params);
+		if (ret)
+			goto create_ops_err;
+	}
+
+	ret = mdev_add_attribute_group(&mdev->dev,
+					parent->ops->mdev_attr_groups);
+create_ops_err:
+	mutex_unlock(&parent->ops_lock);
+	return ret;
+}
+
+static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	/*
+	 * If the vendor driver doesn't return success, it means the vendor
+	 * driver doesn't support hot-unplug
+	 */
+	mutex_lock(&parent->ops_lock);
+	if (parent->ops->destroy) {
+		ret = parent->ops->destroy(parent->dev, mdev->uuid,
+					   mdev->instance);
+		if (ret && !force) {
+			ret = -EBUSY;
+			goto destroy_ops_err;
+		}
+	}
+	mdev_remove_attribute_group(&mdev->dev,
+				    parent->ops->mdev_attr_groups);
+destroy_ops_err:
+	mutex_unlock(&parent->ops_lock);
+
+	return ret;
+}
+
+static void mdev_release_device(struct kref *kref)
+{
+	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
+	struct parent_device *parent = mdev->parent;
+
+	device_unregister(&mdev->dev);
+	wake_up(&parent->release_done);
+	mdev_put_parent(parent);
+}
+
+struct mdev_device *mdev_get_device(struct mdev_device *mdev)
+{
+	if (mdev)
+		kref_get(&mdev->ref);
+
+	return mdev;
+}
+EXPORT_SYMBOL(mdev_get_device);
+
+void mdev_put_device(struct mdev_device *mdev)
+{
+	if (mdev)
+		kref_put(&mdev->ref, mdev_release_device);
+}
+EXPORT_SYMBOL(mdev_put_device);
+
+/*
+ * Find first mediated device from given uuid and increment refcount of
+ * mediated device. Caller should call mdev_put_device() when the use of
+ * mdev_device is done.
+ */
+static struct mdev_device *mdev_get_first_device_by_uuid(uuid_le uuid)
+{
+	struct mdev_device *mdev = NULL, *p;
+	struct parent_device *parent;
+
+	mutex_lock(&parent_devices.list_lock);
+	list_for_each_entry(parent, &parent_devices.dev_list, next) {
+		mutex_lock(&parent->mdev_list_lock);
+		list_for_each_entry(p, &parent->mdev_list, next) {
+			if (uuid_le_cmp(p->uuid, uuid) == 0) {
+				mdev = mdev_get_device(p);
+				break;
+			}
+		}
+		mutex_unlock(&parent->mdev_list_lock);
+
+		if (mdev)
+			break;
+	}
+	mutex_unlock(&parent_devices.list_lock);
+	return mdev;
+}
+
+/*
+ * Find mediated device from given iommu_group and increment refcount of
+ * mediated device. Caller should call mdev_put_device() when the use of
+ * mdev_device is done.
+ */
+struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
+{
+	struct mdev_device *mdev = NULL, *p;
+	struct parent_device *parent;
+
+	mutex_lock(&parent_devices.list_lock);
+	list_for_each_entry(parent, &parent_devices.dev_list, next) {
+		mutex_lock(&parent->mdev_list_lock);
+		list_for_each_entry(p, &parent->mdev_list, next) {
+			if (!p->group)
+				continue;
+
+			if (iommu_group_id(p->group) == iommu_group_id(group)) {
+				mdev = mdev_get_device(p);
+				break;
+			}
+		}
+		mutex_unlock(&parent->mdev_list_lock);
+
+		if (mdev)
+			break;
+	}
+	mutex_unlock(&parent_devices.list_lock);
+	return mdev;
+}
+EXPORT_SYMBOL(mdev_get_device_by_group);
+
+/*
+ * mdev_register_device : Register a device
+ * @dev: device structure representing parent device.
+ * @ops: Parent device operation structure to be registered.
+ *
+ * Add device to list of registered parent devices.
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_device(struct device *dev, const struct parent_ops *ops)
+{
+	int ret = 0;
+	struct parent_device *parent;
+
+	if (!dev || !ops)
+		return -EINVAL;
+
+	mutex_lock(&parent_devices.list_lock);
+
+	/* Check for duplicate */
+	parent = find_parent_device(dev);
+	if (parent) {
+		ret = -EEXIST;
+		goto add_dev_err;
+	}
+
+	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
+	if (!parent) {
+		ret = -ENOMEM;
+		goto add_dev_err;
+	}
+
+	kref_init(&parent->ref);
+	list_add(&parent->next, &parent_devices.dev_list);
+	mutex_unlock(&parent_devices.list_lock);
+
+	parent->dev = dev;
+	parent->ops = ops;
+	mutex_init(&parent->ops_lock);
+	mutex_init(&parent->mdev_list_lock);
+	INIT_LIST_HEAD(&parent->mdev_list);
+	init_waitqueue_head(&parent->release_done);
+
+	ret = mdev_create_sysfs_files(dev);
+	if (ret)
+		goto add_sysfs_error;
+
+	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
+	if (ret)
+		goto add_group_error;
+
+	dev_info(dev, "MDEV: Registered\n");
+	return 0;
+
+add_group_error:
+	mdev_remove_sysfs_files(dev);
+add_sysfs_error:
+	mutex_lock(&parent_devices.list_lock);
+	list_del(&parent->next);
+	mutex_unlock(&parent_devices.list_lock);
+	mdev_put_parent(parent);
+	return ret;
+
+add_dev_err:
+	mutex_unlock(&parent_devices.list_lock);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_register_device);
+
+/*
+ * mdev_unregister_device : Unregister a parent device
+ * @dev: device structure representing parent device.
+ *
+ * Remove device from list of registered parent devices. Give a chance to free
+ * existing mediated devices for given device.
+ */
+
+void mdev_unregister_device(struct device *dev)
+{
+	struct parent_device *parent;
+	struct mdev_device *mdev, *n;
+	int ret;
+
+	mutex_lock(&parent_devices.list_lock);
+	parent = find_parent_device(dev);
+
+	if (!parent) {
+		mutex_unlock(&parent_devices.list_lock);
+		return;
+	}
+	dev_info(dev, "MDEV: Unregistering\n");
+
+	/*
+	 * Remove parent from the list and remove create and destroy sysfs
+	 * files so that no new mediated device could be created for this parent
+	 */
+	list_del(&parent->next);
+	mdev_remove_sysfs_files(dev);
+	mutex_unlock(&parent_devices.list_lock);
+
+	mutex_lock(&parent->ops_lock);
+	mdev_remove_attribute_group(dev,
+				    parent->ops->dev_attr_groups);
+	mutex_unlock(&parent->ops_lock);
+
+	mutex_lock(&parent->mdev_list_lock);
+	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
+		mdev_device_destroy_ops(mdev, true);
+		list_del(&mdev->next);
+		mdev_put_device(mdev);
+	}
+	mutex_unlock(&parent->mdev_list_lock);
+
+	do {
+		ret = wait_event_interruptible_timeout(parent->release_done,
+				list_empty(&parent->mdev_list), HZ * 10);
+		if (ret == -ERESTARTSYS) {
+			dev_warn(dev, "Mediated devices are in use, task"
+				      " \"%s\" (%d) "
+				      "blocked until all are released",
+				      current->comm, task_pid_nr(current));
+		}
+	} while (ret <= 0);
+
+	mdev_put_parent(parent);
+}
+EXPORT_SYMBOL(mdev_unregister_device);
+
+/*
+ * Functions required for mdev-sysfs
+ */
+static void mdev_device_release(struct device *dev)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	dev_dbg(&mdev->dev, "MDEV: destroying\n");
+	kfree(mdev);
+}
+
+int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
+		       char *mdev_params)
+{
+	int ret;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	parent = mdev_get_parent_by_dev(dev);
+	if (!parent)
+		return -EINVAL;
+
+	/* Check for duplicate */
+	mdev = find_mdev_device(parent, uuid, instance);
+	if (mdev) {
+		ret = -EEXIST;
+		goto create_err;
+	}
+
+	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
+	if (!mdev) {
+		ret = -ENOMEM;
+		goto create_err;
+	}
+
+	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
+	mdev->instance = instance;
+	mdev->parent = parent;
+	mutex_init(&mdev->ops_lock);
+	kref_init(&mdev->ref);
+
+	mdev->dev.parent  = dev;
+	mdev->dev.bus     = &mdev_bus_type;
+	mdev->dev.release = mdev_device_release;
+	dev_set_name(&mdev->dev, "%pUb-%d", uuid.b, instance);
+
+	ret = device_register(&mdev->dev);
+	if (ret) {
+		put_device(&mdev->dev);
+		goto create_err;
+	}
+
+	ret = mdev_device_create_ops(mdev, mdev_params);
+	if (ret)
+		goto create_failed;
+
+	mutex_lock(&parent->mdev_list_lock);
+	list_add(&mdev->next, &parent->mdev_list);
+	mutex_unlock(&parent->mdev_list_lock);
+
+	dev_dbg(&mdev->dev, "MDEV: created\n");
+
+	return ret;
+
+create_failed:
+	device_unregister(&mdev->dev);
+
+create_err:
+	mdev_put_parent(parent);
+	return ret;
+}
+
+int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance)
+{
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	int ret;
+
+	parent = mdev_get_parent_by_dev(dev);
+	if (!parent) {
+		ret = -EINVAL;
+		goto destroy_err;
+	}
+
+	mdev = find_mdev_device(parent, uuid, instance);
+	if (!mdev) {
+		ret = -EINVAL;
+		goto destroy_err;
+	}
+
+	ret = mdev_device_destroy_ops(mdev, false);
+	if (ret)
+		goto destroy_err;
+
+	mdev_put_parent(parent);
+
+	mutex_lock(&parent->mdev_list_lock);
+	list_del(&mdev->next);
+	mutex_unlock(&parent->mdev_list_lock);
+
+	mdev_put_device(mdev);
+	return ret;
+
+destroy_err:
+	mdev_put_parent(parent);
+	return ret;
+}
+
+void mdev_device_supported_config(struct device *dev, char *str)
+{
+	struct parent_device *parent;
+
+	parent = mdev_get_parent_by_dev(dev);
+
+	if (parent) {
+		mutex_lock(&parent->ops_lock);
+		if (parent->ops->supported_config)
+			parent->ops->supported_config(parent->dev, str);
+		mutex_unlock(&parent->ops_lock);
+		mdev_put_parent(parent);
+	}
+}
+
+int mdev_device_start(uuid_le uuid)
+{
+	int ret = 0;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	mdev = mdev_get_first_device_by_uuid(uuid);
+	if (!mdev)
+		return -EINVAL;
+
+	parent = mdev->parent;
+
+	mutex_lock(&parent->ops_lock);
+	if (parent->ops->start)
+		ret = parent->ops->start(mdev->uuid);
+	mutex_unlock(&parent->ops_lock);
+
+	if (ret)
+		pr_err("mdev_start failed %d\n", ret);
+	else
+		kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
+
+	mdev_put_device(mdev);
+
+	return ret;
+}
+
+int mdev_device_shutdown(uuid_le uuid)
+{
+	int ret = 0;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	mdev = mdev_get_first_device_by_uuid(uuid);
+	if (!mdev)
+		return -EINVAL;
+
+	parent = mdev->parent;
+
+	mutex_lock(&parent->ops_lock);
+	if (parent->ops->shutdown)
+		ret = parent->ops->shutdown(mdev->uuid);
+	mutex_unlock(&parent->ops_lock);
+
+	if (ret)
+		pr_err("mdev_shutdown failed %d\n", ret);
+	else
+		kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
+
+	mdev_put_device(mdev);
+	return ret;
+}
+
+static struct class mdev_class = {
+	.name		= MDEV_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= mdev_class_attrs,
+};
+
+static int __init mdev_init(void)
+{
+	int ret;
+
+	mutex_init(&parent_devices.list_lock);
+	INIT_LIST_HEAD(&parent_devices.dev_list);
+
+	ret = class_register(&mdev_class);
+	if (ret) {
+		pr_err("Failed to register mdev class\n");
+		return ret;
+	}
+
+	ret = mdev_bus_register();
+	if (ret) {
+		pr_err("Failed to register mdev bus\n");
+		class_unregister(&mdev_class);
+		return ret;
+	}
+
+	return ret;
+}
+
+static void __exit mdev_exit(void)
+{
+	mdev_bus_unregister();
+	class_unregister(&mdev_class);
+}
+
+module_init(mdev_init)
+module_exit(mdev_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
new file mode 100644
index 000000000000..f1aed541111d
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_driver.c
@@ -0,0 +1,138 @@
+/*
+ * MDEV driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+static int mdev_attach_iommu(struct mdev_device *mdev)
+{
+	int ret;
+	struct iommu_group *group;
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
+		return PTR_ERR(group);
+	}
+
+	ret = iommu_group_add_device(group, &mdev->dev);
+	if (ret) {
+		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
+		goto attach_fail;
+	}
+
+	mdev->group = group;
+
+	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
+				 iommu_group_id(group));
+attach_fail:
+	iommu_group_put(group);
+	return ret;
+}
+
+static void mdev_detach_iommu(struct mdev_device *mdev)
+{
+	iommu_group_remove_device(&mdev->dev);
+	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
+}
+
+static int mdev_probe(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	ret = mdev_attach_iommu(mdev);
+	if (ret) {
+		dev_err(dev, "Failed to attach IOMMU\n");
+		return ret;
+	}
+
+	if (drv && drv->probe)
+		ret = drv->probe(dev);
+
+	return ret;
+}
+
+static int mdev_remove(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	if (drv && drv->remove)
+		drv->remove(dev);
+
+	mdev_detach_iommu(mdev);
+
+	return 0;
+}
+
+static int mdev_match(struct device *dev, struct device_driver *drv)
+{
+	struct mdev_driver *mdrv = to_mdev_driver(drv);
+
+	if (mdrv && mdrv->match)
+		return mdrv->match(dev);
+
+	return 0;
+}
+
+struct bus_type mdev_bus_type = {
+	.name		= "mdev",
+	.match		= mdev_match,
+	.probe		= mdev_probe,
+	.remove		= mdev_remove,
+};
+EXPORT_SYMBOL_GPL(mdev_bus_type);
+
+/*
+ * mdev_register_driver - register a new MDEV driver
+ * @drv: the driver to register
+ * @owner: module owner of driver to be registered
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &mdev_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_register_driver);
+
+/*
+ * mdev_unregister_driver - unregister MDEV driver
+ * @drv: the driver to unregister
+ *
+ */
+void mdev_unregister_driver(struct mdev_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_unregister_driver);
+
+int mdev_bus_register(void)
+{
+	return bus_register(&mdev_bus_type);
+}
+
+void mdev_bus_unregister(void)
+{
+	bus_unregister(&mdev_bus_type);
+}
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
new file mode 100644
index 000000000000..991d7f796169
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -0,0 +1,33 @@
+/*
+ * Mediated device internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_PRIVATE_H
+#define MDEV_PRIVATE_H
+
+int  mdev_bus_register(void);
+void mdev_bus_unregister(void);
+
+/* Function prototypes for mdev_sysfs */
+
+extern struct class_attribute mdev_class_attrs[];
+
+int  mdev_create_sysfs_files(struct device *dev);
+void mdev_remove_sysfs_files(struct device *dev);
+
+int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
+			char *mdev_params);
+int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
+void mdev_device_supported_config(struct device *dev, char *str);
+int  mdev_device_start(uuid_le uuid);
+int  mdev_device_shutdown(uuid_le uuid);
+
+#endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
new file mode 100644
index 000000000000..48b66e40009e
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -0,0 +1,300 @@
+/*
+ * File attributes for Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+/* Prototypes */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static DEVICE_ATTR_RO(mdev_supported_types);
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_create);
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_destroy);
+
+/* Static functions */
+
+#define UUID_CHAR_LENGTH	36
+#define UUID_BYTE_LENGTH	16
+
+#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
+
+static inline bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < UUID_CHAR_LENGTH)
+		return -EINVAL;
+
+	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			pr_err("%s: invalid hex digit in UUID\n", __func__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+/* mdev sysfs Functions */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	char *str, *ptr;
+	ssize_t n;
+
+	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	ptr = str;
+	mdev_device_supported_config(dev, str);
+
+	n = sprintf(buf, "%s\n", str);
+	kfree(ptr);
+
+	return n;
+}
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	char *str, *pstr;
+	char *uuid_str, *instance_str, *mdev_params = NULL, *params = NULL;
+	uuid_le uuid;
+	uint32_t instance;
+	int ret;
+
+	pstr = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_create: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (!str) {
+		pr_err("mdev_create: mdev instance not present %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	instance_str = strsep(&str, ":");
+	if (!instance_str) {
+		pr_err("mdev_create: Empty instance string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	ret = kstrtouint(instance_str, 0, &instance);
+	if (ret) {
+		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
+		goto create_error;
+	}
+
+	if (str)
+		params = mdev_params = kstrdup(str, GFP_KERNEL);
+
+	ret = uuid_parse(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_create: UUID parse error %s\n", buf);
+		goto create_error;
+	}
+
+	ret = mdev_device_create(dev, uuid, instance, mdev_params);
+	if (ret)
+		pr_err("mdev_create: Failed to create mdev device\n");
+	else
+		ret = count;
+
+create_error:
+	kfree(params);
+	kfree(pstr);
+	return ret;
+}
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	char *uuid_str, *str, *pstr;
+	uuid_le uuid;
+	unsigned int instance;
+	int ret;
+
+	str = pstr = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	if (!str) {
+		pr_err("mdev_destroy: instance not specified %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	ret = kstrtouint(str, 0, &instance);
+	if (ret) {
+		pr_err("mdev_destroy: instance parsing error %s\n", buf);
+		goto destroy_error;
+	}
+
+	ret = uuid_parse(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_destroy: UUID parse error %s\n", buf);
+		goto destroy_error;
+	}
+
+	ret = mdev_device_destroy(dev, uuid, instance);
+	if (ret == 0)
+		ret = count;
+
+destroy_error:
+	kfree(pstr);
+	return ret;
+}
+
+ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
+			 const char *buf, size_t count)
+{
+	char *uuid_str, *ptr;
+	uuid_le uuid;
+	int ret;
+
+	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_parse(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_start: UUID parse error %s\n", buf);
+		goto start_error;
+	}
+
+	ret = mdev_device_start(uuid);
+	if (ret == 0)
+		ret = count;
+
+start_error:
+	kfree(ptr);
+	return ret;
+}
+
+ssize_t mdev_shutdown_store(struct class *class, struct class_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *uuid_str, *ptr;
+	uuid_le uuid;
+	int ret;
+
+	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_parse(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_shutdown: UUID parse error %s\n", buf);
+		goto shutdown_error;
+	}
+
+	ret = mdev_device_shutdown(uuid);
+	if (ret == 0)
+		ret = count;
+
+shutdown_error:
+	kfree(ptr);
+	return ret;
+}
+
+struct class_attribute mdev_class_attrs[] = {
+	__ATTR_WO(mdev_start),
+	__ATTR_WO(mdev_shutdown),
+	__ATTR_NULL
+};
+
+int mdev_create_sysfs_files(struct device *dev)
+{
+	int ret;
+
+	ret = sysfs_create_file(&dev->kobj,
+				&dev_attr_mdev_supported_types.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_supported_types sysfs entry\n");
+		return ret;
+	}
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_create sysfs entry\n");
+		goto create_sysfs_failed;
+	}
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_destroy sysfs entry\n");
+		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+		goto create_sysfs_failed;
+	}
+
+	return 0;
+
+create_sysfs_failed:
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	return ret;
+}
+
+void mdev_remove_sysfs_files(struct device *dev)
+{
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+}
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
new file mode 100644
index 000000000000..31b6f8572cfa
--- /dev/null
+++ b/include/linux/mdev.h
@@ -0,0 +1,232 @@
+/*
+ * Mediated device definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_H
+#define MDEV_H
+
+/* Common Data structures */
+
+struct pci_region_info {
+	uint64_t start;
+	uint64_t size;
+	uint32_t flags;		/* VFIO region info flags */
+};
+
+enum mdev_emul_space {
+	EMUL_CONFIG_SPACE,	/* PCI configuration space */
+	EMUL_IO,		/* I/O register space */
+	EMUL_MMIO		/* Memory-mapped I/O space */
+};
+
+struct parent_device;
+
+/*
+ * Mediated device
+ */
+
+struct mdev_device {
+	struct device		dev;
+	struct parent_device	*parent;
+	struct iommu_group	*group;
+	void			*iommu_data;
+	uuid_le			uuid;
+	uint32_t		instance;
+
+	/* internal only */
+	struct kref		ref;
+	struct mutex		ops_lock;
+	struct list_head	next;
+};
+
+
+/**
+ * struct parent_ops - Structure to be registered for each parent device to
+ * register the device to mdev module.
+ *
+ * @owner:		The module owner.
+ * @dev_attr_groups:	Default attributes of the parent device.
+ * @mdev_attr_groups:	Default attributes of the mediated device.
+ * @supported_config:	Called to get information about supported types.
+ *			@dev : device structure of parent device.
+ *			@config: should return string listing supported config
+ *			Returns integer: success (0) or error (< 0)
+ * @create:		Called to allocate basic resources in parent device's
+ *			driver for a particular mediated device
+ *			@dev: parent device structure on which mediated device
+ *			      should be created
+ *			@uuid: UUID of the VM the mediated device is for
+ *			@instance: mediated device instance in that VM
+ *			@mdev_params: extra parameters required by parent
+ *			device's driver.
+ *			Returns integer: success (0) or error (< 0)
+ * @destroy:		Called to free resources in the parent device's driver
+ *			for a mediated device instance of that VM.
+ *			@dev: parent device structure to which this mediated
+ *			      device belongs.
+ *			@uuid: UUID of the VM to which the mediated device belongs
+ *			@instance: mdev instance in that VM
+ *			Returns integer: success (0) or error (< 0)
+ *			If the VM is running and destroy() is called, the mdev
+ *			is being hot-unplugged. Return an error if the VM is
+ *			running and the driver doesn't support hotplug.
+ * @start:		Called to initiate the mediated device initialization
+ *			process in the parent device's driver when the VM
+ *			boots, before the VMM starts.
+ *			@uuid: VM's UUID which is booting.
+ *			Returns integer: success (0) or error (< 0)
+ * @shutdown:		Called to tear down mediated device resources for
+ *			the VM.
+ *			@uuid: UUID of the VM that is shutting down.
+ *			Returns integer: success (0) or error (< 0)
+ * @read:		Read emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: read buffer
+ *			@count: number of bytes to read
+ *			@address_space: specifies the address space the request
+ *			is intended for: PCI config space, I/O register space
+ *			or MMIO space.
+ *			@addr: address.
+ *			Returns number of bytes read on success, or an error.
+ * @write:		Write emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: write buffer
+ *			@count: number of bytes to be written
+ *			@address_space: specifies the address space the request
+ *			is intended for: PCI config space, I/O register space
+ *			or MMIO space.
+ *			@addr: address.
+ *			Returns number of bytes written on success, or an error.
+ * @set_irqs:		Called to convey the interrupt configuration
+ *			information set by the VMM.
+ *			@mdev: mediated device structure
+ *			@flags, index, start, count and *data : same as that of
+ *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
+ * @get_region_info:	Called to get VFIO region size and flags of mediated
+ *			device.
+ *			@mdev: mediated device structure
+ *			@region_index: VFIO region index
+ *			@region_info: output, returns size and flags of
+ *				      requested region.
+ *			Returns integer: success (0) or error (< 0)
+ * @validate_map_request: Validate remap pfn request
+ *			@mdev: mediated device structure
+ *			@virtaddr: target user address to start at
+ *			@pfn: parent address of kernel memory, vendor driver
+ *			      can change if required.
+ *			@size: size of map area, vendor driver can change the
+ *			       size of map area if desired.
+ *			@prot: page protection flags for this mapping, vendor
+ *			       driver can change, if required.
+ *			Returns integer: success (0) or error (< 0)
+ *
+ * A parent device that supports mediated devices should be registered with the
+ * mdev module along with a parent_ops structure.
+ */
+
+struct parent_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **mdev_attr_groups;
+
+	int	(*supported_config)(struct device *dev, char *config);
+	int     (*create)(struct device *dev, uuid_le uuid,
+			  uint32_t instance, char *mdev_params);
+	int     (*destroy)(struct device *dev, uuid_le uuid,
+			   uint32_t instance);
+	int     (*start)(uuid_le uuid);
+	int     (*shutdown)(uuid_le uuid);
+	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
+			enum mdev_emul_space address_space, loff_t pos);
+	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
+			 enum mdev_emul_space address_space, loff_t pos);
+	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
+			    unsigned int index, unsigned int start,
+			    unsigned int count, void *data);
+	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
+				 struct pci_region_info *region_info);
+	int	(*validate_map_request)(struct mdev_device *vdev,
+					unsigned long virtaddr,
+					unsigned long *pfn, unsigned long *size,
+					pgprot_t *prot);
+};
+
+/*
+ * Parent Device
+ */
+struct parent_device {
+	struct device		*dev;
+	const struct parent_ops	*ops;
+
+	/* internal */
+	struct kref		ref;
+	struct mutex		ops_lock;
+	struct list_head	next;
+	struct list_head	mdev_list;
+	struct mutex		mdev_list_lock;
+	wait_queue_head_t	release_done;
+};
+
+/**
+ * struct mdev_driver - Mediated device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @match: called when new device or driver is added for this bus. Return 1 if
+ *	   given device can be handled by given driver and zero otherwise.
+ * @driver: device driver structure
+ *
+ */
+struct mdev_driver {
+	const char *name;
+	int  (*probe)(struct device *dev);
+	void (*remove)(struct device *dev);
+	int  (*match)(struct device *dev);
+	struct device_driver driver;
+};
+
+static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
+}
+
+static inline struct mdev_device *to_mdev_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
+}
+
+static inline void *mdev_get_drvdata(struct mdev_device *mdev)
+{
+	return dev_get_drvdata(&mdev->dev);
+}
+
+static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
+{
+	dev_set_drvdata(&mdev->dev, data);
+}
+
+extern struct bus_type mdev_bus_type;
+
+#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
+
+extern int  mdev_register_device(struct device *dev,
+				 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
+extern void mdev_put_device(struct mdev_device *mdev);
+
+extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
+
+#endif /* MDEV_H */
-- 
2.7.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [Qemu-devel] [PATCH 1/3] Mediated device Core driver
@ 2016-06-20 16:31   ` Kirti Wankhede
  0 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-20 16:31 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	bjsdjshi, Kirti Wankhede

Design for Mediated Device Driver:
Main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different
devices.

This module provides a generic interface to create the device, add it to
mediated bus, add device to IOMMU group and then add it to vfio group.

Below is the high-level block diagram, with NVIDIA, Intel and IBM devices
as examples, since these are the devices that are going to actively use
this module for now.

 +---------------+
 |               |
 | +-----------+ |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |           | |                         |              |
 | |  mdev     | +------------------------>+              |<-> VFIO user
 | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
 | |  driver   | |                         |              |
 | |           | |                         +--------------+
 | |           | |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |           | |                         |              |
 | |           | +------------------------>+              |<-> VFIO user
 | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
 |               |                         |              |
 |  MDEV CORE    |                         +--------------+
 |   MODULE      |
 |   mdev.ko     |
 | +-----------+ |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         |  nvidia.ko   |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--------------+
 | | interface | |<------------------------+              |
 | |           | |                         |  i915.ko     |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | |           | |
 | |           | |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | +-----------+ |
 +---------------+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when new device created
  * @remove:called when device removed
  * @match: called when new device or driver is added for this bus.
	    Return 1 if given device can be handled by given driver and
	    zero otherwise.
  * @driver:device driver structure
  *
  */
struct mdev_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
	 int  (*match)(struct device *dev);
         struct device_driver    driver;
};

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

A mediated device driver should use this interface to register with the
core driver. The mediated device driver is then responsible for adding
the mediated device to the VFIO group.

2. Physical device driver interface
This interface provides vendor drivers a set of APIs to manage physical
device related work in their own drivers. The APIs are:
- supported_config: provide supported configuration list by the vendor
		    driver
- create: to allocate basic resources in vendor driver for a mediated
	  device.
- destroy: to free resources in vendor driver when mediated device is
	   destroyed.
- start: to initiate mediated device initialization process from vendor
	 driver when VM boots and before QEMU starts.
- shutdown: to tear down mediated device resources during VM teardown.
- read: read emulation callback.
- write: write emulation callback.
- set_irqs: send interrupt configuration information that QEMU sets.
- get_region_info: to provide region size and its flags for the mediated
		   device.
- validate_map_request: to validate remap pfn request.

This registration interface should be used by vendor drivers to register
each physical device to mdev core driver.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I73a5084574270b14541c529461ea2f03c292d510
---
 drivers/vfio/Kconfig             |   1 +
 drivers/vfio/Makefile            |   1 +
 drivers/vfio/mdev/Kconfig        |  11 +
 drivers/vfio/mdev/Makefile       |   5 +
 drivers/vfio/mdev/mdev_core.c    | 595 +++++++++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_driver.c  | 138 +++++++++
 drivers/vfio/mdev/mdev_private.h |  33 +++
 drivers/vfio/mdev/mdev_sysfs.c   | 300 ++++++++++++++++++++
 include/linux/mdev.h             | 232 +++++++++++++++
 9 files changed, 1316 insertions(+)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 include/linux/mdev.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index da6e2ce77495..23eced02aaf6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
 
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
+source "drivers/vfio/mdev/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f63fea..7c70753e54ab 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
+obj-$(CONFIG_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
new file mode 100644
index 000000000000..951e2bb06a3f
--- /dev/null
+++ b/drivers/vfio/mdev/Kconfig
@@ -0,0 +1,11 @@
+
+config MDEV
+    tristate "Mediated device driver framework"
+    depends on VFIO
+    default n
+    help
+        MDEV provides a framework to virtualize devices without SR-IOV
+        capability. See Documentation/mdev.txt for more details.
+
+        If you don't know what to do here, say N.
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
new file mode 100644
index 000000000000..2c6d11f7bc24
--- /dev/null
+++ b/drivers/vfio/mdev/Makefile
@@ -0,0 +1,5 @@
+
+mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
+
+obj-$(CONFIG_MDEV) += mdev.o
+
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
new file mode 100644
index 000000000000..3c45ed2ae1e9
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -0,0 +1,595 @@
+/*
+ * Mediated device Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION		"0.1"
+#define DRIVER_AUTHOR		"NVIDIA Corporation"
+#define DRIVER_DESC		"Mediated device Core Driver"
+
+#define MDEV_CLASS_NAME		"mdev"
+
+static struct devices_list {
+	struct list_head    dev_list;
+	struct mutex        list_lock;
+} parent_devices;
+
+static int mdev_add_attribute_group(struct device *dev,
+				    const struct attribute_group **groups)
+{
+	return sysfs_create_groups(&dev->kobj, groups);
+}
+
+static void mdev_remove_attribute_group(struct device *dev,
+					const struct attribute_group **groups)
+{
+	sysfs_remove_groups(&dev->kobj, groups);
+}
+
+static struct mdev_device *find_mdev_device(struct parent_device *parent,
+					    uuid_le uuid, int instance)
+{
+	struct mdev_device *mdev = NULL, *p;
+
+	list_for_each_entry(p, &parent->mdev_list, next) {
+		if ((uuid_le_cmp(p->uuid, uuid) == 0) &&
+		    (p->instance == instance)) {
+			mdev = p;
+			break;
+		}
+	}
+	return mdev;
+}
+
+/* Should be called holding parent_devices.list_lock */
+static struct parent_device *find_parent_device(struct device *dev)
+{
+	struct parent_device *parent = NULL, *p;
+
+	WARN_ON(!mutex_is_locked(&parent_devices.list_lock));
+	list_for_each_entry(p, &parent_devices.dev_list, next) {
+		if (p->dev == dev) {
+			parent = p;
+			break;
+		}
+	}
+	return parent;
+}
+
+static void mdev_release_parent(struct kref *kref)
+{
+	struct parent_device *parent = container_of(kref, struct parent_device,
+						    ref);
+	kfree(parent);
+}
+
+static inline
+struct parent_device *mdev_get_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_get(&parent->ref);
+
+	return parent;
+}
+
+static inline void mdev_put_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_put(&parent->ref, mdev_release_parent);
+}
+
+static struct parent_device *mdev_get_parent_by_dev(struct device *dev)
+{
+	struct parent_device *parent = NULL, *p;
+
+	mutex_lock(&parent_devices.list_lock);
+	list_for_each_entry(p, &parent_devices.dev_list, next) {
+		if (p->dev == dev) {
+			parent = mdev_get_parent(p);
+			break;
+		}
+	}
+	mutex_unlock(&parent_devices.list_lock);
+	return parent;
+}
+
+static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret;
+
+	mutex_lock(&parent->ops_lock);
+	if (parent->ops->create) {
+		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
+					mdev->instance, mdev_params);
+		if (ret)
+			goto create_ops_err;
+	}
+
+	ret = mdev_add_attribute_group(&mdev->dev,
+					parent->ops->mdev_attr_groups);
+create_ops_err:
+	mutex_unlock(&parent->ops_lock);
+	return ret;
+}
+
+static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	/*
+	 * If the vendor driver doesn't return success, the vendor driver
+	 * doesn't support hot-unplug
+	 */
+	mutex_lock(&parent->ops_lock);
+	if (parent->ops->destroy) {
+		ret = parent->ops->destroy(parent->dev, mdev->uuid,
+					   mdev->instance);
+		if (ret && !force) {
+			ret = -EBUSY;
+			goto destroy_ops_err;
+		}
+	}
+	mdev_remove_attribute_group(&mdev->dev,
+				    parent->ops->mdev_attr_groups);
+destroy_ops_err:
+	mutex_unlock(&parent->ops_lock);
+
+	return ret;
+}
+
+static void mdev_release_device(struct kref *kref)
+{
+	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
+	struct parent_device *parent = mdev->parent;
+
+	device_unregister(&mdev->dev);
+	wake_up(&parent->release_done);
+	mdev_put_parent(parent);
+}
+
+struct mdev_device *mdev_get_device(struct mdev_device *mdev)
+{
+	if (mdev)
+		kref_get(&mdev->ref);
+
+	return mdev;
+}
+EXPORT_SYMBOL(mdev_get_device);
+
+void mdev_put_device(struct mdev_device *mdev)
+{
+	if (mdev)
+		kref_put(&mdev->ref, mdev_release_device);
+}
+EXPORT_SYMBOL(mdev_put_device);
+
+/*
+ * Find first mediated device from given uuid and increment refcount of
+ * mediated device. Caller should call mdev_put_device() when the use of
+ * mdev_device is done.
+ */
+static struct mdev_device *mdev_get_first_device_by_uuid(uuid_le uuid)
+{
+	struct mdev_device *mdev = NULL, *p;
+	struct parent_device *parent;
+
+	mutex_lock(&parent_devices.list_lock);
+	list_for_each_entry(parent, &parent_devices.dev_list, next) {
+		mutex_lock(&parent->mdev_list_lock);
+		list_for_each_entry(p, &parent->mdev_list, next) {
+			if (uuid_le_cmp(p->uuid, uuid) == 0) {
+				mdev = mdev_get_device(p);
+				break;
+			}
+		}
+		mutex_unlock(&parent->mdev_list_lock);
+
+		if (mdev)
+			break;
+	}
+	mutex_unlock(&parent_devices.list_lock);
+	return mdev;
+}
+
+/*
+ * Find mediated device from given iommu_group and increment refcount of
+ * mediated device. Caller should call mdev_put_device() when the use of
+ * mdev_device is done.
+ */
+struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
+{
+	struct mdev_device *mdev = NULL, *p;
+	struct parent_device *parent;
+
+	mutex_lock(&parent_devices.list_lock);
+	list_for_each_entry(parent, &parent_devices.dev_list, next) {
+		mutex_lock(&parent->mdev_list_lock);
+		list_for_each_entry(p, &parent->mdev_list, next) {
+			if (!p->group)
+				continue;
+
+			if (iommu_group_id(p->group) == iommu_group_id(group)) {
+				mdev = mdev_get_device(p);
+				break;
+			}
+		}
+		mutex_unlock(&parent->mdev_list_lock);
+
+		if (mdev)
+			break;
+	}
+	mutex_unlock(&parent_devices.list_lock);
+	return mdev;
+}
+EXPORT_SYMBOL(mdev_get_device_by_group);
+
+/*
+ * mdev_register_device : Register a device
+ * @dev: device structure representing parent device.
+ * @ops: Parent device operation structure to be registered.
+ *
+ * Add device to list of registered parent devices.
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_device(struct device *dev, const struct parent_ops *ops)
+{
+	int ret = 0;
+	struct parent_device *parent;
+
+	if (!dev || !ops)
+		return -EINVAL;
+
+	mutex_lock(&parent_devices.list_lock);
+
+	/* Check for duplicate */
+	parent = find_parent_device(dev);
+	if (parent) {
+		ret = -EEXIST;
+		goto add_dev_err;
+	}
+
+	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
+	if (!parent) {
+		ret = -ENOMEM;
+		goto add_dev_err;
+	}
+
+	kref_init(&parent->ref);
+	parent->dev = dev;
+	parent->ops = ops;
+	mutex_init(&parent->ops_lock);
+	mutex_init(&parent->mdev_list_lock);
+	INIT_LIST_HEAD(&parent->mdev_list);
+	init_waitqueue_head(&parent->release_done);
+
+	list_add(&parent->next, &parent_devices.dev_list);
+	mutex_unlock(&parent_devices.list_lock);
+
+	ret = mdev_create_sysfs_files(dev);
+	if (ret)
+		goto add_sysfs_error;
+
+	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
+	if (ret)
+		goto add_group_error;
+
+	dev_info(dev, "MDEV: Registered\n");
+	return 0;
+
+add_group_error:
+	mdev_remove_sysfs_files(dev);
+add_sysfs_error:
+	mutex_lock(&parent_devices.list_lock);
+	list_del(&parent->next);
+	mutex_unlock(&parent_devices.list_lock);
+	mdev_put_parent(parent);
+	return ret;
+
+add_dev_err:
+	mutex_unlock(&parent_devices.list_lock);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_register_device);
+
+/*
+ * mdev_unregister_device : Unregister a parent device
+ * @dev: device structure representing parent device.
+ *
+ * Remove device from list of registered parent devices. Give a chance to free
+ * existing mediated devices for given device.
+ */
+
+void mdev_unregister_device(struct device *dev)
+{
+	struct parent_device *parent;
+	struct mdev_device *mdev, *n;
+	int ret;
+
+	mutex_lock(&parent_devices.list_lock);
+	parent = find_parent_device(dev);
+
+	if (!parent) {
+		mutex_unlock(&parent_devices.list_lock);
+		return;
+	}
+	dev_info(dev, "MDEV: Unregistering\n");
+
+	/*
+	 * Remove parent from the list and remove create and destroy sysfs
+	 * files so that no new mediated device could be created for this parent
+	 */
+	list_del(&parent->next);
+	mdev_remove_sysfs_files(dev);
+	mutex_unlock(&parent_devices.list_lock);
+
+	mutex_lock(&parent->ops_lock);
+	mdev_remove_attribute_group(dev,
+				    parent->ops->dev_attr_groups);
+	mutex_unlock(&parent->ops_lock);
+
+	mutex_lock(&parent->mdev_list_lock);
+	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
+		mdev_device_destroy_ops(mdev, true);
+		list_del(&mdev->next);
+		mdev_put_device(mdev);
+	}
+	mutex_unlock(&parent->mdev_list_lock);
+
+	do {
+		ret = wait_event_interruptible_timeout(parent->release_done,
+				list_empty(&parent->mdev_list), HZ * 10);
+		if (ret == -ERESTARTSYS) {
+			dev_warn(dev,
+				 "Mediated devices are in use, task \"%s\" (%d) blocked until all are released\n",
+				 current->comm, task_pid_nr(current));
+		}
+	} while (ret <= 0);
+
+	mdev_put_parent(parent);
+}
+EXPORT_SYMBOL(mdev_unregister_device);
+
+/*
+ * Functions required for mdev-sysfs
+ */
+static void mdev_device_release(struct device *dev)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	dev_dbg(&mdev->dev, "MDEV: destroying\n");
+	kfree(mdev);
+}
+
+int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
+		       char *mdev_params)
+{
+	int ret;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	parent = mdev_get_parent_by_dev(dev);
+	if (!parent)
+		return -EINVAL;
+
+	/* Check for duplicate */
+	mdev = find_mdev_device(parent, uuid, instance);
+	if (mdev) {
+		ret = -EEXIST;
+		goto create_err;
+	}
+
+	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
+	if (!mdev) {
+		ret = -ENOMEM;
+		goto create_err;
+	}
+
+	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
+	mdev->instance = instance;
+	mdev->parent = parent;
+	mutex_init(&mdev->ops_lock);
+	kref_init(&mdev->ref);
+
+	mdev->dev.parent  = dev;
+	mdev->dev.bus     = &mdev_bus_type;
+	mdev->dev.release = mdev_device_release;
+	dev_set_name(&mdev->dev, "%pUb-%d", uuid.b, instance);
+
+	ret = device_register(&mdev->dev);
+	if (ret) {
+		put_device(&mdev->dev);
+		goto create_err;
+	}
+
+	ret = mdev_device_create_ops(mdev, mdev_params);
+	if (ret)
+		goto create_failed;
+
+	mutex_lock(&parent->mdev_list_lock);
+	list_add(&mdev->next, &parent->mdev_list);
+	mutex_unlock(&parent->mdev_list_lock);
+
+	dev_dbg(&mdev->dev, "MDEV: created\n");
+
+	return ret;
+
+create_failed:
+	device_unregister(&mdev->dev);
+
+create_err:
+	mdev_put_parent(parent);
+	return ret;
+}
+
+int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance)
+{
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	int ret;
+
+	parent = mdev_get_parent_by_dev(dev);
+	if (!parent)
+		return -EINVAL;
+
+	mdev = find_mdev_device(parent, uuid, instance);
+	if (!mdev) {
+		ret = -EINVAL;
+		goto destroy_err;
+	}
+
+	ret = mdev_device_destroy_ops(mdev, false);
+	if (ret)
+		goto destroy_err;
+
+	mutex_lock(&parent->mdev_list_lock);
+	list_del(&mdev->next);
+	mutex_unlock(&parent->mdev_list_lock);
+
+	mdev_put_device(mdev);
+
+	/* Drop the parent reference only after the mdev is off its list */
+	mdev_put_parent(parent);
+	return ret;
+
+destroy_err:
+	mdev_put_parent(parent);
+	return ret;
+}
+
+void mdev_device_supported_config(struct device *dev, char *str)
+{
+	struct parent_device *parent;
+
+	parent = mdev_get_parent_by_dev(dev);
+
+	if (parent) {
+		mutex_lock(&parent->ops_lock);
+		if (parent->ops->supported_config)
+			parent->ops->supported_config(parent->dev, str);
+		mutex_unlock(&parent->ops_lock);
+		mdev_put_parent(parent);
+	}
+}
+
+int mdev_device_start(uuid_le uuid)
+{
+	int ret = 0;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	mdev = mdev_get_first_device_by_uuid(uuid);
+	if (!mdev)
+		return -EINVAL;
+
+	parent = mdev->parent;
+
+	mutex_lock(&parent->ops_lock);
+	if (parent->ops->start)
+		ret = parent->ops->start(mdev->uuid);
+	mutex_unlock(&parent->ops_lock);
+
+	if (ret)
+		pr_err("mdev_start failed: %d\n", ret);
+	else
+		kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
+
+	mdev_put_device(mdev);
+
+	return ret;
+}
+
+int mdev_device_shutdown(uuid_le uuid)
+{
+	int ret = 0;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+
+	mdev = mdev_get_first_device_by_uuid(uuid);
+	if (!mdev)
+		return -EINVAL;
+
+	parent = mdev->parent;
+
+	mutex_lock(&parent->ops_lock);
+	if (parent->ops->shutdown)
+		ret = parent->ops->shutdown(mdev->uuid);
+	mutex_unlock(&parent->ops_lock);
+
+	if (ret)
+		pr_err("mdev_shutdown failed: %d\n", ret);
+	else
+		kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
+
+	mdev_put_device(mdev);
+	return ret;
+}
+
+static struct class mdev_class = {
+	.name		= MDEV_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= mdev_class_attrs,
+};
+
+static int __init mdev_init(void)
+{
+	int ret;
+
+	mutex_init(&parent_devices.list_lock);
+	INIT_LIST_HEAD(&parent_devices.dev_list);
+
+	ret = class_register(&mdev_class);
+	if (ret) {
+		pr_err("Failed to register mdev class\n");
+		return ret;
+	}
+
+	ret = mdev_bus_register();
+	if (ret) {
+		pr_err("Failed to register mdev bus\n");
+		class_unregister(&mdev_class);
+		return ret;
+	}
+
+	return ret;
+}
+
+static void __exit mdev_exit(void)
+{
+	mdev_bus_unregister();
+	class_unregister(&mdev_class);
+}
+
+module_init(mdev_init);
+module_exit(mdev_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
new file mode 100644
index 000000000000..f1aed541111d
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_driver.c
@@ -0,0 +1,138 @@
+/*
+ * MDEV driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+static int mdev_attach_iommu(struct mdev_device *mdev)
+{
+	int ret;
+	struct iommu_group *group;
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
+		return PTR_ERR(group);
+	}
+
+	ret = iommu_group_add_device(group, &mdev->dev);
+	if (ret) {
+		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
+		goto attach_fail;
+	}
+
+	mdev->group = group;
+
+	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
+				 iommu_group_id(group));
+attach_fail:
+	iommu_group_put(group);
+	return ret;
+}
+
+static void mdev_detach_iommu(struct mdev_device *mdev)
+{
+	iommu_group_remove_device(&mdev->dev);
+	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
+}
+
+static int mdev_probe(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	ret = mdev_attach_iommu(mdev);
+	if (ret) {
+		dev_err(dev, "Failed to attach IOMMU\n");
+		return ret;
+	}
+
+	if (drv && drv->probe)
+		ret = drv->probe(dev);
+
+	/* Undo the IOMMU attach if the driver's probe failed */
+	if (ret)
+		mdev_detach_iommu(mdev);
+
+	return ret;
+}
+
+static int mdev_remove(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	if (drv && drv->remove)
+		drv->remove(dev);
+
+	mdev_detach_iommu(mdev);
+
+	return 0;
+}
+
+static int mdev_match(struct device *dev, struct device_driver *drv)
+{
+	struct mdev_driver *mdrv = to_mdev_driver(drv);
+
+	if (mdrv && mdrv->match)
+		return mdrv->match(dev);
+
+	return 0;
+}
+
+struct bus_type mdev_bus_type = {
+	.name		= "mdev",
+	.match		= mdev_match,
+	.probe		= mdev_probe,
+	.remove		= mdev_remove,
+};
+EXPORT_SYMBOL_GPL(mdev_bus_type);
+
+/**
+ * mdev_register_driver - register a new MDEV driver
+ * @drv: the driver to register
+ * @owner: module owner of driver to be registered
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &mdev_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_register_driver);
+
+/**
+ * mdev_unregister_driver - unregister MDEV driver
+ * @drv: the driver to unregister
+ *
+ */
+void mdev_unregister_driver(struct mdev_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_unregister_driver);
+
+int mdev_bus_register(void)
+{
+	return bus_register(&mdev_bus_type);
+}
+
+void mdev_bus_unregister(void)
+{
+	bus_unregister(&mdev_bus_type);
+}
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
new file mode 100644
index 000000000000..991d7f796169
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -0,0 +1,33 @@
+/*
+ * Mediated device internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_PRIVATE_H
+#define MDEV_PRIVATE_H
+
+int  mdev_bus_register(void);
+void mdev_bus_unregister(void);
+
+/* Function prototypes for mdev_sysfs */
+
+extern struct class_attribute mdev_class_attrs[];
+
+int  mdev_create_sysfs_files(struct device *dev);
+void mdev_remove_sysfs_files(struct device *dev);
+
+int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
+			char *mdev_params);
+int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
+void mdev_device_supported_config(struct device *dev, char *str);
+int  mdev_device_start(uuid_le uuid);
+int  mdev_device_shutdown(uuid_le uuid);
+
+#endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
new file mode 100644
index 000000000000..48b66e40009e
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -0,0 +1,300 @@
+/*
+ * File attributes for Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+/* Prototypes */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static DEVICE_ATTR_RO(mdev_supported_types);
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_create);
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_destroy);
+
+/* Static functions */
+
+#define UUID_CHAR_LENGTH	36
+#define UUID_BYTE_LENGTH	16
+
+#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
+
+static inline bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < UUID_CHAR_LENGTH)
+		return -EINVAL;
+
+	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			pr_err("%s: invalid UUID character\n", __func__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+/* mdev sysfs Functions */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	char *str, *ptr;
+	ssize_t n;
+
+	str = kzalloc(SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	ptr = str;
+	mdev_device_supported_config(dev, str);
+
+	n = sprintf(buf, "%s\n", str);
+	kfree(ptr);
+
+	return n;
+}
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	char *str, *pstr;
+	char *uuid_str, *instance_str, *mdev_params = NULL, *params = NULL;
+	uuid_le uuid;
+	uint32_t instance;
+	int ret;
+
+	pstr = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_create: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (!str) {
+		pr_err("mdev_create: mdev instance not present %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	instance_str = strsep(&str, ":");
+	if (!instance_str) {
+		pr_err("mdev_create: Empty instance string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	ret = kstrtouint(instance_str, 0, &instance);
+	if (ret) {
+		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
+		goto create_error;
+	}
+
+	if (str) {
+		params = mdev_params = kstrdup(str, GFP_KERNEL);
+		if (!params) {
+			ret = -ENOMEM;
+			goto create_error;
+		}
+	}
+
+	ret = uuid_parse(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_create: UUID parse error %s\n", buf);
+		goto create_error;
+	}
+
+	ret = mdev_device_create(dev, uuid, instance, mdev_params);
+	if (ret)
+		pr_err("mdev_create: Failed to create mdev device\n");
+	else
+		ret = count;
+
+create_error:
+	kfree(params);
+	kfree(pstr);
+	return ret;
+}
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	char *uuid_str, *str, *pstr;
+	uuid_le uuid;
+	unsigned int instance;
+	int ret;
+
+	str = pstr = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	if (!str) {
+		pr_err("mdev_destroy: instance not specified %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	ret = kstrtouint(str, 0, &instance);
+	if (ret) {
+		pr_err("mdev_destroy: instance parsing error %s\n", buf);
+		goto destroy_error;
+	}
+
+	ret = uuid_parse(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_destroy: UUID parse error %s\n", buf);
+		goto destroy_error;
+	}
+
+	ret = mdev_device_destroy(dev, uuid, instance);
+	if (ret == 0)
+		ret = count;
+
+destroy_error:
+	kfree(pstr);
+	return ret;
+}
+
+ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
+			 const char *buf, size_t count)
+{
+	char *uuid_str, *ptr;
+	uuid_le uuid;
+	int ret;
+
+	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_parse(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_start: UUID parse error %s\n", buf);
+		goto start_error;
+	}
+
+	ret = mdev_device_start(uuid);
+	if (ret == 0)
+		ret = count;
+
+start_error:
+	kfree(ptr);
+	return ret;
+}
+
+ssize_t mdev_shutdown_store(struct class *class, struct class_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *uuid_str, *ptr;
+	uuid_le uuid;
+	int ret;
+
+	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	ret = uuid_parse(uuid_str, &uuid);
+	if (ret) {
+		pr_err("mdev_shutdown: UUID parse error %s\n", buf);
+		goto shutdown_error;
+	}
+
+	ret = mdev_device_shutdown(uuid);
+	if (ret == 0)
+		ret = count;
+
+shutdown_error:
+	kfree(ptr);
+	return ret;
+}
+
+struct class_attribute mdev_class_attrs[] = {
+	__ATTR_WO(mdev_start),
+	__ATTR_WO(mdev_shutdown),
+	__ATTR_NULL
+};
+
+int mdev_create_sysfs_files(struct device *dev)
+{
+	int ret;
+
+	ret = sysfs_create_file(&dev->kobj,
+				&dev_attr_mdev_supported_types.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_supported_types sysfs entry\n");
+		return ret;
+	}
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_create sysfs entry\n");
+		goto create_sysfs_failed;
+	}
+
+	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+	if (ret) {
+		pr_err("Failed to create mdev_destroy sysfs entry\n");
+		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+		goto create_sysfs_failed;
+	}
+
+	return 0;
+
+create_sysfs_failed:
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	return ret;
+}
+
+void mdev_remove_sysfs_files(struct device *dev)
+{
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+}
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
new file mode 100644
index 000000000000..31b6f8572cfa
--- /dev/null
+++ b/include/linux/mdev.h
@@ -0,0 +1,232 @@
+/*
+ * Mediated device definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_H
+#define MDEV_H
+
+/* Common Data structures */
+
+struct pci_region_info {
+	uint64_t start;
+	uint64_t size;
+	uint32_t flags;		/* VFIO region info flags */
+};
+
+enum mdev_emul_space {
+	EMUL_CONFIG_SPACE,	/* PCI configuration space */
+	EMUL_IO,		/* I/O register space */
+	EMUL_MMIO		/* Memory-mapped I/O space */
+};
+
+struct parent_device;
+
+/*
+ * Mediated device
+ */
+
+struct mdev_device {
+	struct device		dev;
+	struct parent_device	*parent;
+	struct iommu_group	*group;
+	void			*iommu_data;
+	uuid_le			uuid;
+	uint32_t		instance;
+
+	/* internal only */
+	struct kref		ref;
+	struct mutex		ops_lock;
+	struct list_head	next;
+};
+
+
+/**
+ * struct parent_ops - Structure to be registered for each parent device to
+ * register the device to mdev module.
+ *
+ * @owner:		The module owner.
+ * @dev_attr_groups:	Default attributes of the parent device.
+ * @mdev_attr_groups:	Default attributes of the mediated device.
+ * @supported_config:	Called to get information about supported types.
+ *			@dev : device structure of parent device.
+ *			@config: should return string listing supported config
+ *			Returns integer: success (0) or error (< 0)
+ * @create:		Called to allocate basic resources in parent device's
+ *			driver for a particular mediated device
+ *			@dev: parent device structure on which mediated device
+ *			      should be created
+ *			@uuid: UUID of the VM for which the mediated device is
+ *			intended
+ *			@instance: mediated instance in that VM
+ *			@mdev_params: extra parameters required by parent
+ *			device's driver.
+ *			Returns integer: success (0) or error (< 0)
+ * @destroy:		Called to free resources in parent device's driver for
+ *			a mediated device instance of that VM.
+ *			@dev: parent device structure to which this mediated
+ *			      device points to.
+ *			@uuid: UUID of the VM to which the mediated device belongs
+ *			@instance: mdev instance in that VM
+ *			Returns integer: success (0) or error (< 0)
+ *			If the VM is running when destroy() is called, the mdev
+ *			is being hot-unplugged. Return an error if the VM is
+ *			running and the driver doesn't support mediated device
+ *			hotplug.
+ * @start:		Called to initiate mediated device initialization
+ *			process in parent device's driver when VM boots before
+ *			VMM starts.
+ *			@uuid: VM's UUID which is booting.
+ *			Returns integer: success (0) or error (< 0)
+ * @shutdown:		Called to teardown mediated device related resources for
+ *			the VM
+ *			@uuid: VM's UUID which is shutting down.
+ *			Returns integer: success (0) or error (< 0)
+ * @read:		Read emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: read buffer
+ *			@count: number of bytes to read
+ *			@address_space: specifies for which address space the
+ *			request is intended for - pci_config_space, IO register
+ *			space or MMIO space.
+ *			@addr: address.
+ *			Returns number of bytes read on success, or a negative
+ *			error code.
+ * @write:		Write emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: write buffer
+ *			@count: number of bytes to be written
+ *			@address_space: specifies for which address space the
+ *			request is intended for - pci_config_space, IO register
+ *			space or MMIO space.
+ *			@addr: address.
+ *			Returns number of bytes written on success, or a
+ *			negative error code.
+ * @set_irqs:		Called to convey the interrupt configuration that the
+ *			VMM sets.
+ *			@mdev: mediated device structure
+ *			@flags, index, start, count and *data : same as that of
+ *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
+ * @get_region_info:	Called to get VFIO region size and flags of mediated
+ *			device.
+ *			@mdev: mediated device structure
+ *			@region_index: VFIO region index
+ *			@region_info: output, returns size and flags of
+ *				      requested region.
+ *			Returns integer: success (0) or error (< 0)
+ * @validate_map_request: Validate remap pfn request
+ *			@mdev: mediated device structure
+ *			@virtaddr: target user address to start at
+ *			@pfn: parent address of kernel memory, vendor driver
+ *			      can change if required.
+ *			@size: size of map area, vendor driver can change the
+ *			       size of map area if desired.
+ *			@prot: page protection flags for this mapping, vendor
+ *			       driver can change, if required.
+ *			Returns integer: success (0) or error (< 0)
+ *
+ * A parent device that supports mediated devices should be registered with the
+ * mdev module along with its parent_ops structure.
+ */
+
+struct parent_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **mdev_attr_groups;
+
+	int	(*supported_config)(struct device *dev, char *config);
+	int     (*create)(struct device *dev, uuid_le uuid,
+			  uint32_t instance, char *mdev_params);
+	int     (*destroy)(struct device *dev, uuid_le uuid,
+			   uint32_t instance);
+	int     (*start)(uuid_le uuid);
+	int     (*shutdown)(uuid_le uuid);
+	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
+			enum mdev_emul_space address_space, loff_t pos);
+	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
+			 enum mdev_emul_space address_space, loff_t pos);
+	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
+			    unsigned int index, unsigned int start,
+			    unsigned int count, void *data);
+	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
+				 struct pci_region_info *region_info);
+	int	(*validate_map_request)(struct mdev_device *vdev,
+					unsigned long virtaddr,
+					unsigned long *pfn, unsigned long *size,
+					pgprot_t *prot);
+};
+
+/*
+ * Parent Device
+ */
+struct parent_device {
+	struct device		*dev;
+	const struct parent_ops	*ops;
+
+	/* internal */
+	struct kref		ref;
+	struct mutex		ops_lock;
+	struct list_head	next;
+	struct list_head	mdev_list;
+	struct mutex		mdev_list_lock;
+	wait_queue_head_t	release_done;
+};
+
+/**
+ * struct mdev_driver - Mediated device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @match: called when new device or driver is added for this bus. Return 1 if
+ *	   given device can be handled by given driver and zero otherwise.
+ * @driver: device driver structure
+ *
+ **/
+struct mdev_driver {
+	const char *name;
+	int  (*probe)(struct device *dev);
+	void (*remove)(struct device *dev);
+	int  (*match)(struct device *dev);
+	struct device_driver driver;
+};
+
+static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
+}
+
+static inline struct mdev_device *to_mdev_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
+}
+
+static inline void *mdev_get_drvdata(struct mdev_device *mdev)
+{
+	return dev_get_drvdata(&mdev->dev);
+}
+
+static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
+{
+	dev_set_drvdata(&mdev->dev, data);
+}
+
+extern struct bus_type mdev_bus_type;
+
+#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
+
+extern int  mdev_register_device(struct device *dev,
+				 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
+extern void mdev_put_device(struct mdev_device *mdev);
+
+extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
+
+#endif /* MDEV_H */
-- 
2.7.0


* [PATCH 2/3] VFIO driver for mediated PCI device
  2016-06-20 16:31 ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-20 16:31   ` Kirti Wankhede
  -1 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-20 16:31 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: shuai.ruan, jike.song, kvm, kevin.tian, qemu-devel,
	Kirti Wankhede, zhiyuan.lv, bjsdjshi

The VFIO driver registers with the MDEV core driver. The MDEV core driver
creates a mediated device and calls the probe routine of the MPCI VFIO
driver, which then adds the mediated device to the VFIO core module.
The main aim of this module is to manage all VFIO APIs for each mediated
PCI device:
- get region information from the vendor driver.
- trap and emulate PCI config space and BAR regions.
- send interrupt configuration information to the vendor driver.
- mmap the mappable region, with mapping invalidation and fault-on-access
  to remap pfns.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
---
 drivers/vfio/mdev/Kconfig           |   7 +
 drivers/vfio/mdev/Makefile          |   1 +
 drivers/vfio/mdev/vfio_mpci.c       | 654 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
 include/linux/vfio.h                |   7 +
 6 files changed, 670 insertions(+), 6 deletions(-)
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index 951e2bb06a3f..8d9e78aaa80f 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,3 +9,10 @@ config MDEV
 
          If you don't know what to do here, say N.
 
+config VFIO_MPCI
+    tristate "VFIO support for Mediated PCI devices"
+    depends on VFIO && PCI && MDEV
+    default n
+    help
+        VFIO based driver for mediated PCI devices.
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 2c6d11f7bc24..cd5e7625e1ec 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,4 +2,5 @@
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
 obj-$(CONFIG_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
 
diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
new file mode 100644
index 000000000000..267879a05c39
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mpci.c
@@ -0,0 +1,654 @@
+/*
+ * VFIO based Mediated PCI device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC     "VFIO based Mediated PCI device driver"
+
+struct vfio_mdev {
+	struct iommu_group *group;
+	struct mdev_device *mdev;
+	int		    refcnt;
+	struct pci_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
+	u8		    *vconfig;
+	struct mutex	    vfio_mdev_lock;
+};
+
+static int get_mdev_region_info(struct mdev_device *mdev,
+				struct pci_region_info *vfio_region_info,
+				int index)
+{
+	int ret = -EINVAL;
+	struct parent_device *parent = mdev->parent;
+
+	if (parent && dev_is_pci(parent->dev) && parent->ops->get_region_info) {
+		mutex_lock(&mdev->ops_lock);
+		ret = parent->ops->get_region_info(mdev, index,
+						    vfio_region_info);
+		mutex_unlock(&mdev->ops_lock);
+	}
+	return ret;
+}
+
+static void mdev_read_base(struct vfio_mdev *vmdev)
+{
+	int index, pos;
+	u32 start_lo, start_hi;
+	u32 mem_type;
+
+	pos = PCI_BASE_ADDRESS_0;
+
+	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+		if (!vmdev->vfio_region_info[index].size)
+			continue;
+
+		start_lo = (*(u32 *)(vmdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_MASK;
+		mem_type = (*(u32 *)(vmdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+		switch (mem_type) {
+		case PCI_BASE_ADDRESS_MEM_TYPE_64:
+			start_hi = (*(u32 *)(vmdev->vconfig + pos + 4));
+			pos += 4;
+			break;
+		case PCI_BASE_ADDRESS_MEM_TYPE_32:
+		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+			/* 1M mem BAR treated as 32-bit BAR */
+		default:
+			/* mem unknown type treated as 32-bit BAR */
+			start_hi = 0;
+			break;
+		}
+		pos += 4;
+		vmdev->vfio_region_info[index].start = ((u64)start_hi << 32) |
+							start_lo;
+	}
+}
+
+static int vfio_mpci_open(void *device_data)
+{
+	int ret = 0;
+	struct vfio_mdev *vmdev = device_data;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vmdev->vfio_mdev_lock);
+	if (!vmdev->refcnt) {
+		u8 *vconfig;
+		int index;
+		struct pci_region_info *cfg_reg;
+
+		for (index = VFIO_PCI_BAR0_REGION_INDEX;
+		     index < VFIO_PCI_NUM_REGIONS; index++) {
+			ret = get_mdev_region_info(vmdev->mdev,
+						&vmdev->vfio_region_info[index],
+						index);
+			if (ret)
+				goto open_error;
+		}
+		cfg_reg = &vmdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX];
+		if (!cfg_reg->size) {
+			ret = -EINVAL;
+			goto open_error;
+		}
+
+		vconfig = kzalloc(cfg_reg->size, GFP_KERNEL);
+		if (!vconfig) {
+			ret = -ENOMEM;
+			goto open_error;
+		}
+
+		vmdev->vconfig = vconfig;
+	}
+
+	vmdev->refcnt++;
+open_error:
+
+	mutex_unlock(&vmdev->vfio_mdev_lock);
+	if (ret)
+		module_put(THIS_MODULE);
+
+	return ret;
+}
+
+static void vfio_mpci_close(void *device_data)
+{
+	struct vfio_mdev *vmdev = device_data;
+
+	mutex_lock(&vmdev->vfio_mdev_lock);
+	vmdev->refcnt--;
+	if (!vmdev->refcnt) {
+		memset(&vmdev->vfio_region_info, 0,
+			sizeof(vmdev->vfio_region_info));
+		kfree(vmdev->vconfig);
+	}
+	mutex_unlock(&vmdev->vfio_mdev_lock);
+	module_put(THIS_MODULE);
+}
+
+static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
+{
+	/* Don't support MSIX for now */
+	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
+		return -1;
+
+	return 1;
+}
+
+static long vfio_mpci_unlocked_ioctl(void *device_data,
+				     unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_mdev *vmdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vmdev->vfio_region_info[info.index].size;
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = vmdev->vfio_region_info[info.index].flags;
+			break;
+		case VFIO_PCI_VGA_REGION_INDEX:
+		case VFIO_PCI_ROM_REGION_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	{
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
+		case VFIO_PCI_REQ_IRQ_INDEX:
+			break;
+		/* MSI-X is not supported; fall through to return error */
+		case VFIO_PCI_MSIX_IRQ_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		info.count = VFIO_PCI_NUM_IRQS;
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+		info.count = mdev_get_irq_count(vmdev, info.index);
+
+		if (info.count == -1)
+			return -EINVAL;
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+					VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_DEVICE_SET_IRQS:
+	{
+		struct vfio_irq_set hdr;
+		struct mdev_device *mdev = vmdev->mdev;
+		struct parent_device *parent = vmdev->mdev->parent;
+		u8 *data = NULL, *ptr = NULL;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+		    VFIO_IRQ_SET_ACTION_TYPE_MASK))
+			return -EINVAL;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			size_t size;
+			int max = mdev_get_irq_count(vmdev, hdr.index);
+
+			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+				size = sizeof(uint8_t);
+			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+				size = sizeof(int32_t);
+			else
+				return -EINVAL;
+
+			if (hdr.argsz - minsz < hdr.count * size ||
+			    hdr.start >= max || hdr.start + hdr.count > max)
+				return -EINVAL;
+
+			ptr = data = memdup_user((void __user *)(arg + minsz),
+						 hdr.count * size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+		}
+
+		if (parent && parent->ops->set_irqs) {
+			mutex_lock(&mdev->ops_lock);
+			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
+						    hdr.start, hdr.count, data);
+			mutex_unlock(&mdev->ops_lock);
+		}
+
+		kfree(ptr);
+		return ret;
+	}
+	}
+	return -ENOTTY;
+}
+
+ssize_t mdev_dev_config_rw(struct vfio_mdev *vmdev, char __user *buf,
+			   size_t count, loff_t *ppos, bool iswrite)
+{
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	int size = vmdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
+	int ret = 0;
+	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos < 0 || pos >= size ||
+	    pos + count > size) {
+		pr_err("%s pos 0x%llx out of range\n", __func__, pos);
+		ret = -EFAULT;
+		goto config_rw_exit;
+	}
+
+	if (iswrite) {
+		char *usr_data, *ptr;
+
+		ptr = usr_data = memdup_user(buf, count);
+		if (IS_ERR(usr_data)) {
+			ret = PTR_ERR(usr_data);
+			goto config_rw_exit;
+		}
+
+		ret = parent->ops->write(mdev, usr_data, count,
+					  EMUL_CONFIG_SPACE, pos);
+
+		memcpy((void *)(vmdev->vconfig + pos), (void *)usr_data, count);
+		kfree(ptr);
+	} else {
+		char *ret_data, *ptr;
+
+		ptr = ret_data = kzalloc(count, GFP_KERNEL);
+
+		if (IS_ERR(ret_data)) {
+			ret = PTR_ERR(ret_data);
+			goto config_rw_exit;
+		}
+
+		ret = parent->ops->read(mdev, ret_data, count,
+					EMUL_CONFIG_SPACE, pos);
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret))
+				ret = -EFAULT;
+			else
+				memcpy((void *)(vmdev->vconfig + pos),
+					(void *)ret_data, count);
+		}
+		kfree(ptr);
+	}
+config_rw_exit:
+
+	if (ret > 0)
+		*ppos += ret;
+
+	return ret;
+}
+
+ssize_t mdev_dev_bar_rw(struct vfio_mdev *vmdev, char __user *buf,
+			size_t count, loff_t *ppos, bool iswrite)
+{
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t pos;
+	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	int ret = 0;
+
+	if (!vmdev->vfio_region_info[bar_index].start)
+		mdev_read_base(vmdev);
+
+	if (offset >= vmdev->vfio_region_info[bar_index].size) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	count = min(count,
+		    (size_t)(vmdev->vfio_region_info[bar_index].size - offset));
+
+	pos = vmdev->vfio_region_info[bar_index].start + offset;
+
+	if (iswrite) {
+		char *usr_data, *ptr;
+
+		ptr = usr_data = memdup_user(buf, count);
+		if (IS_ERR(usr_data)) {
+			ret = PTR_ERR(usr_data);
+			goto bar_rw_exit;
+		}
+
+		ret = parent->ops->write(mdev, usr_data, count, EMUL_MMIO, pos);
+
+		kfree(ptr);
+	} else {
+		char *ret_data, *ptr;
+
+		ptr = ret_data = kzalloc(count, GFP_KERNEL);
+
+		if (!ret_data) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		ret = parent->ops->read(mdev, ret_data, count, EMUL_MMIO, pos);
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret))
+				ret = -EFAULT;
+		}
+		kfree(ptr);
+	}
+
+bar_rw_exit:
+
+	if (ret > 0)
+		*ppos += ret;
+
+	return ret;
+}
+
+
+static ssize_t mdev_dev_rw(void *device_data, char __user *buf,
+			   size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_mdev *vmdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		return mdev_dev_config_rw(vmdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+		return mdev_dev_bar_rw(vmdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_ROM_REGION_INDEX:
+	case VFIO_PCI_VGA_REGION_INDEX:
+		break;
+	}
+
+	return -EINVAL;
+}
+
+
+static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	if (!count)
+		return 0;
+
+	if (IS_ERR_OR_NULL(buf))
+		return -EINVAL;
+
+	if (parent && parent->ops->read) {
+		mutex_lock(&mdev->ops_lock);
+		ret = mdev_dev_rw(device_data, buf, count, ppos, false);
+		mutex_unlock(&mdev->ops_lock);
+	}
+
+	return ret;
+}
+
+static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	if (!count)
+		return 0;
+
+	if (IS_ERR_OR_NULL(buf))
+		return -EINVAL;
+
+	if (parent && parent->ops->write) {
+		mutex_lock(&mdev->ops_lock);
+		ret = mdev_dev_rw(device_data, (char __user *)buf, count,
+				  ppos, true);
+		mutex_unlock(&mdev->ops_lock);
+	}
+
+	return ret;
+}
+
+static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	int ret;
+	struct vfio_mdev *vmdev = vma->vm_private_data;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	u64 virtaddr = (u64)vmf->virtual_address;
+	u64 offset, phyaddr;
+	unsigned long req_size, pgoff;
+	pgprot_t pg_prot;
+
+	if (!vmdev && !vmdev->mdev)
+		return -EINVAL;
+
+	mdev = vmdev->mdev;
+	parent  = mdev->parent;
+
+	offset   = virtaddr - vma->vm_start;
+	phyaddr  = (vma->vm_pgoff << PAGE_SHIFT) + offset;
+	pgoff    = phyaddr >> PAGE_SHIFT;
+	req_size = vma->vm_end - virtaddr;
+	pg_prot  = vma->vm_page_prot;
+
+	if (parent && parent->ops->validate_map_request) {
+		mutex_lock(&mdev->ops_lock);
+		ret = parent->ops->validate_map_request(mdev, virtaddr,
+							 &pgoff, &req_size,
+							 &pg_prot);
+		mutex_unlock(&mdev->ops_lock);
+		if (ret)
+			return ret;
+
+		if (!req_size)
+			return -EINVAL;
+	}
+
+	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
+
+	return ret | VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct mdev_dev_mmio_ops = {
+	.fault = mdev_dev_mmio_fault,
+};
+
+
+static int vfio_mpci_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	unsigned int index;
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct pci_dev *pdev;
+	unsigned long pgoff;
+	loff_t offset;
+
+	if (!mdev->parent || !dev_is_pci(mdev->parent->dev))
+		return -EINVAL;
+
+	pdev = to_pci_dev(mdev->parent->dev);
+
+	offset = vma->vm_pgoff << PAGE_SHIFT;
+
+	index = VFIO_PCI_OFFSET_TO_INDEX(offset);
+
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+	vma->vm_private_data = vmdev;
+	vma->vm_ops = &mdev_dev_mmio_ops;
+
+	return 0;
+}
+
+static const struct vfio_device_ops vfio_mpci_dev_ops = {
+	.name		= "vfio-mpci",
+	.open		= vfio_mpci_open,
+	.release	= vfio_mpci_close,
+	.ioctl		= vfio_mpci_unlocked_ioctl,
+	.read		= vfio_mpci_read,
+	.write		= vfio_mpci_write,
+	.mmap		= vfio_mpci_mmap,
+};
+
+int vfio_mpci_probe(struct device *dev)
+{
+	struct vfio_mdev *vmdev;
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	if (!mdev)
+		return -EINVAL;
+
+	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
+	if (IS_ERR(vmdev))
+		return PTR_ERR(vmdev);
+
+	vmdev->mdev = mdev_get_device(mdev);
+	vmdev->group = mdev->group;
+	mutex_init(&vmdev->vfio_mdev_lock);
+
+	ret = vfio_add_group_dev(dev, &vfio_mpci_dev_ops, vmdev);
+	if (ret)
+		kfree(vmdev);
+
+	mdev_put_device(mdev);
+	return ret;
+}
+
+void vfio_mpci_remove(struct device *dev)
+{
+	struct vfio_mdev *vmdev;
+
+	vmdev = vfio_del_group_dev(dev);
+	kfree(vmdev);
+}
+
+int vfio_mpci_match(struct device *dev)
+{
+	if (dev_is_pci(dev->parent))
+		return 1;
+
+	return 0;
+}
+
+struct mdev_driver vfio_mpci_driver = {
+	.name	= "vfio_mpci",
+	.probe	= vfio_mpci_probe,
+	.remove	= vfio_mpci_remove,
+	.match	= vfio_mpci_match,
+};
+
+static int __init vfio_mpci_init(void)
+{
+	return mdev_register_driver(&vfio_mpci_driver, THIS_MODULE);
+}
+
+static void __exit vfio_mpci_exit(void)
+{
+	mdev_unregister_driver(&vfio_mpci_driver);
+}
+
+module_init(vfio_mpci_init)
+module_exit(vfio_mpci_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 8a7d546d18a0..04a450908ffb 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -19,12 +19,6 @@
 #ifndef VFIO_PCI_PRIVATE_H
 #define VFIO_PCI_PRIVATE_H
 
-#define VFIO_PCI_OFFSET_SHIFT   40
-
-#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
-
 /* Special capability IDs predefined access */
 #define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
 #define PCI_CAP_ID_INVALID_VIRT		0xFE	/* default virt access */
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 5ffd1d9ad4bd..5b912be9d9c3 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>
 #include <linux/io.h>
 #include <linux/vgaarb.h>
+#include <linux/vfio.h>
 
 #include "vfio_pci_private.h"
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b1cd34..431b824b0d3e 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -18,6 +18,13 @@
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
 
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
  *
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [Qemu-devel] [PATCH 2/3] VFIO driver for mediated PCI device
@ 2016-06-20 16:31   ` Kirti Wankhede
  0 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-20 16:31 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	bjsdjshi, Kirti Wankhede

The VFIO driver registers with the MDEV core driver. The MDEV core driver
creates a mediated device and calls the probe routine of the MPCI VFIO
driver, which adds the mediated device to the VFIO core module.
The main aim of this module is to manage all VFIO APIs for each mediated
PCI device. Those are:
- get region information from the vendor driver.
- trap and emulate the PCI config space and BAR regions.
- send interrupt configuration information to the vendor driver.
- mmap the mappable region, with mapping invalidation and a fault handler
  that remaps the pfn on access.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
---
 drivers/vfio/mdev/Kconfig           |   7 +
 drivers/vfio/mdev/Makefile          |   1 +
 drivers/vfio/mdev/vfio_mpci.c       | 654 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
 include/linux/vfio.h                |   7 +
 6 files changed, 670 insertions(+), 6 deletions(-)
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index 951e2bb06a3f..8d9e78aaa80f 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,3 +9,10 @@ config MDEV
 
         If you don't know what do here, say N.
 
+config VFIO_MPCI
+    tristate "VFIO support for Mediated PCI devices"
+    depends on VFIO && PCI && MDEV
+    default n
+    help
+        VFIO based driver for mediated PCI devices.
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 2c6d11f7bc24..cd5e7625e1ec 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,4 +2,5 @@
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
 obj-$(CONFIG_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
 
diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
new file mode 100644
index 000000000000..267879a05c39
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mpci.c
@@ -0,0 +1,654 @@
+/*
+ * VFIO based Mediated PCI device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC     "VFIO based Mediated PCI device driver"
+
+struct vfio_mdev {
+	struct iommu_group *group;
+	struct mdev_device *mdev;
+	int		    refcnt;
+	struct pci_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
+	u8		    *vconfig;
+	struct mutex	    vfio_mdev_lock;
+};
+
+static int get_mdev_region_info(struct mdev_device *mdev,
+				struct pci_region_info *vfio_region_info,
+				int index)
+{
+	int ret = -EINVAL;
+	struct parent_device *parent = mdev->parent;
+
+	if (parent && dev_is_pci(parent->dev) && parent->ops->get_region_info) {
+		mutex_lock(&mdev->ops_lock);
+		ret = parent->ops->get_region_info(mdev, index,
+						    vfio_region_info);
+		mutex_unlock(&mdev->ops_lock);
+	}
+	return ret;
+}
+
+static void mdev_read_base(struct vfio_mdev *vmdev)
+{
+	int index, pos;
+	u32 start_lo, start_hi;
+	u32 mem_type;
+
+	pos = PCI_BASE_ADDRESS_0;
+
+	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+		if (!vmdev->vfio_region_info[index].size)
+			continue;
+
+		start_lo = (*(u32 *)(vmdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_MASK;
+		mem_type = (*(u32 *)(vmdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+		switch (mem_type) {
+		case PCI_BASE_ADDRESS_MEM_TYPE_64:
+			start_hi = (*(u32 *)(vmdev->vconfig + pos + 4));
+			pos += 4;
+			break;
+		case PCI_BASE_ADDRESS_MEM_TYPE_32:
+		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+			/* 1M mem BAR treated as 32-bit BAR */
+		default:
+			/* mem unknown type treated as 32-bit BAR */
+			start_hi = 0;
+			break;
+		}
+		pos += 4;
+		vmdev->vfio_region_info[index].start = ((u64)start_hi << 32) |
+							start_lo;
+	}
+}
+
+static int vfio_mpci_open(void *device_data)
+{
+	int ret = 0;
+	struct vfio_mdev *vmdev = device_data;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vmdev->vfio_mdev_lock);
+	if (!vmdev->refcnt) {
+		u8 *vconfig;
+		int index;
+		struct pci_region_info *cfg_reg;
+
+		for (index = VFIO_PCI_BAR0_REGION_INDEX;
+		     index < VFIO_PCI_NUM_REGIONS; index++) {
+			ret = get_mdev_region_info(vmdev->mdev,
+						&vmdev->vfio_region_info[index],
+						index);
+			if (ret)
+				goto open_error;
+		}
+		cfg_reg = &vmdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX];
+		if (!cfg_reg->size) {
+			ret = -EINVAL;
+			goto open_error;
+		}
+
+		vconfig = kzalloc(cfg_reg->size, GFP_KERNEL);
+		if (!vconfig) {
+			ret = -ENOMEM;
+			goto open_error;
+		}
+
+		vmdev->vconfig = vconfig;
+	}
+
+	vmdev->refcnt++;
+open_error:
+
+	mutex_unlock(&vmdev->vfio_mdev_lock);
+	if (ret)
+		module_put(THIS_MODULE);
+
+	return ret;
+}
+
+static void vfio_mpci_close(void *device_data)
+{
+	struct vfio_mdev *vmdev = device_data;
+
+	mutex_lock(&vmdev->vfio_mdev_lock);
+	vmdev->refcnt--;
+	if (!vmdev->refcnt) {
+		memset(&vmdev->vfio_region_info, 0,
+			sizeof(vmdev->vfio_region_info));
+		kfree(vmdev->vconfig);
+	}
+	mutex_unlock(&vmdev->vfio_mdev_lock);
+	module_put(THIS_MODULE);
+}
+
+static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
+{
+	/* Don't support MSIX for now */
+	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
+		return -1;
+
+	return 1;
+}
+
+static long vfio_mpci_unlocked_ioctl(void *device_data,
+				     unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_mdev *vmdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vmdev->vfio_region_info[info.index].size;
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = vmdev->vfio_region_info[info.index].flags;
+			break;
+		case VFIO_PCI_VGA_REGION_INDEX:
+		case VFIO_PCI_ROM_REGION_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	{
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
+		case VFIO_PCI_REQ_IRQ_INDEX:
+			break;
+		case VFIO_PCI_MSIX_IRQ_INDEX:
+			/* fall through to return error */
+		default:
+			return -EINVAL;
+		}
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+		info.count = mdev_get_irq_count(vmdev, info.index);
+
+		if (info.count == -1)
+			return -EINVAL;
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+					VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+	case VFIO_DEVICE_SET_IRQS:
+	{
+		struct vfio_irq_set hdr;
+		struct mdev_device *mdev = vmdev->mdev;
+		struct parent_device *parent = vmdev->mdev->parent;
+		u8 *data = NULL, *ptr = NULL;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+		    VFIO_IRQ_SET_ACTION_TYPE_MASK))
+			return -EINVAL;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			size_t size;
+			int max = mdev_get_irq_count(vmdev, hdr.index);
+
+			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+				size = sizeof(uint8_t);
+			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+				size = sizeof(int32_t);
+			else
+				return -EINVAL;
+
+			if (hdr.argsz - minsz < hdr.count * size ||
+			    hdr.start >= max || hdr.start + hdr.count > max)
+				return -EINVAL;
+
+			ptr = data = memdup_user((void __user *)(arg + minsz),
+						 hdr.count * size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+		}
+
+		if (parent && parent->ops->set_irqs) {
+			mutex_lock(&mdev->ops_lock);
+			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
+						    hdr.start, hdr.count, data);
+			mutex_unlock(&mdev->ops_lock);
+		}
+
+		kfree(ptr);
+		return ret;
+	}
+	}
+	return -ENOTTY;
+}
+
+ssize_t mdev_dev_config_rw(struct vfio_mdev *vmdev, char __user *buf,
+			   size_t count, loff_t *ppos, bool iswrite)
+{
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	int size = vmdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
+	int ret = 0;
+	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= size || pos + count > size) {
+		pr_err("%s pos 0x%llx out of range\n", __func__, pos);
+		ret = -EFAULT;
+		goto config_rw_exit;
+	}
+
+	if (iswrite) {
+		char *usr_data, *ptr;
+
+		ptr = usr_data = memdup_user(buf, count);
+		if (IS_ERR(usr_data)) {
+			ret = PTR_ERR(usr_data);
+			goto config_rw_exit;
+		}
+
+		ret = parent->ops->write(mdev, usr_data, count,
+					  EMUL_CONFIG_SPACE, pos);
+
+		if (ret > 0)
+			memcpy(vmdev->vconfig + pos, usr_data, ret);
+		kfree(ptr);
+	} else {
+		char *ret_data, *ptr;
+
+		ptr = ret_data = kzalloc(count, GFP_KERNEL);
+
+		if (!ret_data) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		ret = parent->ops->read(mdev, ret_data, count,
+					EMUL_CONFIG_SPACE, pos);
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret))
+				ret = -EFAULT;
+			else
+				memcpy((void *)(vmdev->vconfig + pos),
+					(void *)ret_data, ret);
+		}
+		kfree(ptr);
+	}
+config_rw_exit:
+
+	if (ret > 0)
+		*ppos += ret;
+
+	return ret;
+}
+
+ssize_t mdev_dev_bar_rw(struct vfio_mdev *vmdev, char __user *buf,
+			size_t count, loff_t *ppos, bool iswrite)
+{
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t pos;
+	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	int ret = 0;
+
+	if (!vmdev->vfio_region_info[bar_index].start)
+		mdev_read_base(vmdev);
+
+	if (offset >= vmdev->vfio_region_info[bar_index].size) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	count = min(count,
+		    (size_t)(vmdev->vfio_region_info[bar_index].size - offset));
+
+	pos = vmdev->vfio_region_info[bar_index].start + offset;
+
+	if (iswrite) {
+		char *usr_data, *ptr;
+
+		ptr = usr_data = memdup_user(buf, count);
+		if (IS_ERR(usr_data)) {
+			ret = PTR_ERR(usr_data);
+			goto bar_rw_exit;
+		}
+
+		ret = parent->ops->write(mdev, usr_data, count, EMUL_MMIO, pos);
+
+		kfree(ptr);
+	} else {
+		char *ret_data, *ptr;
+
+		ptr = ret_data = kzalloc(count, GFP_KERNEL);
+
+		if (!ret_data) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		ret = parent->ops->read(mdev, ret_data, count, EMUL_MMIO, pos);
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret))
+				ret = -EFAULT;
+		}
+		kfree(ptr);
+	}
+
+bar_rw_exit:
+
+	if (ret > 0)
+		*ppos += ret;
+
+	return ret;
+}
+
+
+static ssize_t mdev_dev_rw(void *device_data, char __user *buf,
+			   size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_mdev *vmdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		return mdev_dev_config_rw(vmdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+		return mdev_dev_bar_rw(vmdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_ROM_REGION_INDEX:
+	case VFIO_PCI_VGA_REGION_INDEX:
+		break;
+	}
+
+	return -EINVAL;
+}
+
+
+static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	if (!count)
+		return 0;
+
+	if (IS_ERR_OR_NULL(buf))
+		return -EINVAL;
+
+	if (parent && parent->ops->read) {
+		mutex_lock(&mdev->ops_lock);
+		ret = mdev_dev_rw(device_data, buf, count, ppos, false);
+		mutex_unlock(&mdev->ops_lock);
+	}
+
+	return ret;
+}
+
+static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct parent_device *parent = mdev->parent;
+	int ret = 0;
+
+	if (!count)
+		return 0;
+
+	if (IS_ERR_OR_NULL(buf))
+		return -EINVAL;
+
+	if (parent && parent->ops->write) {
+		mutex_lock(&mdev->ops_lock);
+		ret = mdev_dev_rw(device_data, (char __user *)buf, count,
+				  ppos, true);
+		mutex_unlock(&mdev->ops_lock);
+	}
+
+	return ret;
+}
+
+static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	int ret;
+	struct vfio_mdev *vmdev = vma->vm_private_data;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	u64 virtaddr = (u64)vmf->virtual_address;
+	u64 offset, phyaddr;
+	unsigned long req_size, pgoff;
+	pgprot_t pg_prot;
+
+	if (!vmdev || !vmdev->mdev)
+		return VM_FAULT_SIGBUS;
+
+	mdev = vmdev->mdev;
+	parent  = mdev->parent;
+
+	offset   = virtaddr - vma->vm_start;
+	phyaddr  = (vma->vm_pgoff << PAGE_SHIFT) + offset;
+	pgoff    = phyaddr >> PAGE_SHIFT;
+	req_size = vma->vm_end - virtaddr;
+	pg_prot  = vma->vm_page_prot;
+
+	if (parent && parent->ops->validate_map_request) {
+		mutex_lock(&mdev->ops_lock);
+		ret = parent->ops->validate_map_request(mdev, virtaddr,
+							 &pgoff, &req_size,
+							 &pg_prot);
+		mutex_unlock(&mdev->ops_lock);
+		if (ret)
+			return VM_FAULT_SIGBUS;
+
+		if (!req_size)
+			return VM_FAULT_SIGBUS;
+	}
+
+	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
+
+	return ret ? VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct mdev_dev_mmio_ops = {
+	.fault = mdev_dev_mmio_fault,
+};
+
+
+static int vfio_mpci_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	unsigned int index;
+	struct vfio_mdev *vmdev = device_data;
+	struct mdev_device *mdev = vmdev->mdev;
+	struct pci_dev *pdev;
+	unsigned long pgoff;
+	loff_t offset;
+
+	if (!mdev->parent || !dev_is_pci(mdev->parent->dev))
+		return -EINVAL;
+
+	pdev = to_pci_dev(mdev->parent->dev);
+
+	offset = vma->vm_pgoff << PAGE_SHIFT;
+
+	index = VFIO_PCI_OFFSET_TO_INDEX(offset);
+
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+	vma->vm_private_data = vmdev;
+	vma->vm_ops = &mdev_dev_mmio_ops;
+
+	return 0;
+}
+
+static const struct vfio_device_ops vfio_mpci_dev_ops = {
+	.name		= "vfio-mpci",
+	.open		= vfio_mpci_open,
+	.release	= vfio_mpci_close,
+	.ioctl		= vfio_mpci_unlocked_ioctl,
+	.read		= vfio_mpci_read,
+	.write		= vfio_mpci_write,
+	.mmap		= vfio_mpci_mmap,
+};
+
+int vfio_mpci_probe(struct device *dev)
+{
+	struct vfio_mdev *vmdev;
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	if (!mdev)
+		return -EINVAL;
+
+	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
+	if (!vmdev)
+		return -ENOMEM;
+
+	vmdev->mdev = mdev_get_device(mdev);
+	vmdev->group = mdev->group;
+	mutex_init(&vmdev->vfio_mdev_lock);
+
+	ret = vfio_add_group_dev(dev, &vfio_mpci_dev_ops, vmdev);
+	if (ret) {
+		mdev_put_device(mdev);
+		kfree(vmdev);
+	}
+
+	return ret;
+}
+
+void vfio_mpci_remove(struct device *dev)
+{
+	struct vfio_mdev *vmdev;
+
+	vmdev = vfio_del_group_dev(dev);
+	mdev_put_device(vmdev->mdev);
+	kfree(vmdev);
+}
+
+int vfio_mpci_match(struct device *dev)
+{
+	if (dev_is_pci(dev->parent))
+		return 1;
+
+	return 0;
+}
+
+struct mdev_driver vfio_mpci_driver = {
+	.name	= "vfio_mpci",
+	.probe	= vfio_mpci_probe,
+	.remove	= vfio_mpci_remove,
+	.match	= vfio_mpci_match,
+};
+
+static int __init vfio_mpci_init(void)
+{
+	return mdev_register_driver(&vfio_mpci_driver, THIS_MODULE);
+}
+
+static void __exit vfio_mpci_exit(void)
+{
+	mdev_unregister_driver(&vfio_mpci_driver);
+}
+
+module_init(vfio_mpci_init)
+module_exit(vfio_mpci_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 8a7d546d18a0..04a450908ffb 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -19,12 +19,6 @@
 #ifndef VFIO_PCI_PRIVATE_H
 #define VFIO_PCI_PRIVATE_H
 
-#define VFIO_PCI_OFFSET_SHIFT   40
-
-#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
-
 /* Special capability IDs predefined access */
 #define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
 #define PCI_CAP_ID_INVALID_VIRT		0xFE	/* default virt access */
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 5ffd1d9ad4bd..5b912be9d9c3 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>
 #include <linux/io.h>
 #include <linux/vgaarb.h>
+#include <linux/vfio.h>
 
 #include "vfio_pci_private.h"
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b1cd34..431b824b0d3e 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -18,6 +18,13 @@
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
 
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
  *
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 3/3] VFIO Type1 IOMMU: Add support for mediated devices
  2016-06-20 16:31 ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-20 16:31   ` Kirti Wankhede
  -1 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-20 16:31 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	bjsdjshi, Kirti Wankhede

VFIO Type1 IOMMU driver is designed for the devices which are IOMMU
capable. Mediated device only uses IOMMU TYPE1 API, the underlying
hardware can be managed by an IOMMU domain.

This change exports functions to pin and unpin pages for mediated devices.
It maintains data of pinned pages for mediated domain. This data is used to
verify unpinning request and to unpin remaining pages from detach_group()
if there are any.

Aim of this change is:
- To use most of the code of IOMMU driver for mediated devices
- To support direct assigned device and mediated device by single module

This version keeps the mediated domain structure out of domain_list.

Tested by assigning the following combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- Two GPU pass through devices

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
---
 drivers/vfio/vfio_iommu_type1.c | 444 +++++++++++++++++++++++++++++++++++++---
 include/linux/vfio.h            |   6 +
 2 files changed, 418 insertions(+), 32 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e93cedb..f17dd104fe27 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/workqueue.h>
+#include <linux/mdev.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -55,6 +56,7 @@ MODULE_PARM_DESC(disable_hugepages,
 
 struct vfio_iommu {
 	struct list_head	domain_list;
+	struct vfio_domain	*mediated_domain;
 	struct mutex		lock;
 	struct rb_root		dma_list;
 	bool			v2;
@@ -67,6 +69,13 @@ struct vfio_domain {
 	struct list_head	group_list;
 	int			prot;		/* IOMMU_CACHE */
 	bool			fgsp;		/* Fine-grained super pages */
+
+	/* Domain for mediated devices, which have no physical IOMMU */
+	bool			mediated_device;
+
+	struct mm_struct	*mm;
+	struct rb_root		pfn_list;	/* pinned Host pfn list */
+	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
 };
 
 struct vfio_dma {
@@ -79,10 +88,26 @@ struct vfio_dma {
 
 struct vfio_group {
 	struct iommu_group	*iommu_group;
+#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
+	struct mdev_device	*mdev;
+#endif
 	struct list_head	next;
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_pfn {
+	struct rb_node		node;
+	unsigned long		vaddr;		/* virtual addr */
+	dma_addr_t		iova;		/* IOVA */
+	unsigned long		npage;		/* number of pages */
+	unsigned long		pfn;		/* Host pfn */
+	size_t			prot;
+	atomic_t		ref_count;
+};
+
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +155,64 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
 	rb_erase(&old->node, &iommu->dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
+				      unsigned long pfn)
+{
+	struct rb_node *node;
+	struct vfio_pfn *vpfn, *ret = NULL;
+
+	mutex_lock(&domain->pfn_list_lock);
+	node = domain->pfn_list.rb_node;
+
+	while (node) {
+		vpfn = rb_entry(node, struct vfio_pfn, node);
+
+		if (pfn < vpfn->pfn)
+			node = node->rb_left;
+		else if (pfn > vpfn->pfn)
+			node = node->rb_right;
+		else {
+			ret = vpfn;
+			break;
+		}
+	}
+
+	mutex_unlock(&domain->pfn_list_lock);
+	return ret;
+}
+
+static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
+{
+	struct rb_node **link, *parent = NULL;
+	struct vfio_pfn *vpfn;
+
+	mutex_lock(&domain->pfn_list_lock);
+	link = &domain->pfn_list.rb_node;
+	while (*link) {
+		parent = *link;
+		vpfn = rb_entry(parent, struct vfio_pfn, node);
+
+		if (new->pfn < vpfn->pfn)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &domain->pfn_list);
+	mutex_unlock(&domain->pfn_list_lock);
+}
+
+/* must be called with domain->pfn_list_lock held */
+static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
+{
+	rb_erase(&old->node, &domain->pfn_list);
+}
+
 struct vwork {
 	struct mm_struct	*mm;
 	long			npage;
@@ -228,20 +311,29 @@ static int put_pfn(unsigned long pfn, int prot)
 	return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+			 int prot, unsigned long *pfn)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct mm_struct *local_mm = mm;
 	int ret = -EFAULT;
 
-	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+	if (!local_mm && !current->mm)
+		return -ENODEV;
+
+	if (!local_mm)
+		local_mm = current->mm;
+
+	down_read(&local_mm->mmap_sem);
+	if (get_user_pages_remote(NULL, local_mm, vaddr, 1,
+				!!(prot & IOMMU_WRITE), 0, page, NULL) == 1) {
 		*pfn = page_to_pfn(page[0]);
-		return 0;
+		ret = 0;
+		goto done_pfn;
 	}
 
-	down_read(&current->mm->mmap_sem);
-
-	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
 		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
@@ -249,7 +341,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
 			ret = 0;
 	}
 
-	up_read(&current->mm->mmap_sem);
+done_pfn:
+	up_read(&local_mm->mmap_sem);
 
 	return ret;
 }
@@ -259,18 +352,19 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages(unsigned long vaddr, long npage,
-			   int prot, unsigned long *pfn_base)
+static long vfio_pin_pages_internal(struct vfio_domain *domain,
+				    unsigned long vaddr, long npage,
+				    int prot, unsigned long *pfn_base)
 {
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	bool lock_cap = capable(CAP_IPC_LOCK);
 	long ret, i;
 	bool rsvd;
 
-	if (!current->mm)
+	if (!domain)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
+	ret = vaddr_get_pfn(domain->mm, vaddr, prot, pfn_base);
 	if (ret)
 		return ret;
 
@@ -293,7 +387,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
 		unsigned long pfn = 0;
 
-		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		ret = vaddr_get_pfn(domain->mm, vaddr, prot, &pfn);
 		if (ret)
 			break;
 
@@ -318,20 +412,165 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	return i;
 }
 
-static long vfio_unpin_pages(unsigned long pfn, long npage,
-			     int prot, bool do_accounting)
+static long vfio_unpin_pages_internal(struct vfio_domain *domain,
+				      unsigned long pfn, long npage, int prot,
+				      bool do_accounting)
 {
 	unsigned long unlocked = 0;
 	long i;
 
+	if (!domain)
+		return -ENODEV;
+
 	for (i = 0; i < npage; i++)
 		unlocked += put_pfn(pfn++, prot);
 
 	if (do_accounting)
 		vfio_lock_acct(-unlocked);
+	return unlocked;
+}
+
+/*
+ * Pin a set of guest PFNs and return their associated host PFNs for the
+ * API supported domain only.
+ * @vaddr [in]: array of guest PFNs
+ * @npage [in]: count of array elements
+ * @prot [in] : protection flags
+ * @pfn_base[out] : array of host PFNs
+ */
+long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
+		   int prot, dma_addr_t *pfn_base)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	int i = 0, ret = 0;
+	long retpage;
+	unsigned long remote_vaddr = 0;
+	dma_addr_t *pfn = pfn_base;
+	struct vfio_dma *dma;
+
+	if (!iommu || !vaddr || !pfn_base)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (!iommu->mediated_domain) {
+		ret = -EINVAL;
+		goto pin_done;
+	}
+
+	domain = iommu->mediated_domain;
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p, *lpfn;
+		unsigned long tpfn;
+		dma_addr_t iova;
+		long pg_cnt = 1;
+
+		iova = vaddr[i] << PAGE_SHIFT;
+
+		dma = vfio_find_dma(iommu, iova, 0 /*  size */);
+		if (!dma) {
+			ret = -EINVAL;
+			goto pin_done;
+		}
+
+		remote_vaddr = dma->vaddr + iova - dma->iova;
+
+		retpage = vfio_pin_pages_internal(domain, remote_vaddr,
+						  pg_cnt, prot, &tpfn);
+		if (retpage <= 0) {
+			WARN_ON(!retpage);
+			ret = (int)retpage;
+			goto pin_done;
+		}
+
+		pfn[i] = tpfn;
+
+		/* check whether the pfn already exists in the list */
+		p = vfio_find_pfn(domain, tpfn);
+		if (p) {
+			atomic_inc(&p->ref_count);
+			continue;
+		}
+
+		/* add to pfn_list */
+		lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
+		if (!lpfn) {
+			ret = -ENOMEM;
+			goto pin_done;
+		}
+		lpfn->vaddr = remote_vaddr;
+		lpfn->iova = iova;
+		lpfn->pfn = pfn[i];
+		lpfn->npage = 1;
+		lpfn->prot = prot;
+		atomic_inc(&lpfn->ref_count);
+		vfio_link_pfn(domain, lpfn);
+	}
+
+	ret = i;
+
+pin_done:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_pin_pages);
+
+static int vfio_unpin_pfn(struct vfio_domain *domain,
+			  struct vfio_pfn *vpfn, bool do_accounting)
+{
+	int ret;
+
+	ret = vfio_unpin_pages_internal(domain, vpfn->pfn, vpfn->npage,
+					vpfn->prot, do_accounting);
+
+	if (ret > 0 && atomic_dec_and_test(&vpfn->ref_count)) {
+		vfio_unlink_pfn(domain, vpfn);
+		kfree(vpfn);
+	}
+
+	return ret;
+}
+
+/*
+ * Unpin a set of host PFNs for the API supported domain only.
+ * @pfn	[in] : array of host PFNs to be unpinned.
+ * @npage [in] : count of elements in the array, i.e. number of pages
+ * @prot [in] : protection flags
+ */
+long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
+		     int prot)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	long unlocked = 0;
+	int i;
+
+	if (!iommu || !pfn)
+		return -EINVAL;
+
+	if (!iommu->mediated_domain)
+		return -EINVAL;
+
+	domain = iommu->mediated_domain;
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+
+		/* verify the pfn exists in pfn_list */
+		p = vfio_find_pfn(domain, *(pfn + i));
+		if (!p)
+			continue;
+
+		mutex_lock(&domain->pfn_list_lock);
+		unlocked += vfio_unpin_pfn(domain, p, true);
+		mutex_unlock(&domain->pfn_list_lock);
+	}
 
 	return unlocked;
 }
+EXPORT_SYMBOL(vfio_unpin_pages);
 
 static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
@@ -341,6 +580,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 
 	if (!dma->size)
 		return;
+
+	if (list_empty(&iommu->domain_list))
+		return;
+
 	/*
 	 * We use the IOMMU to track the physical addresses, otherwise we'd
 	 * need a much more complicated tracking system.  Unfortunately that
@@ -382,9 +625,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		if (WARN_ON(!unmapped))
 			break;
 
-		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
-					     unmapped >> PAGE_SHIFT,
-					     dma->prot, false);
+		unlocked += vfio_unpin_pages_internal(domain,
+						phys >> PAGE_SHIFT,
+						unmapped >> PAGE_SHIFT,
+						dma->prot, false);
 		iova += unmapped;
 
 		cond_resched();
@@ -517,6 +761,9 @@ static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
 	long i;
 	int ret;
 
+	if (domain->mediated_device)
+		return -EINVAL;
+
 	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
 		ret = iommu_map(domain->domain, iova,
 				(phys_addr_t)pfn << PAGE_SHIFT,
@@ -537,6 +784,9 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
 	struct vfio_domain *d;
 	int ret;
 
+	if (list_empty(&iommu->domain_list))
+		return 0;
+
 	list_for_each_entry(d, &iommu->domain_list, next) {
 		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
 				npage << PAGE_SHIFT, prot | d->prot);
@@ -569,6 +819,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	uint64_t mask;
 	struct vfio_dma *dma;
 	unsigned long pfn;
+	struct vfio_domain *domain = NULL;
 
 	/* Verify that none of our __u64 fields overflow */
 	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
@@ -611,10 +862,21 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	/* Insert zero-sized and grow as we map chunks of it */
 	vfio_link_dma(iommu, dma);
 
+	/*
+	 * Skip pin and map if the domain list is empty
+	 */
+	if (list_empty(&iommu->domain_list)) {
+		dma->size = size;
+		goto map_done;
+	}
+
+	domain = list_first_entry(&iommu->domain_list,
+				  struct vfio_domain, next);
+
 	while (size) {
 		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages(vaddr + dma->size,
-				       size >> PAGE_SHIFT, prot, &pfn);
+		npage = vfio_pin_pages_internal(domain, vaddr + dma->size,
+						size >> PAGE_SHIFT, prot, &pfn);
 		if (npage <= 0) {
 			WARN_ON(!npage);
 			ret = (int)npage;
@@ -624,7 +886,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		/* Map it! */
 		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
 		if (ret) {
-			vfio_unpin_pages(pfn, npage, prot, true);
+			vfio_unpin_pages_internal(domain, pfn, npage,
+						  prot, true);
 			break;
 		}
 
@@ -635,6 +898,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (ret)
 		vfio_remove_dma(iommu, dma);
 
+map_done:
 	mutex_unlock(&iommu->lock);
 	return ret;
 }
@@ -658,6 +922,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 	struct rb_node *n;
 	int ret;
 
+	if (domain->mediated_device)
+		return 0;
+
 	/* Arbitrarily pick the first domain in the list for lookups */
 	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
 	n = rb_first(&iommu->dma_list);
@@ -716,6 +983,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	struct page *pages;
 	int ret, order = get_order(PAGE_SIZE * 2);
 
+	if (domain->mediated_device)
+		return;
+
 	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
 	if (!pages)
 		return;
@@ -734,11 +1004,25 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	__free_pages(pages, order);
 }
 
+static struct vfio_group *is_iommu_group_present(struct vfio_domain *domain,
+				   struct iommu_group *iommu_group)
+{
+	struct vfio_group *g;
+
+	list_for_each_entry(g, &domain->group_list, next) {
+		if (g->iommu_group != iommu_group)
+			continue;
+		return g;
+	}
+
+	return NULL;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
-	struct vfio_group *group, *g;
+	struct vfio_group *group;
 	struct vfio_domain *domain, *d;
 	struct bus_type *bus = NULL;
 	int ret;
@@ -746,14 +1030,21 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	mutex_lock(&iommu->lock);
 
 	list_for_each_entry(d, &iommu->domain_list, next) {
-		list_for_each_entry(g, &d->group_list, next) {
-			if (g->iommu_group != iommu_group)
-				continue;
+		if (is_iommu_group_present(d, iommu_group)) {
+			mutex_unlock(&iommu->lock);
+			return -EINVAL;
+		}
+	}
 
+#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
+	if (iommu->mediated_domain) {
+		if (is_iommu_group_present(iommu->mediated_domain,
+					   iommu_group)) {
 			mutex_unlock(&iommu->lock);
 			return -EINVAL;
 		}
 	}
+#endif
 
 	group = kzalloc(sizeof(*group), GFP_KERNEL);
 	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
@@ -769,6 +1060,36 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_free;
 
+#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
+	if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
+		struct mdev_device *mdev = NULL;
+
+		mdev = mdev_get_device_by_group(iommu_group);
+		if (!mdev)
+			goto out_free;
+
+		mdev->iommu_data = iommu;
+		group->mdev = mdev;
+
+		if (iommu->mediated_domain) {
+			list_add(&group->next,
+				 &iommu->mediated_domain->group_list);
+			kfree(domain);
+			mutex_unlock(&iommu->lock);
+			return 0;
+		}
+		domain->mediated_device = true;
+		domain->mm = current->mm;
+		INIT_LIST_HEAD(&domain->group_list);
+		list_add(&group->next, &domain->group_list);
+		domain->pfn_list = RB_ROOT;
+		mutex_init(&domain->pfn_list_lock);
+		iommu->mediated_domain = domain;
+		mutex_unlock(&iommu->lock);
+		return 0;
+	}
+#endif
+
 	domain->domain = iommu_domain_alloc(bus);
 	if (!domain->domain) {
 		ret = -EIO;
@@ -859,6 +1180,20 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
 		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
 
+#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
+static void vfio_iommu_unpin_api_domain(struct vfio_domain *domain)
+{
+	struct rb_node *node;
+
+	mutex_lock(&domain->pfn_list_lock);
+	while ((node = rb_first(&domain->pfn_list))) {
+		vfio_unpin_pfn(domain,
+				rb_entry(node, struct vfio_pfn, node), false);
+	}
+	mutex_unlock(&domain->pfn_list_lock);
+}
+#endif
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -868,31 +1203,55 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(domain, &iommu->domain_list, next) {
-		list_for_each_entry(group, &domain->group_list, next) {
-			if (group->iommu_group != iommu_group)
-				continue;
+#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
+	if (iommu->mediated_domain) {
+		domain = iommu->mediated_domain;
+		group = is_iommu_group_present(domain, iommu_group);
+		if (group) {
+			if (group->mdev) {
+				group->mdev->iommu_data = NULL;
+				mdev_put_device(group->mdev);
+			}
+			list_del(&group->next);
+			kfree(group);
+
+			if (list_empty(&domain->group_list)) {
+				vfio_iommu_unpin_api_domain(domain);
+
+				if (list_empty(&iommu->domain_list))
+					vfio_iommu_unmap_unpin_all(iommu);
+
+				kfree(domain);
+				iommu->mediated_domain = NULL;
+			}
+		}
+	}
+#endif
 
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		group = is_iommu_group_present(domain, iommu_group);
+		if (group) {
 			iommu_detach_group(domain->domain, iommu_group);
 			list_del(&group->next);
 			kfree(group);
 			/*
 			 * Group ownership provides privilege, if the group
 			 * list is empty, the domain goes away.  If it's the
-			 * last domain, then all the mappings go away too.
+			 * last domain with iommu and the API-only domain
+			 * doesn't exist, then all the mappings go away too.
 			 */
 			if (list_empty(&domain->group_list)) {
-				if (list_is_singular(&iommu->domain_list))
+				if (list_is_singular(&iommu->domain_list) &&
+				    !iommu->mediated_domain)
 					vfio_iommu_unmap_unpin_all(iommu);
 				iommu_domain_free(domain->domain);
 				list_del(&domain->next);
 				kfree(domain);
 			}
-			goto done;
+			break;
 		}
 	}
 
-done:
 	mutex_unlock(&iommu->lock);
 }
 
@@ -930,8 +1289,28 @@ static void vfio_iommu_type1_release(void *iommu_data)
 	struct vfio_domain *domain, *domain_tmp;
 	struct vfio_group *group, *group_tmp;
 
+#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
+	if (iommu->mediated_domain) {
+		domain = iommu->mediated_domain;
+		list_for_each_entry_safe(group, group_tmp,
+					 &domain->group_list, next) {
+			if (group->mdev) {
+				group->mdev->iommu_data = NULL;
+				mdev_put_device(group->mdev);
+			}
+			list_del(&group->next);
+			kfree(group);
+		}
+		vfio_iommu_unpin_api_domain(domain);
+		kfree(domain);
+		iommu->mediated_domain = NULL;
+	}
+#endif
 	vfio_iommu_unmap_unpin_all(iommu);
 
+	if (list_empty(&iommu->domain_list))
+		goto release_exit;
+
 	list_for_each_entry_safe(domain, domain_tmp,
 				 &iommu->domain_list, next) {
 		list_for_each_entry_safe(group, group_tmp,
@@ -945,6 +1324,7 @@ static void vfio_iommu_type1_release(void *iommu_data)
 		kfree(domain);
 	}
 
+release_exit:
 	kfree(iommu);
 }
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 431b824b0d3e..0a907bb33426 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -134,6 +134,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
 }
 #endif /* CONFIG_EEH */
 
+extern long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
+			   int prot, dma_addr_t *pfn_base);
+
+extern long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
+			     int prot);
+
 /*
  * IRQfd - generic
  */
-- 
2.7.0


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [Qemu-devel] [PATCH 3/3] VFIO Type1 IOMMU: Add support for mediated devices
@ 2016-06-20 16:31   ` Kirti Wankhede
  0 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-20 16:31 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	bjsdjshi, Kirti Wankhede

VFIO Type1 IOMMU driver is designed for the devices which are IOMMU
capable. Mediated device only uses IOMMU TYPE1 API, the underlying
hardware can be managed by an IOMMU domain.

This change exports functions to pin and unpin pages for mediated devices.
It maintains data of pinned pages for mediated domain. This data is used to
verify unpinning request and to unpin remaining pages from detach_group()
if there are any.

Aim of this change is:
- To use most of the code of IOMMU driver for mediated devices
- To support direct assigned device and mediated device by single module

Updated the change to keep mediated domain structure out of domain_list.

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- two GPU pass through

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
---
 drivers/vfio/vfio_iommu_type1.c | 444 +++++++++++++++++++++++++++++++++++++---
 include/linux/vfio.h            |   6 +
 2 files changed, 418 insertions(+), 32 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e93cedb..f17dd104fe27 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/workqueue.h>
+#include <linux/mdev.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -55,6 +56,7 @@ MODULE_PARM_DESC(disable_hugepages,
 
 struct vfio_iommu {
 	struct list_head	domain_list;
+	struct vfio_domain	*mediated_domain;
 	struct mutex		lock;
 	struct rb_root		dma_list;
 	bool			v2;
@@ -67,6 +69,13 @@ struct vfio_domain {
 	struct list_head	group_list;
 	int			prot;		/* IOMMU_CACHE */
 	bool			fgsp;		/* Fine-grained super pages */
+
+	/* Domain for mediated device which is without physical IOMMU */
+	bool			mediated_device;
+
+	struct mm_struct	*mm;
+	struct rb_root		pfn_list;	/* pinned Host pfn list */
+	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
 };
 
 struct vfio_dma {
@@ -79,10 +88,26 @@ struct vfio_dma {
 
 struct vfio_group {
 	struct iommu_group	*iommu_group;
+#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
+	struct mdev_device	*mdev;
+#endif
 	struct list_head	next;
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_pfn {
+	struct rb_node		node;
+	unsigned long		vaddr;		/* virtual addr */
+	dma_addr_t		iova;		/* IOVA */
+	unsigned long		npage;		/* number of pages */
+	unsigned long		pfn;		/* Host pfn */
+	size_t			prot;
+	atomic_t		ref_count;
+};
+
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +155,64 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
 	rb_erase(&old->node, &iommu->dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
+				      unsigned long pfn)
+{
+	struct rb_node *node;
+	struct vfio_pfn *vpfn, *ret = NULL;
+
+	mutex_lock(&domain->pfn_list_lock);
+	node = domain->pfn_list.rb_node;
+
+	while (node) {
+		vpfn = rb_entry(node, struct vfio_pfn, node);
+
+		if (pfn < vpfn->pfn)
+			node = node->rb_left;
+		else if (pfn > vpfn->pfn)
+			node = node->rb_right;
+		else {
+			ret = vpfn;
+			break;
+		}
+	}
+
+	mutex_unlock(&domain->pfn_list_lock);
+	return ret;
+}
+
+static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
+{
+	struct rb_node **link, *parent = NULL;
+	struct vfio_pfn *vpfn;
+
+	mutex_lock(&domain->pfn_list_lock);
+	link = &domain->pfn_list.rb_node;
+	while (*link) {
+		parent = *link;
+		vpfn = rb_entry(parent, struct vfio_pfn, node);
+
+		if (new->pfn < vpfn->pfn)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &domain->pfn_list);
+	mutex_unlock(&domain->pfn_list_lock);
+}
+
+/* call by holding domain->pfn_list_lock */
+static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
+{
+	rb_erase(&old->node, &domain->pfn_list);
+}
+
 struct vwork {
 	struct mm_struct	*mm;
 	long			npage;
@@ -228,20 +311,29 @@ static int put_pfn(unsigned long pfn, int prot)
 	return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+			 int prot, unsigned long *pfn)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct mm_struct *local_mm = mm;
 	int ret = -EFAULT;
 
-	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+	if (!local_mm && !current->mm)
+		return -ENODEV;
+
+	if (!local_mm)
+		local_mm = current->mm;
+
+	down_read(&local_mm->mmap_sem);
+	if (get_user_pages_remote(NULL, local_mm, vaddr, 1,
+				!!(prot & IOMMU_WRITE), 0, page, NULL) == 1) {
 		*pfn = page_to_pfn(page[0]);
-		return 0;
+		ret = 0;
+		goto done_pfn;
 	}
 
-	down_read(&current->mm->mmap_sem);
-
-	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
 		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
@@ -249,7 +341,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
 			ret = 0;
 	}
 
-	up_read(&current->mm->mmap_sem);
+done_pfn:
+	up_read(&local_mm->mmap_sem);
 
 	return ret;
 }
@@ -259,18 +352,19 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages(unsigned long vaddr, long npage,
-			   int prot, unsigned long *pfn_base)
+static long vfio_pin_pages_internal(struct vfio_domain *domain,
+				    unsigned long vaddr, long npage,
+				    int prot, unsigned long *pfn_base)
 {
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	bool lock_cap = capable(CAP_IPC_LOCK);
 	long ret, i;
 	bool rsvd;
 
-	if (!current->mm)
+	if (!domain)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
+	ret = vaddr_get_pfn(domain->mm, vaddr, prot, pfn_base);
 	if (ret)
 		return ret;
 
@@ -293,7 +387,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
 		unsigned long pfn = 0;
 
-		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		ret = vaddr_get_pfn(domain->mm, vaddr, prot, &pfn);
 		if (ret)
 			break;
 
@@ -318,20 +412,165 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	return i;
 }
 
-static long vfio_unpin_pages(unsigned long pfn, long npage,
-			     int prot, bool do_accounting)
+static long vfio_unpin_pages_internal(struct vfio_domain *domain,
+				      unsigned long pfn, long npage, int prot,
+				      bool do_accounting)
 {
 	unsigned long unlocked = 0;
 	long i;
 
+	if (!domain)
+		return -ENODEV;
+
 	for (i = 0; i < npage; i++)
 		unlocked += put_pfn(pfn++, prot);
 
 	if (do_accounting)
 		vfio_lock_acct(-unlocked);
+	return unlocked;
+}
+
+/*
+ * Pin a set of guest PFNs and return their associated host PFNs for API
+ * supported domain only.
+ * @vaddr [in]: array of guest PFNs
+ * @npage [in]: count of array elements
+ * @prot [in] : protection flags
+ * @pfn_base[out] : array of host PFNs
+ */
+long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
+		   int prot, dma_addr_t *pfn_base)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	int i = 0, ret = 0;
+	long retpage;
+	unsigned long remote_vaddr = 0;
+	dma_addr_t *pfn = pfn_base;
+	struct vfio_dma *dma;
+
+	if (!iommu || !vaddr || !pfn_base)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (!iommu->mediated_domain) {
+		ret = -EINVAL;
+		goto pin_done;
+	}
+
+	domain = iommu->mediated_domain;
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p, *lpfn;
+		unsigned long tpfn;
+		dma_addr_t iova;
+		long pg_cnt = 1;
+
+		iova = vaddr[i] << PAGE_SHIFT;
+
+		dma = vfio_find_dma(iommu, iova, 0 /*  size */);
+		if (!dma) {
+			ret = -EINVAL;
+			goto pin_done;
+		}
+
+		remote_vaddr = dma->vaddr + iova - dma->iova;
+
+		retpage = vfio_pin_pages_internal(domain, remote_vaddr,
+						  pg_cnt, prot, &tpfn);
+		if (retpage <= 0) {
+			WARN_ON(!retpage);
+			ret = (int)retpage;
+			goto pin_done;
+		}
+
+		pfn[i] = tpfn;
+
+		/* search if pfn exist */
+		p = vfio_find_pfn(domain, tpfn);
+		if (p) {
+			atomic_inc(&p->ref_count);
+			continue;
+		}
+
+		/* add to pfn_list */
+		lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
+		if (!lpfn) {
+			ret = -ENOMEM;
+			goto pin_done;
+		}
+		lpfn->vaddr = remote_vaddr;
+		lpfn->iova = iova;
+		lpfn->pfn = pfn[i];
+		lpfn->npage = 1;
+		lpfn->prot = prot;
+		atomic_inc(&lpfn->ref_count);
+		vfio_link_pfn(domain, lpfn);
+	}
+
+	ret = i;
+
+pin_done:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_pin_pages);
+
+static int vfio_unpin_pfn(struct vfio_domain *domain,
+			  struct vfio_pfn *vpfn, bool do_accounting)
+{
+	int ret;
+
+	ret = vfio_unpin_pages_internal(domain, vpfn->pfn, vpfn->npage,
+					vpfn->prot, do_accounting);
+
+	if (ret > 0 && atomic_dec_and_test(&vpfn->ref_count)) {
+		vfio_unlink_pfn(domain, vpfn);
+		kfree(vpfn);
+	}
+
+	return ret;
+}
+
+/*
+ * Unpin set of host PFNs for API supported domain only.
+ * @pfn	[in] : array of host PFNs to be unpinned.
+ * @npage [in] :count of elements in array, that is number of pages.
+ * @prot [in] : protection flags
+ */
+long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
+		     int prot)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	long unlocked = 0;
+	int i;
+
+	if (!iommu || !pfn)
+		return -EINVAL;
+
+	if (!iommu->mediated_domain)
+		return -EINVAL;
+
+	domain = iommu->mediated_domain;
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+
+		/* verify if pfn exist in pfn_list */
+		p = vfio_find_pfn(domain, *(pfn + i));
+		if (!p)
+			continue;
+
+		mutex_lock(&domain->pfn_list_lock);
+		unlocked += vfio_unpin_pfn(domain, p, true);
+		mutex_unlock(&domain->pfn_list_lock);
+	}
 
 	return unlocked;
 }
+EXPORT_SYMBOL(vfio_unpin_pages);
 
 static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
@@ -341,6 +580,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 
 	if (!dma->size)
 		return;
+
+	if (list_empty(&iommu->domain_list))
+		return;
+
 	/*
 	 * We use the IOMMU to track the physical addresses, otherwise we'd
 	 * need a much more complicated tracking system.  Unfortunately that
@@ -382,9 +625,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		if (WARN_ON(!unmapped))
 			break;
 
-		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
-					     unmapped >> PAGE_SHIFT,
-					     dma->prot, false);
+		unlocked += vfio_unpin_pages_internal(domain,
+						phys >> PAGE_SHIFT,
+						unmapped >> PAGE_SHIFT,
+						dma->prot, false);
 		iova += unmapped;
 
 		cond_resched();
@@ -517,6 +761,9 @@ static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
 	long i;
 	int ret;
 
+	if (domain->mediated_device)
+		return -EINVAL;
+
 	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
 		ret = iommu_map(domain->domain, iova,
 				(phys_addr_t)pfn << PAGE_SHIFT,
@@ -537,6 +784,9 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
 	struct vfio_domain *d;
 	int ret;
 
+	if (list_empty(&iommu->domain_list))
+		return 0;
+
 	list_for_each_entry(d, &iommu->domain_list, next) {
 		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
 				npage << PAGE_SHIFT, prot | d->prot);
@@ -569,6 +819,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	uint64_t mask;
 	struct vfio_dma *dma;
 	unsigned long pfn;
+	struct vfio_domain *domain = NULL;
 
 	/* Verify that none of our __u64 fields overflow */
 	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
@@ -611,10 +862,21 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	/* Insert zero-sized and grow as we map chunks of it */
 	vfio_link_dma(iommu, dma);
 
+	/*
+	 * Skip pin and map if domain list is empty
+	 */
+	if (list_empty(&iommu->domain_list)) {
+		dma->size = size;
+		goto map_done;
+	}
+
+	domain = list_first_entry(&iommu->domain_list,
+				  struct vfio_domain, next);
+
 	while (size) {
 		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages(vaddr + dma->size,
-				       size >> PAGE_SHIFT, prot, &pfn);
+		npage = vfio_pin_pages_internal(domain, vaddr + dma->size,
+						size >> PAGE_SHIFT, prot, &pfn);
 		if (npage <= 0) {
 			WARN_ON(!npage);
 			ret = (int)npage;
@@ -624,7 +886,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		/* Map it! */
 		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
 		if (ret) {
-			vfio_unpin_pages(pfn, npage, prot, true);
+			vfio_unpin_pages_internal(domain, pfn, npage,
+						  prot, true);
 			break;
 		}
 
@@ -635,6 +898,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (ret)
 		vfio_remove_dma(iommu, dma);
 
+map_done:
 	mutex_unlock(&iommu->lock);
 	return ret;
 }
@@ -658,6 +922,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 	struct rb_node *n;
 	int ret;
 
+	if (domain->mediated_device)
+		return 0;
+
 	/* Arbitrarily pick the first domain in the list for lookups */
 	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
 	n = rb_first(&iommu->dma_list);
@@ -716,6 +983,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	struct page *pages;
 	int ret, order = get_order(PAGE_SIZE * 2);
 
+	if (domain->mediated_device)
+		return;
+
 	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
 	if (!pages)
 		return;
@@ -734,11 +1004,25 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	__free_pages(pages, order);
 }
 
+static struct vfio_group *is_iommu_group_present(struct vfio_domain *domain,
+				   struct iommu_group *iommu_group)
+{
+	struct vfio_group *g;
+
+	list_for_each_entry(g, &domain->group_list, next) {
+		if (g->iommu_group != iommu_group)
+			continue;
+		return g;
+	}
+
+	return NULL;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
-	struct vfio_group *group, *g;
+	struct vfio_group *group;
 	struct vfio_domain *domain, *d;
 	struct bus_type *bus = NULL;
 	int ret;
@@ -746,14 +1030,21 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	mutex_lock(&iommu->lock);
 
 	list_for_each_entry(d, &iommu->domain_list, next) {
-		list_for_each_entry(g, &d->group_list, next) {
-			if (g->iommu_group != iommu_group)
-				continue;
+		if (is_iommu_group_present(d, iommu_group)) {
+			mutex_unlock(&iommu->lock);
+			return -EINVAL;
+		}
+	}
 
+#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
+	if (iommu->mediated_domain) {
+		if (is_iommu_group_present(iommu->mediated_domain,
+					   iommu_group)) {
 			mutex_unlock(&iommu->lock);
 			return -EINVAL;
 		}
 	}
+#endif
 
 	group = kzalloc(sizeof(*group), GFP_KERNEL);
 	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
@@ -769,6 +1060,36 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_free;
 
+#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
+	if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
+		struct mdev_device *mdev = NULL;
+
+		mdev = mdev_get_device_by_group(iommu_group);
+		if (!mdev) {
+			ret = -EINVAL;
+			goto out_free;
+		}
+
+		mdev->iommu_data = iommu;
+		group->mdev = mdev;
+
+		if (iommu->mediated_domain) {
+			list_add(&group->next,
+				 &iommu->mediated_domain->group_list);
+			kfree(domain);
+			mutex_unlock(&iommu->lock);
+			return 0;
+		}
+		domain->mediated_device = true;
+		domain->mm = current->mm;
+		INIT_LIST_HEAD(&domain->group_list);
+		list_add(&group->next, &domain->group_list);
+		domain->pfn_list = RB_ROOT;
+		mutex_init(&domain->pfn_list_lock);
+		iommu->mediated_domain = domain;
+		mutex_unlock(&iommu->lock);
+		return 0;
+	}
+#endif
+
 	domain->domain = iommu_domain_alloc(bus);
 	if (!domain->domain) {
 		ret = -EIO;
@@ -859,6 +1180,20 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
 		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
 
+#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
+static void vfio_iommu_unpin_api_domain(struct vfio_domain *domain)
+{
+	struct rb_node *node;
+
+	mutex_lock(&domain->pfn_list_lock);
+	while ((node = rb_first(&domain->pfn_list))) {
+		vfio_unpin_pfn(domain,
+				rb_entry(node, struct vfio_pfn, node), false);
+	}
+	mutex_unlock(&domain->pfn_list_lock);
+}
+#endif
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -868,31 +1203,55 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(domain, &iommu->domain_list, next) {
-		list_for_each_entry(group, &domain->group_list, next) {
-			if (group->iommu_group != iommu_group)
-				continue;
+#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
+	if (iommu->mediated_domain) {
+		domain = iommu->mediated_domain;
+		group = is_iommu_group_present(domain, iommu_group);
+		if (group) {
+			if (group->mdev) {
+				group->mdev->iommu_data = NULL;
+				mdev_put_device(group->mdev);
+			}
+			list_del(&group->next);
+			kfree(group);
+
+			if (list_empty(&domain->group_list)) {
+				vfio_iommu_unpin_api_domain(domain);
+
+				if (list_empty(&iommu->domain_list))
+					vfio_iommu_unmap_unpin_all(iommu);
+
+				kfree(domain);
+				iommu->mediated_domain = NULL;
+			}
+		}
+	}
+#endif
 
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		group = is_iommu_group_present(domain, iommu_group);
+		if (group) {
 			iommu_detach_group(domain->domain, iommu_group);
 			list_del(&group->next);
 			kfree(group);
 			/*
 			 * Group ownership provides privilege, if the group
 			 * list is empty, the domain goes away.  If it's the
-			 * last domain, then all the mappings go away too.
+			 * last IOMMU-backed domain and no API-only (mediated)
+			 * domain exists, then all the mappings go away too.
 			 */
 			if (list_empty(&domain->group_list)) {
-				if (list_is_singular(&iommu->domain_list))
+				if (list_is_singular(&iommu->domain_list) &&
+				    !iommu->mediated_domain)
 					vfio_iommu_unmap_unpin_all(iommu);
 				iommu_domain_free(domain->domain);
 				list_del(&domain->next);
 				kfree(domain);
 			}
-			goto done;
+			break;
 		}
 	}
 
-done:
 	mutex_unlock(&iommu->lock);
 }
 
@@ -930,8 +1289,28 @@ static void vfio_iommu_type1_release(void *iommu_data)
 	struct vfio_domain *domain, *domain_tmp;
 	struct vfio_group *group, *group_tmp;
 
+#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
+	if (iommu->mediated_domain) {
+		domain = iommu->mediated_domain;
+		list_for_each_entry_safe(group, group_tmp,
+					 &domain->group_list, next) {
+			if (group->mdev) {
+				group->mdev->iommu_data = NULL;
+				mdev_put_device(group->mdev);
+			}
+			list_del(&group->next);
+			kfree(group);
+		}
+		vfio_iommu_unpin_api_domain(domain);
+		kfree(domain);
+		iommu->mediated_domain = NULL;
+	}
+#endif
 	vfio_iommu_unmap_unpin_all(iommu);
 
+	if (list_empty(&iommu->domain_list))
+		goto release_exit;
+
 	list_for_each_entry_safe(domain, domain_tmp,
 				 &iommu->domain_list, next) {
 		list_for_each_entry_safe(group, group_tmp,
@@ -945,6 +1324,7 @@ static void vfio_iommu_type1_release(void *iommu_data)
 		kfree(domain);
 	}
 
+release_exit:
 	kfree(iommu);
 }
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 431b824b0d3e..0a907bb33426 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -134,6 +134,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
 }
 #endif /* CONFIG_EEH */
 
+extern long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
+			   int prot, dma_addr_t *pfn_base);
+
+extern long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
+			     int prot);
+
 /*
  * IRQfd - generic
  */
-- 
2.7.0


* Re: [PATCH 1/3] Mediated device Core driver
  2016-06-20 16:31   ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-21  7:38     ` Jike Song
  -1 siblings, 0 replies; 51+ messages in thread
From: Jike Song @ 2016-06-21  7:38 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: shuai.ruan, kevin.tian, cjia, kvm, qemu-devel, alex.williamson,
	kraxel, pbonzini, bjsdjshi, zhiyuan.lv

On 06/21/2016 12:31 AM, Kirti Wankhede wrote:
> +
> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}
> +
> +static struct mdev_device *find_mdev_device(struct parent_device *parent,
> +					    uuid_le uuid, int instance)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +
> +	list_for_each_entry(p, &parent->mdev_list, next) {
> +		if ((uuid_le_cmp(p->uuid, uuid) == 0) &&
> +		    (p->instance == instance)) {
> +			mdev = p;
> +			break;
> +		}
> +	}
> +	return mdev;
> +}

Hi Kirti, Neo,

Thanks for the new version!

We just found that find_mdev_device() is necessary for the parent device
driver to locate an mdev by VM identity and mdev instance. For example, a
caller of your exported vfio-iommu API, vfio_pin_pages(), must have some
way to identify which address space it wants, and that identity is a
subfield of the mdev.

Would you mind exporting it? Or is there another way to achieve this?
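
For concreteness, here is a rough sketch of the call flow we have in mind.
Everything here is hypothetical illustration — my_vgpu_dma_map(), the parent
argument, and the error handling are made up; only find_mdev_device() and
vfio_pin_pages() refer to your patches:

/*
 * Hypothetical vendor-driver snippet, only to show why a lookup like
 * find_mdev_device() needs to be exported: before calling
 * vfio_pin_pages(), the caller must map a VM identity (uuid +
 * instance) to the mdev whose iommu_data selects the right guest
 * address space.
 */
static int my_vgpu_dma_map(struct parent_device *parent, uuid_le uuid,
			   int instance, dma_addr_t *gfn, long npage,
			   int prot, dma_addr_t *pfn_base)
{
	struct mdev_device *mdev;
	long pinned;

	/* This is the lookup we would like exported. */
	mdev = find_mdev_device(parent, uuid, instance);
	if (!mdev)
		return -ENODEV;

	/* mdev->iommu_data identifies the guest address space. */
	pinned = vfio_pin_pages(mdev->iommu_data, gfn, npage,
				prot, pfn_base);
	if (pinned <= 0)
		return pinned ? (int)pinned : -EFAULT;

	/* ... program device page tables with *pfn_base ... */
	return 0;
}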

--
Thanks,
Jike


> +
> +/* Should be called holding parent_devices.list_lock */
> +static struct parent_device *find_parent_device(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	WARN_ON(!mutex_is_locked(&parent_devices.list_lock));
> +	list_for_each_entry(p, &parent_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			parent = p;
> +			break;
> +		}
> +	}
> +	return parent;
> +}
> +
> +static void mdev_release_parent(struct kref *kref)
> +{
> +	struct parent_device *parent = container_of(kref, struct parent_device,
> +						    ref);
> +	kfree(parent);
> +}
> +
> +static
> +inline struct parent_device *mdev_get_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_get(&parent->ref);
> +
> +	return parent;
> +}
> +
> +static inline void mdev_put_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_put(&parent->ref, mdev_release_parent);
> +}
> +
> +static struct parent_device *mdev_get_parent_by_dev(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(p, &parent_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			parent = mdev_get_parent(p);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return parent;
> +}
> +
> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->create) {
> +		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
> +					mdev->instance, mdev_params);
> +		if (ret)
> +			goto create_ops_err;
> +	}
> +
> +	ret = mdev_add_attribute_group(&mdev->dev,
> +					parent->ops->mdev_attr_groups);
> +create_ops_err:
> +	mutex_unlock(&parent->ops_lock);
> +	return ret;
> +}
> +
> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	/*
> +	 * If the vendor driver doesn't return success, it doesn't
> +	 * support hot-unplug.
> +	 */
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->destroy) {
> +		ret = parent->ops->destroy(parent->dev, mdev->uuid,
> +					   mdev->instance);
> +		if (ret && !force) {
> +			ret = -EBUSY;
> +			goto destroy_ops_err;
> +		}
> +	}
> +	mdev_remove_attribute_group(&mdev->dev,
> +				    parent->ops->mdev_attr_groups);
> +destroy_ops_err:
> +	mutex_unlock(&parent->ops_lock);
> +
> +	return ret;
> +}
> +
> +static void mdev_release_device(struct kref *kref)
> +{
> +	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
> +	struct parent_device *parent = mdev->parent;
> +
> +	device_unregister(&mdev->dev);
> +	wake_up(&parent->release_done);
> +	mdev_put_parent(parent);
> +}
> +
> +struct mdev_device *mdev_get_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_get(&mdev->ref);
> +
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device);
> +
> +void mdev_put_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_put(&mdev->ref, mdev_release_device);
> +}
> +EXPORT_SYMBOL(mdev_put_device);
> +
> +/*
> + * Find first mediated device from given uuid and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +static struct mdev_device *mdev_get_first_device_by_uuid(uuid_le uuid)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(parent, &parent_devices.dev_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (uuid_le_cmp(p->uuid, uuid) == 0) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return mdev;
> +}
> +
> +/*
> + * Find mediated device from given iommu_group and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(parent, &parent_devices.dev_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (!p->group)
> +				continue;
> +
> +			if (iommu_group_id(p->group) == iommu_group_id(group)) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device_by_group);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +
> +	/* Check for duplicate */
> +	parent = find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +	list_add(&parent->next, &parent_devices.dev_list);
> +	mutex_unlock(&parent_devices.list_lock);
> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +	mutex_init(&parent->ops_lock);
> +	mutex_init(&parent->mdev_list_lock);
> +	INIT_LIST_HEAD(&parent->mdev_list);
> +	init_waitqueue_head(&parent->release_done);
> +
> +	ret = mdev_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_lock(&parent_devices.list_lock);
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_devices.list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_devices.list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a parent device
> + * @dev: device structure representing parent device.
> + *
> + * Remove device from list of registered parent devices. Give a chance to free
> + * existing mediated devices for given device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +	struct mdev_device *mdev, *n;
> +	int ret;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	parent = find_parent_device(dev);
> +
> +	if (!parent) {
> +		mutex_unlock(&parent_devices.list_lock);
> +		return;
> +	}
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	/*
> +	 * Remove parent from the list and remove create and destroy sysfs
> +	 * files so that no new mediated device could be created for this parent
> +	 */
> +	list_del(&parent->next);
> +	mdev_remove_sysfs_files(dev);
> +	mutex_unlock(&parent_devices.list_lock);
> +
> +	mutex_lock(&parent->ops_lock);
> +	mdev_remove_attribute_group(dev,
> +				    parent->ops->dev_attr_groups);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
> +		mdev_device_destroy_ops(mdev, true);
> +		list_del(&mdev->next);
> +		mdev_put_device(mdev);
> +	}
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	do {
> +		ret = wait_event_interruptible_timeout(parent->release_done,
> +				list_empty(&parent->mdev_list), HZ * 10);
> +		if (ret == -ERESTARTSYS) {
> +			dev_warn(dev, "Mediated devices are in use, task"
> +				      " \"%s\" (%d) "
> +				      "blocked until all are released",
> +				      current->comm, task_pid_nr(current));
> +		}
> +	} while (ret <= 0);
> +
> +	mdev_put_parent(parent);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev-sysfs
> + */
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev);
> +}
> +
> +int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +		       char *mdev_params)
> +{
> +	int ret;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	/* Check for duplicate */
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (mdev) {
> +		ret = -EEXIST;
> +		goto create_err;
> +	}
> +
> +	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> +	if (!mdev) {
> +		ret = -ENOMEM;
> +		goto create_err;
> +	}
> +
> +	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> +	mdev->instance = instance;
> +	mdev->parent = parent;
> +	mutex_init(&mdev->ops_lock);
> +	kref_init(&mdev->ref);
> +
> +	mdev->dev.parent  = dev;
> +	mdev->dev.bus     = &mdev_bus_type;
> +	mdev->dev.release = mdev_device_release;
> +	dev_set_name(&mdev->dev, "%pUb-%d", uuid.b, instance);
> +
> +	ret = device_register(&mdev->dev);
> +	if (ret) {
> +		put_device(&mdev->dev);
> +		goto create_err;
> +	}
> +
> +	ret = mdev_device_create_ops(mdev, mdev_params);
> +	if (ret)
> +		goto create_failed;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_add(&mdev->next, &parent->mdev_list);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	dev_dbg(&mdev->dev, "MDEV: created\n");
> +
> +	return ret;
> +
> +create_failed:
> +	device_unregister(&mdev->dev);
> +
> +create_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance)
> +{
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	int ret;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}
> +
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (!mdev) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}
> +
> +	ret = mdev_device_destroy_ops(mdev, false);
> +	if (ret)
> +		goto destroy_err;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_del(&mdev->next);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	mdev_put_parent(parent);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +
> +destroy_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +void mdev_device_supported_config(struct device *dev, char *str)
> +{
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +
> +	if (parent) {
> +		mutex_lock(&parent->ops_lock);
> +		if (parent->ops->supported_config)
> +			parent->ops->supported_config(parent->dev, str);
> +		mutex_unlock(&parent->ops_lock);
> +		mdev_put_parent(parent);
> +	}
> +}
> +
> +int mdev_device_start(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->start)
> +		ret = parent->ops->start(mdev->uuid);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	if (ret)
> +		pr_err("mdev_start failed  %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
> +
> +	mdev_put_device(mdev);
> +
> +	return ret;
> +}
> +
> +int mdev_device_shutdown(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->shutdown)
> +		ret = parent->ops->shutdown(mdev->uuid);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	if (ret)
> +		pr_err("mdev_shutdown failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +}
> +
> +static struct class mdev_class = {
> +	.name		= MDEV_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= mdev_class_attrs,
> +};
> +
> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	mutex_init(&parent_devices.list_lock);
> +	INIT_LIST_HEAD(&parent_devices.dev_list);
> +
> +	ret = class_register(&mdev_class);
> +	if (ret) {
> +		pr_err("Failed to register mdev class\n");
> +		return ret;
> +	}
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		class_unregister(&mdev_class);
> +		return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	mdev_bus_unregister();
> +	class_unregister(&mdev_class);
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
> new file mode 100644
> index 000000000000..f1aed541111d
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_driver.c
> @@ -0,0 +1,138 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdev_attach_iommu(struct mdev_device *mdev)
> +{
> +	int ret;
> +	struct iommu_group *group;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	ret = iommu_group_add_device(group, &mdev->dev);
> +	if (ret) {
> +		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdev->group = group;
> +
> +	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return ret;
> +}
> +
> +static void mdev_detach_iommu(struct mdev_device *mdev)
> +{
> +	iommu_group_remove_device(&mdev->dev);
> +	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdev_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	ret = mdev_attach_iommu(mdev);
> +	if (ret) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return ret;
> +	}
> +
> +	if (drv && drv->probe)
> +		ret = drv->probe(dev);
> +
> +	return ret;
> +}
> +
> +static int mdev_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdev_detach_iommu(mdev);
> +
> +	return 0;
> +}
> +
> +static int mdev_match(struct device *dev, struct device_driver *drv)
> +{
> +	struct mdev_driver *mdrv = to_mdev_driver(drv);
> +
> +	if (mdrv && mdrv->match)
> +		return mdrv->match(dev);
> +
> +	return 0;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.match		= mdev_match,
> +	.probe		= mdev_probe,
> +	.remove		= mdev_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/*
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: module owner of driver to be registered
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/*
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..991d7f796169
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,33 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +			char *mdev_params);
> +int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
> +void mdev_device_supported_config(struct device *dev, char *str);
> +int  mdev_device_start(uuid_le uuid);
> +int  mdev_device_shutdown(uuid_le uuid);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> new file mode 100644
> index 000000000000..48b66e40009e
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -0,0 +1,300 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +/* Static functions */
> +
> +#define UUID_CHAR_LENGTH	36
> +#define UUID_BYTE_LENGTH	16
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
> +
> +static inline bool is_uuid_sep(char sep)
> +{
> +	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
> +		return true;
> +	return false;
> +}
> +
> +static int uuid_parse(const char *str, uuid_le *uuid)
> +{
> +	int i;
> +
> +	if (strlen(str) < UUID_CHAR_LENGTH)
> +		return -EINVAL;
> +
> +	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
> +		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
> +			pr_err("%s err", __func__);
> +			return -EINVAL;
> +		}
> +
> +		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
> +		str += 2;
> +		if (is_uuid_sep(*str))
> +			str++;
> +	}
> +
> +	return 0;
> +}
> +
> +/* mdev sysfs Functions */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str, *ptr;
> +	ssize_t n;
> +
> +	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	ptr = str;
> +	mdev_device_supported_config(dev, str);
> +
> +	n = sprintf(buf, "%s\n", str);
> +	kfree(ptr);
> +
> +	return n;
> +}
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *instance_str, *mdev_params = NULL, *params = NULL;
> +	uuid_le uuid;
> +	uint32_t instance;
> +	int ret;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_create: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev instance not present %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	instance_str = strsep(&str, ":");
> +	if (!instance_str) {
> +		pr_err("mdev_create: Empty instance string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	ret = kstrtouint(instance_str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	if (str)
> +		params = mdev_params = kstrdup(str, GFP_KERNEL);
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_create: UUID parse error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	ret = mdev_device_create(dev, uuid, instance, mdev_params);
> +	if (ret)
> +		pr_err("mdev_create: Failed to create mdev device\n");
> +	else
> +		ret = count;
> +
> +create_error:
> +	kfree(params);
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	char *uuid_str, *str, *pstr;
> +	uuid_le uuid;
> +	unsigned int instance;
> +	int ret;
> +
> +	str = pstr = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	if (str == NULL) {
> +		pr_err("mdev_destroy: instance not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = kstrtouint(str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_destroy: instance parsing error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_destroy: UUID parse error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = mdev_device_destroy(dev, uuid, instance);
> +	if (ret == 0)
> +		ret = count;
> +
> +destroy_error:
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
> +			 const char *buf, size_t count)
> +{
> +	char *uuid_str, *ptr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_start: UUID parse error %s\n", buf);
> +		goto start_error;
> +	}
> +
> +	ret = mdev_device_start(uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +start_error:
> +	kfree(ptr);
> +	return ret;
> +}
> +
> +ssize_t mdev_shutdown_store(struct class *class, struct class_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *uuid_str, *ptr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_shutdown: UUID parse error %s\n", buf);
> +		goto shutdown_error;
> +	}
> +
> +	ret = mdev_device_shutdown(uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +shutdown_error:
> +	kfree(ptr);
> +	return ret;
> +}
> +
> +struct class_attribute mdev_class_attrs[] = {
> +	__ATTR_WO(mdev_start),
> +	__ATTR_WO(mdev_shutdown),
> +	__ATTR_NULL
> +};
> +
> +int mdev_create_sysfs_files(struct device *dev)
> +{
> +	int ret;
> +
> +	ret = sysfs_create_file(&dev->kobj,
> +				&dev_attr_mdev_supported_types.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_supported_types sysfs entry\n");
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_create sysfs entry\n");
> +		goto create_sysfs_failed;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_destroy sysfs entry\n");
> +		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +		goto create_sysfs_failed;
> +	}
> +
> +	return 0;
> +
> +create_sysfs_failed:
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	return ret;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +}
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..31b6f8572cfa
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,232 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +/* Common Data structures */
> +
> +struct pci_region_info {
> +	uint64_t start;
> +	uint64_t size;
> +	uint32_t flags;		/* VFIO region info flags */
> +};
> +
> +enum mdev_emul_space {
> +	EMUL_CONFIG_SPACE,	/* PCI configuration space */
> +	EMUL_IO,		/* I/O register space */
> +	EMUL_MMIO		/* Memory-mapped I/O space */
> +};
> +
> +struct parent_device;
> +
> +/*
> + * Mediated device
> + */
> +
> +struct mdev_device {
> +	struct device		dev;
> +	struct parent_device	*parent;
> +	struct iommu_group	*group;
> +	void			*iommu_data;
> +	uuid_le			uuid;
> +	uint32_t		instance;
> +
> +	/* internal only */
> +	struct kref		ref;
> +	struct mutex		ops_lock;
> +	struct list_head	next;
> +};
> +
> +
> +/**
> + * struct parent_ops - Structure to be registered for each parent device to
> + * register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Default attributes of the parent device.
> + * @mdev_attr_groups:	Default attributes of the mediated device.
> + * @supported_config:	Called to get information about supported types.
> + *			@dev : device structure of parent device.
> + *			@config: should return string listing supported config
> + *			Returns integer: success (0) or error (< 0)
> + * @create:		Called to allocate basic resources in parent device's
> + *			driver for a particular mediated device
> + *			@dev: parent device structure on which mediated device
> + *			      should be created
> + *			@uuid: VM's uuid for which VM it is intended to
> + *			@instance: mediated instance in that VM
> + *			@mdev_params: extra parameters required by parent
> + *			device's driver.
> + *			Returns integer: success (0) or error (< 0)
> + * @destroy:		Called to free resources in parent device's driver for
> + *			a mediated device instance of that VM.
> + *			@dev: parent device structure to which this mediated
> + *			      device points to.
> + *			@uuid: VM's uuid for which the mediated device belongs
> + *			@instance: mdev instance in that VM
> + *			Returns integer: success (0) or error (< 0)
> + *			If the VM is running when destroy() is called, the mdev
> + *			is being hot-unplugged. Return an error if the VM is
> + *			running and the driver doesn't support mediated device
> + *			hotplug.
> + * @start:		Called to initiate mediated device initialization
> + *			process in parent device's driver when VM boots before
> + *			VMM starts
> + *			@uuid: VM's UUID which is booting.
> + *			Returns integer: success (0) or error (< 0)
> + * @shutdown:		Called to tear down mediated device resources for the
> + *			VM.
> + *			@uuid: UUID of the VM that is shutting down.
> + *			Returns integer: success (0) or error (< 0)
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number of bytes to read
> + *			@address_space: specifies for which address space the
> + *			request is intended for - pci_config_space, IO register
> + *			space or MMIO space.
> + *			@addr: address.
> + *			Returns number of bytes read on success, or error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number of bytes to be written
> + *			@address_space: specifies for which address space the
> + *			request is intended for - pci_config_space, IO register
> + *			space or MMIO space.
> + *			@addr: address.
> + *			Returns number of bytes written on success, or error.
> + * @set_irqs:		Called to convey the interrupt configuration that the
> + *			VMM sets.
> + *			@mdev: mediated device structure
> + *			@flags, index, start, count and *data : same as that of
> + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> + * @get_region_info:	Called to get VFIO region size and flags of mediated
> + *			device.
> + *			@mdev: mediated device structure
> + *			@region_index: VFIO region index
> + *			@region_info: output, returns size and flags of
> + *				      requested region.
> + *			Returns integer: success (0) or error (< 0)
> + * @validate_map_request: Validate remap pfn request
> + *			@mdev: mediated device structure
> + *			@virtaddr: target user address to start at
> + *			@pfn: parent address of kernel memory, vendor driver
> + *			      can change if required.
> + *			@size: size of map area, vendor driver can change the
> + *			       size of map area if desired.
> + *			@prot: page protection flags for this mapping, vendor
> + *			       driver can change, if required.
> + *			Returns integer: success (0) or error (< 0)
> + *
> + * A parent device that supports mediated devices should be registered with the
> + * mdev module along with its parent_ops structure.
> + */
> +
> +struct parent_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +
> +	int	(*supported_config)(struct device *dev, char *config);
> +	int     (*create)(struct device *dev, uuid_le uuid,
> +			  uint32_t instance, char *mdev_params);
> +	int     (*destroy)(struct device *dev, uuid_le uuid,
> +			   uint32_t instance);
> +	int     (*start)(uuid_le uuid);
> +	int     (*shutdown)(uuid_le uuid);
> +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> +			enum mdev_emul_space address_space, loff_t pos);
> +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> +			 enum mdev_emul_space address_space, loff_t pos);
> +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> +			    unsigned int index, unsigned int start,
> +			    unsigned int count, void *data);
> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> +				 struct pci_region_info *region_info);
> +	int	(*validate_map_request)(struct mdev_device *vdev,
> +					unsigned long virtaddr,
> +					unsigned long *pfn, unsigned long *size,
> +					pgprot_t *prot);
> +};
> +
> +/*
> + * Parent Device
> + */
> +struct parent_device {
> +	struct device		*dev;
> +	const struct parent_ops	*ops;
> +
> +	/* internal */
> +	struct kref		ref;
> +	struct mutex		ops_lock;
> +	struct list_head	next;
> +	struct list_head	mdev_list;
> +	struct mutex		mdev_list_lock;
> +	wait_queue_head_t	release_done;
> +};
> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @match: called when new device or driver is added for this bus. Return 1 if
> + *	   given device can be handled by given driver and zero otherwise.
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	int  (*match)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}
> +
> +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
> +{
> +	return dev_get_drvdata(&mdev->dev);
> +}
> +
> +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
> +{
> +	dev_set_drvdata(&mdev->dev, data);
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct parent_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
> +extern void mdev_put_device(struct mdev_device *mdev);
> +
> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
> +
> +#endif /* MDEV_H */
> 

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] Mediated device Core driver
@ 2016-06-21  7:38     ` Jike Song
  0 siblings, 0 replies; 51+ messages in thread
From: Jike Song @ 2016-06-21  7:38 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, shuai.ruan, zhiyuan.lv, bjsdjshi

On 06/21/2016 12:31 AM, Kirti Wankhede wrote:
> +
> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}
> +
> +static struct mdev_device *find_mdev_device(struct parent_device *parent,
> +					    uuid_le uuid, int instance)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +
> +	list_for_each_entry(p, &parent->mdev_list, next) {
> +		if ((uuid_le_cmp(p->uuid, uuid) == 0) &&
> +		    (p->instance == instance)) {
> +			mdev = p;
> +			break;
> +		}
> +	}
> +	return mdev;
> +}

Hi Kirti, Neo,

Thanks for the new version!

We just found that find_mdev_device() is necessary for the pdev driver to
locate an mdev by VM identity and mdev instance. For example, a caller of
your exported vfio-iommu API, vfio_pin_pages(), must have something to
identify which address space it wants, and that is a subfield of the mdev.

Would you mind exporting it? Or is there another way to do this?

--
Thanks,
Jike


> +
> +/* Should be called holding parent_devices.list_lock */
> +static struct parent_device *find_parent_device(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	WARN_ON(!mutex_is_locked(&parent_devices.list_lock));
> +	list_for_each_entry(p, &parent_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			parent = p;
> +			break;
> +		}
> +	}
> +	return parent;
> +}
> +
> +static void mdev_release_parent(struct kref *kref)
> +{
> +	struct parent_device *parent = container_of(kref, struct parent_device,
> +						    ref);
> +	kfree(parent);
> +}
> +
> +static
> +inline struct parent_device *mdev_get_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_get(&parent->ref);
> +
> +	return parent;
> +}
> +
> +static inline void mdev_put_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_put(&parent->ref, mdev_release_parent);
> +}
> +
> +static struct parent_device *mdev_get_parent_by_dev(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(p, &parent_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			parent = mdev_get_parent(p);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return parent;
> +}
> +
> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->create) {
> +		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
> +					mdev->instance, mdev_params);
> +		if (ret)
> +			goto create_ops_err;
> +	}
> +
> +	ret = mdev_add_attribute_group(&mdev->dev,
> +					parent->ops->mdev_attr_groups);
> +create_ops_err:
> +	mutex_unlock(&parent->ops_lock);
> +	return ret;
> +}
> +
> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	/*
> +	 * If vendor driver doesn't return success that means vendor
> +	 * driver doesn't support hot-unplug
> +	 */
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->destroy) {
> +		ret = parent->ops->destroy(parent->dev, mdev->uuid,
> +					   mdev->instance);
> +		if (ret && !force) {
> +			ret = -EBUSY;
> +			goto destroy_ops_err;
> +		}
> +	}
> +	mdev_remove_attribute_group(&mdev->dev,
> +				    parent->ops->mdev_attr_groups);
> +destroy_ops_err:
> +	mutex_unlock(&parent->ops_lock);
> +
> +	return ret;
> +}
> +
> +static void mdev_release_device(struct kref *kref)
> +{
> +	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
> +	struct parent_device *parent = mdev->parent;
> +
> +	device_unregister(&mdev->dev);
> +	wake_up(&parent->release_done);
> +	mdev_put_parent(parent);
> +}
> +
> +struct mdev_device *mdev_get_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_get(&mdev->ref);
> +
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device);
> +
> +void mdev_put_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_put(&mdev->ref, mdev_release_device);
> +}
> +EXPORT_SYMBOL(mdev_put_device);
> +
> +/*
> + * Find first mediated device from given uuid and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +static struct mdev_device *mdev_get_first_device_by_uuid(uuid_le uuid)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(parent, &parent_devices.dev_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (uuid_le_cmp(p->uuid, uuid) == 0) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return mdev;
> +}
> +
> +/*
> + * Find mediated device from given iommu_group and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(parent, &parent_devices.dev_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (!p->group)
> +				continue;
> +
> +			if (iommu_group_id(p->group) == iommu_group_id(group)) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device_by_group);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +
> +	/* Check for duplicate */
> +	parent = find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +	list_add(&parent->next, &parent_devices.dev_list);
> +	mutex_unlock(&parent_devices.list_lock);
> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +	mutex_init(&parent->ops_lock);
> +	mutex_init(&parent->mdev_list_lock);
> +	INIT_LIST_HEAD(&parent->mdev_list);
> +	init_waitqueue_head(&parent->release_done);
> +
> +	ret = mdev_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_lock(&parent_devices.list_lock);
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_devices.list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_devices.list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a parent device
> + * @dev: device structure representing parent device.
> + *
> + * Remove device from list of registered parent devices. Give a chance to free
> + * existing mediated devices for given device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +	struct mdev_device *mdev, *n;
> +	int ret;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	parent = find_parent_device(dev);
> +
> +	if (!parent) {
> +		mutex_unlock(&parent_devices.list_lock);
> +		return;
> +	}
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	/*
> +	 * Remove parent from the list and remove create and destroy sysfs
> +	 * files so that no new mediated device could be created for this parent
> +	 */
> +	list_del(&parent->next);
> +	mdev_remove_sysfs_files(dev);
> +	mutex_unlock(&parent_devices.list_lock);
> +
> +	mutex_lock(&parent->ops_lock);
> +	mdev_remove_attribute_group(dev,
> +				    parent->ops->dev_attr_groups);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
> +		mdev_device_destroy_ops(mdev, true);
> +		list_del(&mdev->next);
> +		mdev_put_device(mdev);
> +	}
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	do {
> +		ret = wait_event_interruptible_timeout(parent->release_done,
> +				list_empty(&parent->mdev_list), HZ * 10);
> +		if (ret == -ERESTARTSYS) {
> +			dev_warn(dev, "Mediated devices are in use, task \"%s\" (%d) blocked until all are released",
> +				 current->comm, task_pid_nr(current));
> +		}
> +	} while (ret <= 0);
> +
> +	mdev_put_parent(parent);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev-sysfs
> + */
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev);
> +}
> +
> +int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +		       char *mdev_params)
> +{
> +	int ret;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	/* Check for duplicate */
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (mdev) {
> +		ret = -EEXIST;
> +		goto create_err;
> +	}
> +
> +	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> +	if (!mdev) {
> +		ret = -ENOMEM;
> +		goto create_err;
> +	}
> +
> +	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> +	mdev->instance = instance;
> +	mdev->parent = parent;
> +	mutex_init(&mdev->ops_lock);
> +	kref_init(&mdev->ref);
> +
> +	mdev->dev.parent  = dev;
> +	mdev->dev.bus     = &mdev_bus_type;
> +	mdev->dev.release = mdev_device_release;
> +	dev_set_name(&mdev->dev, "%pUb-%d", uuid.b, instance);
> +
> +	ret = device_register(&mdev->dev);
> +	if (ret) {
> +		put_device(&mdev->dev);
> +		goto create_err;
> +	}
> +
> +	ret = mdev_device_create_ops(mdev, mdev_params);
> +	if (ret)
> +		goto create_failed;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_add(&mdev->next, &parent->mdev_list);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	dev_dbg(&mdev->dev, "MDEV: created\n");
> +
> +	return ret;
> +
> +create_failed:
> +	device_unregister(&mdev->dev);
> +
> +create_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance)
> +{
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	int ret;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}
> +
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (!mdev) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}
> +
> +	ret = mdev_device_destroy_ops(mdev, false);
> +	if (ret)
> +		goto destroy_err;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_del(&mdev->next);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	mdev_put_device(mdev);
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +destroy_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +void mdev_device_supported_config(struct device *dev, char *str)
> +{
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +
> +	if (parent) {
> +		mutex_lock(&parent->ops_lock);
> +		if (parent->ops->supported_config)
> +			parent->ops->supported_config(parent->dev, str);
> +		mutex_unlock(&parent->ops_lock);
> +		mdev_put_parent(parent);
> +	}
> +}
> +
> +int mdev_device_start(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->start)
> +		ret = parent->ops->start(mdev->uuid);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	if (ret)
> +		pr_err("mdev_start failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
> +
> +	mdev_put_device(mdev);
> +
> +	return ret;
> +}
> +
> +int mdev_device_shutdown(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->shutdown)
> +		ret = parent->ops->shutdown(mdev->uuid);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	if (ret)
> +		pr_err("mdev_shutdown failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +}
> +
> +static struct class mdev_class = {
> +	.name		= MDEV_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= mdev_class_attrs,
> +};
> +
> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	mutex_init(&parent_devices.list_lock);
> +	INIT_LIST_HEAD(&parent_devices.dev_list);
> +
> +	ret = class_register(&mdev_class);
> +	if (ret) {
> +		pr_err("Failed to register mdev class\n");
> +		return ret;
> +	}
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		class_unregister(&mdev_class);
> +		return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	mdev_bus_unregister();
> +	class_unregister(&mdev_class);
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
> new file mode 100644
> index 000000000000..f1aed541111d
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_driver.c
> @@ -0,0 +1,138 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdev_attach_iommu(struct mdev_device *mdev)
> +{
> +	int ret;
> +	struct iommu_group *group;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	ret = iommu_group_add_device(group, &mdev->dev);
> +	if (ret) {
> +		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdev->group = group;
> +
> +	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return ret;
> +}
> +
> +static void mdev_detach_iommu(struct mdev_device *mdev)
> +{
> +	iommu_group_remove_device(&mdev->dev);
> +	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdev_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	ret = mdev_attach_iommu(mdev);
> +	if (ret) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return ret;
> +	}
> +
> +	if (drv && drv->probe)
> +		ret = drv->probe(dev);
> +
> +	return ret;
> +}
> +
> +static int mdev_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdev_detach_iommu(mdev);
> +
> +	return 0;
> +}
> +
> +static int mdev_match(struct device *dev, struct device_driver *drv)
> +{
> +	struct mdev_driver *mdrv = to_mdev_driver(drv);
> +
> +	if (mdrv && mdrv->match)
> +		return mdrv->match(dev);
> +
> +	return 0;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.match		= mdev_match,
> +	.probe		= mdev_probe,
> +	.remove		= mdev_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/*
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: module owner of driver to be registered
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/*
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..991d7f796169
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,33 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +			char *mdev_params);
> +int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
> +void mdev_device_supported_config(struct device *dev, char *str);
> +int  mdev_device_start(uuid_le uuid);
> +int  mdev_device_shutdown(uuid_le uuid);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> new file mode 100644
> index 000000000000..48b66e40009e
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -0,0 +1,300 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +/* Static functions */
> +
> +#define UUID_CHAR_LENGTH	36
> +#define UUID_BYTE_LENGTH	16
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
> +
> +static inline bool is_uuid_sep(char sep)
> +{
> +	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
> +		return true;
> +	return false;
> +}
> +
> +static int uuid_parse(const char *str, uuid_le *uuid)
> +{
> +	int i;
> +
> +	if (strlen(str) < UUID_CHAR_LENGTH)
> +		return -EINVAL;
> +
> +	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
> +		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
> +			pr_err("%s: parse error\n", __func__);
> +			return -EINVAL;
> +		}
> +
> +		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
> +		str += 2;
> +		if (is_uuid_sep(*str))
> +			str++;
> +	}
> +
> +	return 0;
> +}
> +
> +/* mdev sysfs Functions */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str, *ptr;
> +	ssize_t n;
> +
> +	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	ptr = str;
> +	mdev_device_supported_config(dev, str);
> +
> +	n = sprintf(buf, "%s\n", str);
> +	kfree(ptr);
> +
> +	return n;
> +}
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *instance_str, *mdev_params = NULL, *params = NULL;
> +	uuid_le uuid;
> +	uint32_t instance;
> +	int ret;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_create: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev instance not present %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	instance_str = strsep(&str, ":");
> +	if (!instance_str) {
> +		pr_err("mdev_create: Empty instance string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	ret = kstrtouint(instance_str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	if (str)
> +		params = mdev_params = kstrdup(str, GFP_KERNEL);
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_create: UUID parse error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	ret = mdev_device_create(dev, uuid, instance, mdev_params);
> +	if (ret)
> +		pr_err("mdev_create: Failed to create mdev device\n");
> +	else
> +		ret = count;
> +
> +create_error:
> +	kfree(params);
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	char *uuid_str, *str, *pstr;
> +	uuid_le uuid;
> +	unsigned int instance;
> +	int ret;
> +
> +	str = pstr = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	if (str == NULL) {
> +		pr_err("mdev_destroy: instance not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = kstrtouint(str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_destroy: instance parsing error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_destroy: UUID parse error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = mdev_device_destroy(dev, uuid, instance);
> +	if (ret == 0)
> +		ret = count;
> +
> +destroy_error:
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
> +			 const char *buf, size_t count)
> +{
> +	char *uuid_str, *ptr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_start: UUID parse error %s\n", buf);
> +		goto start_error;
> +	}
> +
> +	ret = mdev_device_start(uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +start_error:
> +	kfree(ptr);
> +	return ret;
> +}
> +
> +ssize_t mdev_shutdown_store(struct class *class, struct class_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *uuid_str, *ptr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_shutdown: UUID parse error %s\n", buf);
> +		goto shutdown_error;
> +	}
> +
> +	ret = mdev_device_shutdown(uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +shutdown_error:
> +	kfree(ptr);
> +	return ret;
> +
> +}
> +
> +struct class_attribute mdev_class_attrs[] = {
> +	__ATTR_WO(mdev_start),
> +	__ATTR_WO(mdev_shutdown),
> +	__ATTR_NULL
> +};
> +
> +int mdev_create_sysfs_files(struct device *dev)
> +{
> +	int ret;
> +
> +	ret = sysfs_create_file(&dev->kobj,
> +				&dev_attr_mdev_supported_types.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_supported_types sysfs entry\n");
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_create sysfs entry\n");
> +		goto create_sysfs_failed;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_destroy sysfs entry\n");
> +		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	} else
> +		return ret;
> +
> +create_sysfs_failed:
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	return ret;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +}
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..31b6f8572cfa
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,232 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +/* Common Data structures */
> +
> +struct pci_region_info {
> +	uint64_t start;
> +	uint64_t size;
> +	uint32_t flags;		/* VFIO region info flags */
> +};
> +
> +enum mdev_emul_space {
> +	EMUL_CONFIG_SPACE,	/* PCI configuration space */
> +	EMUL_IO,		/* I/O register space */
> +	EMUL_MMIO		/* Memory-mapped I/O space */
> +};
> +
> +struct parent_device;
> +
> +/*
> + * Mediated device
> + */
> +
> +struct mdev_device {
> +	struct device		dev;
> +	struct parent_device	*parent;
> +	struct iommu_group	*group;
> +	void			*iommu_data;
> +	uuid_le			uuid;
> +	uint32_t		instance;
> +
> +	/* internal only */
> +	struct kref		ref;
> +	struct mutex		ops_lock;
> +	struct list_head	next;
> +};
> +
> +
> +/**
> + * struct parent_ops - Structure to be registered for each parent device to
> + * register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Default attributes of the parent device.
> + * @mdev_attr_groups:	Default attributes of the mediated device.
> + * @supported_config:	Called to get information about supported types.
> + *			@dev: device structure of the parent device.
> + *			@config: should return a string listing supported
> + *			configurations.
> + *			Returns integer: success (0) or error (< 0)
> + * @create:		Called to allocate basic resources in parent device's
> + *			driver for a particular mediated device
> + *			@dev: parent device structure on which mediated device
> + *			      should be created
> + *			@uuid: UUID of the VM for which the mediated device
> + *			       is intended
> + *			@instance: mediated device instance in that VM
> + *			@mdev_params: extra parameters required by the parent
> + *			device's driver.
> + *			Returns integer: success (0) or error (< 0)
> + * @destroy:		Called to free resources in the parent device's driver
> + *			for a mediated device instance of that VM.
> + *			@dev: parent device structure to which this mediated
> + *			      device points.
> + *			@uuid: UUID of the VM to which the mediated device
> + *			       belongs
> + *			@instance: mdev instance in that VM
> + *			Returns integer: success (0) or error (< 0)
> + *			If destroy() is called while the VM is running, the
> + *			mdev is being hot-unplugged. Return an error if the VM
> + *			is running and the driver doesn't support mediated
> + *			device hotplug.
> + * @start:		Called to initiate the mediated device initialization
> + *			process in the parent device's driver when the VM
> + *			boots, before the VMM starts
> + *			@uuid: UUID of the VM which is booting.
> + *			Returns integer: success (0) or error (< 0)
> + * @shutdown:		Called to tear down mediated device related resources
> + *			for the VM
> + *			@uuid: UUID of the VM which is shutting down.
> + *			Returns integer: success (0) or error (< 0)
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number of bytes to read
> + *			@address_space: specifies which address space the
> + *			request is intended for - PCI config space, I/O
> + *			register space or MMIO space.
> + *			@addr: address.
> + *			Returns number of bytes read on success, or an error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number of bytes to be written
> + *			@address_space: specifies which address space the
> + *			request is intended for - PCI config space, I/O
> + *			register space or MMIO space.
> + *			@addr: address.
> + *			Returns number of bytes written on success, or an error.
> + * @set_irqs:		Called to set the interrupt configuration
> + *			information provided by the VMM.
> + *			@mdev: mediated device structure
> + *			@flags, index, start, count and *data : same as that of
> + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> + * @get_region_info:	Called to get VFIO region size and flags of mediated
> + *			device.
> + *			@mdev: mediated device structure
> + *			@region_index: VFIO region index
> + *			@region_info: output, returns size and flags of
> + *				      requested region.
> + *			Returns integer: success (0) or error (< 0)
> + * @validate_map_request: Validate remap pfn request
> + *			@mdev: mediated device structure
> + *			@virtaddr: target user address to start at
> + *			@pfn: parent address of kernel memory, vendor driver
> + *			      can change if required.
> + *			@size: size of map area, vendor driver can change the
> + *			       size of map area if desired.
> + *			@prot: page protection flags for this mapping, vendor
> + *			       driver can change, if required.
> + *			Returns integer: success (0) or error (< 0)
> + *
> + * Parent devices that support mediated devices should be registered with the
> + * mdev module with a parent_ops structure.
> + */
> +
> +struct parent_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +
> +	int	(*supported_config)(struct device *dev, char *config);
> +	int     (*create)(struct device *dev, uuid_le uuid,
> +			  uint32_t instance, char *mdev_params);
> +	int     (*destroy)(struct device *dev, uuid_le uuid,
> +			   uint32_t instance);
> +	int     (*start)(uuid_le uuid);
> +	int     (*shutdown)(uuid_le uuid);
> +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> +			enum mdev_emul_space address_space, loff_t pos);
> +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> +			 enum mdev_emul_space address_space, loff_t pos);
> +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> +			    unsigned int index, unsigned int start,
> +			    unsigned int count, void *data);
> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> +				 struct pci_region_info *region_info);
> +	int	(*validate_map_request)(struct mdev_device *vdev,
> +					unsigned long virtaddr,
> +					unsigned long *pfn, unsigned long *size,
> +					pgprot_t *prot);
> +};
> +
> +/*
> + * Parent Device
> + */
> +struct parent_device {
> +	struct device		*dev;
> +	const struct parent_ops	*ops;
> +
> +	/* internal */
> +	struct kref		ref;
> +	struct mutex		ops_lock;
> +	struct list_head	next;
> +	struct list_head	mdev_list;
> +	struct mutex		mdev_list_lock;
> +	wait_queue_head_t	release_done;
> +};
> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @match: called when new device or driver is added for this bus. Return 1 if
> + *	   given device can be handled by given driver and zero otherwise.
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	int  (*match)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}
> +
> +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
> +{
> +	return dev_get_drvdata(&mdev->dev);
> +}
> +
> +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
> +{
> +	dev_set_drvdata(&mdev->dev, data);
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct parent_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
> +extern void mdev_put_device(struct mdev_device *mdev);
> +
> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
> +
> +#endif /* MDEV_H */
> 

* Re: [PATCH 1/3] Mediated device Core driver
  2016-06-20 16:31   ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-21 21:30     ` Alex Williamson
  -1 siblings, 0 replies; 51+ messages in thread
From: Alex Williamson @ 2016-06-21 21:30 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: shuai.ruan, kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel,
	pbonzini, bjsdjshi, zhiyuan.lv

On Mon, 20 Jun 2016 22:01:46 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Design for Mediated Device Driver:
> Main purpose of this driver is to provide a common interface for mediated
> device management that can be used by different drivers of different
> devices.
> 
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
> 
> Below is the high-level block diagram, with NVIDIA, Intel and IBM devices
> as example, since these are the devices which are going to actively use
> this module as of now.
> 
>  +---------------+
>  |               |
>  | +-----------+ |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |  mdev     | +------------------------>+              |<-> VFIO user
>  | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
>  | |  driver   | |                         |              |
>  | |           | |                         +--------------+
>  | |           | |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |           | +------------------------>+              |<-> VFIO user
>  | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
>  |               |                         |              |
>  |  MDEV CORE    |                         +--------------+
>  |   MODULE      |
>  |   mdev.ko     |
>  | +-----------+ |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         |  nvidia.ko   |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | | Physical  | |
>  | |  device   | |  mdev_register_device() +--------------+
>  | | interface | |<------------------------+              |
>  | |           | |                         |  i915.ko     |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | |           | |
>  | |           | |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         | ccw_device.ko|<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | +-----------+ |
>  +---------------+
> 
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
> 
> /**
>   * struct mdev_driver - Mediated device's driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove: called when device removed
>   * @match: called when new device or driver is added for this bus.
> 	    Return 1 if given device can be handled by given driver and
> 	    zero otherwise.
>   * @driver:device driver structure
>   *
>   **/
> struct mdev_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
> 	 int  (*match)(struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
> 
> Mediated device drivers should use this interface to register with the
> core driver. With this, a mediated device driver is responsible for
> adding its mediated devices to the VFIO group.
> 
> 2. Physical device driver interface
> This interface provides vendor driver the set APIs to manage physical
> device related work in their own driver. APIs are :
> - supported_config: provide supported configuration list by the vendor
> 		    driver
> - create: to allocate basic resources in vendor driver for a mediated
> 	  device.
> - destroy: to free resources in vendor driver when mediated device is
> 	   destroyed.
> - start: to initiate mediated device initialization process from vendor
> 	 driver when VM boots and before QEMU starts.
> - shutdown: to teardown mediated device resources during VM teardown.
> - read : read emulation callback.
> - write: write emulation callback.
> - set_irqs: send interrupt configuration information that QEMU sets.
> - get_region_info: to provide region size and its flags for the mediated
> 		   device.
> - validate_map_request: to validate remap pfn request.
> 
> This registration interface should be used by vendor drivers to register
> each physical device to mdev core driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I73a5084574270b14541c529461ea2f03c292d510
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/Makefile            |   1 +
>  drivers/vfio/mdev/Kconfig        |  11 +
>  drivers/vfio/mdev/Makefile       |   5 +
>  drivers/vfio/mdev/mdev_core.c    | 595 +++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev_driver.c  | 138 +++++++++
>  drivers/vfio/mdev/mdev_private.h |  33 +++
>  drivers/vfio/mdev/mdev_sysfs.c   | 300 ++++++++++++++++++++
>  include/linux/mdev.h             | 232 +++++++++++++++
>  9 files changed, 1316 insertions(+)
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev_core.c
>  create mode 100644 drivers/vfio/mdev/mdev_driver.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>  create mode 100644 include/linux/mdev.h
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
>  
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>  source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..7c70753e54ab 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..951e2bb06a3f
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,11 @@
> +
> +config MDEV
> +    tristate "Mediated device driver framework"
> +    depends on VFIO
> +    default n
> +    help
> +        MDEV provides a framework to virtualize devices without the SR-IOV
> +        capability.
> +        See Documentation/mdev.txt for more details.

Documentation pointer still doesn't exist.  Perhaps this file would be
a more appropriate place than the commit log for some of the
information above.

Every time I review this I'm struggling to figure out why this isn't
VFIO_MDEV since it's really tied to vfio and difficult to evaluate it
as some sort of standalone mediated device interface.  I don't know
the answer, but it always strikes me as a discontinuity.

> +
> +        If you don't know what to do here, say N.
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..2c6d11f7bc24
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,5 @@
> +
> +mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
> +
> +obj-$(CONFIG_MDEV) += mdev.o
> +
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> new file mode 100644
> index 000000000000..3c45ed2ae1e9
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -0,0 +1,595 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/cdev.h>
> +#include <linux/sched.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +#define MDEV_CLASS_NAME		"mdev"
> +
> +static struct devices_list {
> +	struct list_head    dev_list;
> +	struct mutex        list_lock;
> +} parent_devices;
> +

I imagine this is following the example of struct vfio in vfio.c but
for this usage the following seems much easier:

static LIST_HEAD(parent_list);
static DEFINE_MUTEX(parent_list_lock);

Then you can also remove the initialization from mdev_init().

> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}
> +
> +static struct mdev_device *find_mdev_device(struct parent_device *parent,
> +					    uuid_le uuid, int instance)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +
> +	list_for_each_entry(p, &parent->mdev_list, next) {
> +		if ((uuid_le_cmp(p->uuid, uuid) == 0) &&
> +		    (p->instance == instance)) {
> +			mdev = p;

Locking here is still broken, the callers are create and destroy, which
can still race each other and themselves.

> +			break;
> +		}
> +	}
> +	return mdev;
> +}
> +
> +/* Should be called holding parent_devices.list_lock */
> +static struct parent_device *find_parent_device(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	WARN_ON(!mutex_is_locked(&parent_devices.list_lock));
> +	list_for_each_entry(p, &parent_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			parent = p;
> +			break;
> +		}
> +	}
> +	return parent;
> +}
> +
> +static void mdev_release_parent(struct kref *kref)
> +{
> +	struct parent_device *parent = container_of(kref, struct parent_device,
> +						    ref);
> +	kfree(parent);
> +}
> +
> +static
> +inline struct parent_device *mdev_get_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_get(&parent->ref);
> +
> +	return parent;
> +}
> +
> +static inline void mdev_put_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_put(&parent->ref, mdev_release_parent);
> +}
> +
> +static struct parent_device *mdev_get_parent_by_dev(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(p, &parent_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			parent = mdev_get_parent(p);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return parent;
> +}
> +
> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->create) {

How would a parent_device without ops->create or ops->destroy be useful?
Perhaps mdev_register_driver() should enforce required ops.  mdev.h
should at least document which ops are optional if they really are
optional.

> +		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
> +					mdev->instance, mdev_params);
> +		if (ret)
> +			goto create_ops_err;
> +	}
> +
> +	ret = mdev_add_attribute_group(&mdev->dev,
> +					parent->ops->mdev_attr_groups);

An error here seems to put us in a bad place, the device is created but
the attributes are broken, is it the caller's responsibility to
destroy?  Seems like we need a cleanup if this fails.

> +create_ops_err:
> +	mutex_unlock(&parent->ops_lock);

It seems like ops_lock isn't used so much as a lock as a serialization
mechanism.  Why?  Where is this serialization per parent device
documented?

> +	return ret;
> +}
> +
> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	/*
> +	 * If vendor driver doesn't return success that means vendor
> +	 * driver doesn't support hot-unplug
> +	 */
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->destroy) {
> +		ret = parent->ops->destroy(parent->dev, mdev->uuid,
> +					   mdev->instance);
> +		if (ret && !force) {

It seems this is not so much a 'force' but an ignore errors, we never
actually force the mdev driver to destroy the device... which makes me
wonder if there are leaks there.

> +			ret = -EBUSY;
> +			goto destroy_ops_err;
> +		}
> +	}
> +	mdev_remove_attribute_group(&mdev->dev,
> +				    parent->ops->mdev_attr_groups);
> +destroy_ops_err:
> +	mutex_unlock(&parent->ops_lock);
> +
> +	return ret;
> +}
> +
> +static void mdev_release_device(struct kref *kref)
> +{
> +	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
> +	struct parent_device *parent = mdev->parent;
> +
> +	device_unregister(&mdev->dev);
> +	wake_up(&parent->release_done);
> +	mdev_put_parent(parent);
> +}
> +
> +struct mdev_device *mdev_get_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_get(&mdev->ref);
> +
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device);
> +
> +void mdev_put_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_put(&mdev->ref, mdev_release_device);
> +}
> +EXPORT_SYMBOL(mdev_put_device);
> +
> +/*
> + * Find first mediated device from given uuid and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +static struct mdev_device *mdev_get_first_device_by_uuid(uuid_le uuid)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(parent, &parent_devices.dev_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);

This lock ordering is something we'll need to keep in mind.

> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (uuid_le_cmp(p->uuid, uuid) == 0) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return mdev;
> +}
> +
> +/*
> + * Find mediated device from given iommu_group and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(parent, &parent_devices.dev_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (!p->group)
> +				continue;
> +
> +			if (iommu_group_id(p->group) == iommu_group_id(group)) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device_by_group);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +
> +	/* Check for duplicate */
> +	parent = find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +	list_add(&parent->next, &parent_devices.dev_list);
> +	mutex_unlock(&parent_devices.list_lock);

find_parent_device() matches based on parent->dev, but we're dropping
the list lock before we set up parent->dev.  There are other ways to
shorten the time this lock is held, but releasing it with an incomplete
entry on the list is not the way I would choose.

> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +	mutex_init(&parent->ops_lock);
> +	mutex_init(&parent->mdev_list_lock);
> +	INIT_LIST_HEAD(&parent->mdev_list);
> +	init_waitqueue_head(&parent->release_done);
> +
> +	ret = mdev_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_lock(&parent_devices.list_lock);
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_devices.list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_devices.list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a parent device
> + * @dev: device structure representing parent device.
> + *
> + * Remove device from list of registered parent devices. Give a chance to free
> + * existing mediated devices for given device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +	struct mdev_device *mdev, *n;
> +	int ret;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	parent = find_parent_device(dev);
> +
> +	if (!parent) {
> +		mutex_unlock(&parent_devices.list_lock);
> +		return;
> +	}
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	/*
> +	 * Remove parent from the list and remove create and destroy sysfs
> +	 * files so that no new mediated device could be created for this parent
> +	 */
> +	list_del(&parent->next);
> +	mdev_remove_sysfs_files(dev);
> +	mutex_unlock(&parent_devices.list_lock);
> +
> +	mutex_lock(&parent->ops_lock);
> +	mdev_remove_attribute_group(dev,
> +				    parent->ops->dev_attr_groups);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
> +		mdev_device_destroy_ops(mdev, true);
> +		list_del(&mdev->next);
> +		mdev_put_device(mdev);
> +	}
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	do {
> +		ret = wait_event_interruptible_timeout(parent->release_done,
> +				list_empty(&parent->mdev_list), HZ * 10);

But we do a list_del for each mdev in mdev_list above, so how could the
list not be empty here?  I think you're trying to wait for all the mdev
devices to be released, but I don't think this does that.  Isn't the
list empty regardless?

> +		if (ret == -ERESTARTSYS) {
> +			dev_warn(dev, "Mediated devices are in use, task"
> +				      " \"%s\" (%d) "
> +				      "blocked until all are released",
> +				      current->comm, task_pid_nr(current));
> +		}
> +	} while (ret <= 0);
> +
> +	mdev_put_parent(parent);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev-sysfs
> + */
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev);
> +}
> +
> +int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +		       char *mdev_params)
> +{
> +	int ret;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	/* Check for duplicate */
> +	mdev = find_mdev_device(parent, uuid, instance);

But this doesn't actually prevent duplicates because we're not holding
any lock to guarantee that another racing process doesn't create the
same {uuid, instance} pair between where we check and the list_add
below.

> +	if (mdev) {
> +		ret = -EEXIST;
> +		goto create_err;
> +	}
> +
> +	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> +	if (!mdev) {
> +		ret = -ENOMEM;
> +		goto create_err;
> +	}
> +
> +	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> +	mdev->instance = instance;
> +	mdev->parent = parent;
> +	mutex_init(&mdev->ops_lock);
> +	kref_init(&mdev->ref);
> +
> +	mdev->dev.parent  = dev;
> +	mdev->dev.bus     = &mdev_bus_type;
> +	mdev->dev.release = mdev_device_release;
> +	dev_set_name(&mdev->dev, "%pUb-%d", uuid.b, instance);
> +
> +	ret = device_register(&mdev->dev);
> +	if (ret) {
> +		put_device(&mdev->dev);
> +		goto create_err;
> +	}
> +
> +	ret = mdev_device_create_ops(mdev, mdev_params);
> +	if (ret)
> +		goto create_failed;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_add(&mdev->next, &parent->mdev_list);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	dev_dbg(&mdev->dev, "MDEV: created\n");
> +
> +	return ret;
> +
> +create_failed:
> +	device_unregister(&mdev->dev);
> +
> +create_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance)
> +{
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	int ret;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}
> +
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (!mdev) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}

Likewise, without locking multiple callers can get here with the same
mdev.

> +
> +	ret = mdev_device_destroy_ops(mdev, false);
> +	if (ret)
> +		goto destroy_err;
> +
> +	mdev_put_parent(parent);
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_del(&mdev->next);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +
> +destroy_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +void mdev_device_supported_config(struct device *dev, char *str)
> +{
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +
> +	if (parent) {
> +		mutex_lock(&parent->ops_lock);
> +		if (parent->ops->supported_config)
> +			parent->ops->supported_config(parent->dev, str);
> +		mutex_unlock(&parent->ops_lock);
> +		mdev_put_parent(parent);
> +	}
> +}
> +
> +int mdev_device_start(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->start)
> +		ret = parent->ops->start(mdev->uuid);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	if (ret)
> +		pr_err("mdev_start failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
> +
> +	mdev_put_device(mdev);
> +
> +	return ret;
> +}
> +
> +int mdev_device_shutdown(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->shutdown)
> +		ret = parent->ops->shutdown(mdev->uuid);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	if (ret)
> +		pr_err("mdev_shutdown failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +}
> +
> +static struct class mdev_class = {
> +	.name		= MDEV_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= mdev_class_attrs,
> +};
> +
> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	mutex_init(&parent_devices.list_lock);
> +	INIT_LIST_HEAD(&parent_devices.dev_list);
> +
> +	ret = class_register(&mdev_class);
> +	if (ret) {
> +		pr_err("Failed to register mdev class\n");
> +		return ret;
> +	}
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		class_unregister(&mdev_class);
> +		return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	mdev_bus_unregister();
> +	class_unregister(&mdev_class);
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
> new file mode 100644
> index 000000000000..f1aed541111d
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_driver.c
> @@ -0,0 +1,138 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdev_attach_iommu(struct mdev_device *mdev)
> +{
> +	int ret;
> +	struct iommu_group *group;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	ret = iommu_group_add_device(group, &mdev->dev);
> +	if (ret) {
> +		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdev->group = group;
> +
> +	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return ret;
> +}
> +
> +static void mdev_detach_iommu(struct mdev_device *mdev)
> +{
> +	iommu_group_remove_device(&mdev->dev);

mdev->group = NULL; seems prudent

> +	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdev_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	ret = mdev_attach_iommu(mdev);
> +	if (ret) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return ret;
> +	}
> +
> +	if (drv && drv->probe)
> +		ret = drv->probe(dev);
> +
> +	return ret;
> +}
> +
> +static int mdev_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdev_detach_iommu(mdev);
> +
> +	return 0;
> +}
> +
> +static int mdev_match(struct device *dev, struct device_driver *drv)
> +{
> +	struct mdev_driver *mdrv = to_mdev_driver(drv);

nit, drv above, mdrv here

> +
> +	if (mdrv && mdrv->match)
> +		return mdrv->match(dev);
> +
> +	return 0;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.match		= mdev_match,
> +	.probe		= mdev_probe,
> +	.remove		= mdev_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/*
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: module owner of driver to be registered
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/*
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..991d7f796169
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,33 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +			char *mdev_params);
> +int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
> +void mdev_device_supported_config(struct device *dev, char *str);
> +int  mdev_device_start(uuid_le uuid);
> +int  mdev_device_shutdown(uuid_le uuid);

nit: stop is to start as shutdown is to startup.  IOW, should this be
mdev_device_stop()?

> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> new file mode 100644
> index 000000000000..48b66e40009e
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -0,0 +1,300 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +/* Static functions */
> +
> +#define UUID_CHAR_LENGTH	36
> +#define UUID_BYTE_LENGTH	16
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
> +
> +static inline bool is_uuid_sep(char sep)
> +{
> +	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
> +		return true;
> +	return false;
> +}
> +
> +static int uuid_parse(const char *str, uuid_le *uuid)
> +{
> +	int i;
> +
> +	if (strlen(str) < UUID_CHAR_LENGTH)
> +		return -EINVAL;
> +
> +	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
> +		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
> +			pr_err("%s err", __func__);
> +			return -EINVAL;
> +		}
> +
> +		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
> +		str += 2;
> +		if (is_uuid_sep(*str))
> +			str++;
> +	}
> +
> +	return 0;
> +}
> +
> +/* mdev sysfs Functions */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str, *ptr;
> +	ssize_t n;
> +
> +	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	ptr = str;
> +	mdev_device_supported_config(dev, str);
> +
> +	n = sprintf(buf, "%s\n", str);
> +	kfree(ptr);
> +
> +	return n;
> +}
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *instance_str, *mdev_params = NULL, *params = NULL;
> +	uuid_le uuid;
> +	uint32_t instance;
> +	int ret;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_create: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev instance not present %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	instance_str = strsep(&str, ":");
> +	if (!instance_str) {
> +		pr_err("mdev_create: Empty instance string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	ret = kstrtouint(instance_str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	if (str)
> +		params = mdev_params = kstrdup(str, GFP_KERNEL);
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_create: UUID parse error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	ret = mdev_device_create(dev, uuid, instance, mdev_params);
> +	if (ret)
> +		pr_err("mdev_create: Failed to create mdev device\n");
> +	else
> +		ret = count;
> +
> +create_error:
> +	kfree(params);
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	char *uuid_str, *str, *pstr;
> +	uuid_le uuid;
> +	unsigned int instance;
> +	int ret;
> +
> +	str = pstr = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	if (str == NULL) {
> +		pr_err("mdev_destroy: instance not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = kstrtouint(str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_destroy: instance parsing error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_destroy: UUID parse error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = mdev_device_destroy(dev, uuid, instance);
> +	if (ret == 0)
> +		ret = count;
> +
> +destroy_error:
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
> +			 const char *buf, size_t count)
> +{
> +	char *uuid_str, *ptr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_start: UUID parse error %s\n", buf);
> +		goto start_error;
> +	}
> +
> +	ret = mdev_device_start(uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +start_error:
> +	kfree(ptr);
> +	return ret;
> +}
> +
> +ssize_t mdev_shutdown_store(struct class *class, struct class_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *uuid_str, *ptr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_shutdown: UUID parse error %s\n", buf);
> +		goto shutdown_error;
> +	}
> +
> +	ret = mdev_device_shutdown(uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +shutdown_error:
> +	kfree(ptr);
> +	return ret;
> +
> +}
> +
> +struct class_attribute mdev_class_attrs[] = {
> +	__ATTR_WO(mdev_start),
> +	__ATTR_WO(mdev_shutdown),
> +	__ATTR_NULL
> +};
> +
> +int mdev_create_sysfs_files(struct device *dev)
> +{
> +	int ret;
> +
> +	ret = sysfs_create_file(&dev->kobj,
> +				&dev_attr_mdev_supported_types.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_supported_types sysfs entry\n");
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_create sysfs entry\n");
> +		goto create_sysfs_failed;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_destroy sysfs entry\n");
> +		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	} else
> +		return ret;
> +
> +create_sysfs_failed:
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	return ret;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +}
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..31b6f8572cfa
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,232 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +/* Common Data structures */
> +
> +struct pci_region_info {
> +	uint64_t start;
> +	uint64_t size;
> +	uint32_t flags;		/* VFIO region info flags */
> +};
> +
> +enum mdev_emul_space {
> +	EMUL_CONFIG_SPACE,	/* PCI configuration space */
> +	EMUL_IO,		/* I/O register space */
> +	EMUL_MMIO		/* Memory-mapped I/O space */
> +};


I'm still confused why this is needed, perhaps a description here would
be useful so I can stop asking.  Clearly config space is PCI only, so
it's strange to have it in the common code.  Everyone not on x86 will
say I/O space is also strange.  I can't keep it in my head why the
read/write offsets aren't sufficient for the driver to figure out what
type it is.


> +
> +struct parent_device;
> +
> +/*
> + * Mediated device
> + */
> +
> +struct mdev_device {
> +	struct device		dev;
> +	struct parent_device	*parent;
> +	struct iommu_group	*group;
> +	void			*iommu_data;
> +	uuid_le			uuid;
> +	uint32_t		instance;
> +
> +	/* internal only */
> +	struct kref		ref;
> +	struct mutex		ops_lock;
> +	struct list_head	next;
> +};
> +
> +
> +/**
> + * struct parent_ops - Structure to be registered for each parent device to
> + * register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Default attributes of the parent device.
> + * @mdev_attr_groups:	Default attributes of the mediated device.
> + * @supported_config:	Called to get information about supported types.
> + *			@dev : device structure of parent device.
> + *			@config: should return string listing supported config
> + *			Returns integer: success (0) or error (< 0)
> + * @create:		Called to allocate basic resources in parent device's
> + *			driver for a particular mediated device
> + *			@dev: parent device structure on which mediated device
> + *			      should be created
> + *			@uuid: UUID of the VM for which the mediated
> + *			device is intended
> + *			@instance: mediated instance in that VM
> + *			@mdev_params: extra parameters required by parent
> + *			device's driver.
> + *			Returns integer: success (0) or error (< 0)
> + * @destroy:		Called to free resources in parent device's driver
> + *			for a mediated device instance of that VM.
> + *			@dev: parent device structure to which this mediated
> + *			      device points to.
> + *			@uuid: VM's uuid to which the mediated device belongs
> + *			@instance: mdev instance in that VM
> + *			Returns integer: success (0) or error (< 0)
> + *			If VM is running and destroy() is called that means the
> + *			mdev is being hot-unplugged. Return error if VM is running
> + *			and driver doesn't support mediated device hotplug.
> + * @start:		Called to initiate mediated device initialization
> + *			process in parent device's driver when VM boots before
> + *			VMM starts
> + *			@uuid: VM's UUID which is booting.
> + *			Returns integer: success (0) or error (< 0)
> + * @shutdown:		Called to teardown mediated device related resources for
> + *			the VM
> + *			@uuid: VM's UUID which is shutting down.
> + *			Returns integer: success (0) or error (< 0)
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number of bytes to read
> + *			@address_space: specifies for which address space the
> + *			request is intended for - pci_config_space, IO register
> + *			space or MMIO space.
> + *			@addr: address.
> + *			Returns number of bytes read on success or error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number of bytes to be written
> + *			@address_space: specifies for which address space the
> + *			request is intended for - pci_config_space, IO register
> + *			space or MMIO space.
> + *			@addr: address.
> + *			Returns number of bytes written on success or error.
> + * @set_irqs:		Called to convey the interrupt configuration
> + *			information that the VMM sets.
> + *			@mdev: mediated device structure
> + *			@flags, index, start, count and *data : same as that of
> + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> + * @get_region_info:	Called to get VFIO region size and flags of mediated
> + *			device.
> + *			@mdev: mediated device structure
> + *			@region_index: VFIO region index
> + *			@region_info: output, returns size and flags of
> + *				      requested region.
> + *			Returns integer: success (0) or error (< 0)
> + * @validate_map_request: Validate remap pfn request
> + *			@mdev: mediated device structure
> + *			@virtaddr: target user address to start at
> + *			@pfn: parent address of kernel memory, vendor driver
> + *			      can change if required.
> + *			@size: size of map area, vendor driver can change the
> + *			       size of map area if desired.
> + *			@prot: page protection flags for this mapping, vendor
> + *			       driver can change, if required.
> + *			Returns integer: success (0) or error (< 0)
> + *
> + * A parent device that supports mediated devices should be registered
> + * with the mdev module along with its parent_ops structure.
> + */
> +
> +struct parent_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +
> +	int	(*supported_config)(struct device *dev, char *config);
> +	int     (*create)(struct device *dev, uuid_le uuid,
> +			  uint32_t instance, char *mdev_params);
> +	int     (*destroy)(struct device *dev, uuid_le uuid,
> +			   uint32_t instance);
> +	int     (*start)(uuid_le uuid);
> +	int     (*shutdown)(uuid_le uuid);
> +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> +			enum mdev_emul_space address_space, loff_t pos);
> +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> +			 enum mdev_emul_space address_space, loff_t pos);
> +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> +			    unsigned int index, unsigned int start,
> +			    unsigned int count, void *data);
> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> +				 struct pci_region_info *region_info);

This can't be //pci_//region_info.  How do you intend to support things
like sparse mmap capabilities in the user REGION_INFO ioctl when such
things are not part of the mediated device API?  Seems like the driver
should just return a buffer.

> +	int	(*validate_map_request)(struct mdev_device *vdev,
> +					unsigned long virtaddr,
> +					unsigned long *pfn, unsigned long *size,
> +					pgprot_t *prot);
> +};
> +
> +/*
> + * Parent Device
> + */
> +struct parent_device {
> +	struct device		*dev;
> +	const struct parent_ops	*ops;
> +
> +	/* internal */
> +	struct kref		ref;
> +	struct mutex		ops_lock;
> +	struct list_head	next;
> +	struct list_head	mdev_list;
> +	struct mutex		mdev_list_lock;
> +	wait_queue_head_t	release_done;
> +};
> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @match: called when new device or driver is added for this bus. Return 1 if
> + *	   given device can be handled by given driver and zero otherwise.
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	int  (*match)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}
> +
> +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
> +{
> +	return dev_get_drvdata(&mdev->dev);
> +}
> +
> +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
> +{
> +	dev_set_drvdata(&mdev->dev, data);
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct parent_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
> +extern void mdev_put_device(struct mdev_device *mdev);
> +
> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
> +
> +#endif /* MDEV_H */


* Re: [Qemu-devel] [PATCH 1/3] Mediated device Core driver
@ 2016-06-21 21:30     ` Alex Williamson
  0 siblings, 0 replies; 51+ messages in thread
From: Alex Williamson @ 2016-06-21 21:30 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

On Mon, 20 Jun 2016 22:01:46 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Design for Mediated Device Driver:
> Main purpose of this driver is to provide a common interface for mediated
> device management that can be used by different drivers of different
> devices.
> 
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
> 
> Below is the high-level block diagram, with Nvidia, Intel and IBM devices
> as example, since these are the devices which are going to actively use
> this module as of now.
> 
>  +---------------+
>  |               |
>  | +-----------+ |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |  mdev     | +------------------------>+              |<-> VFIO user
>  | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
>  | |  driver   | |                         |              |
>  | |           | |                         +--------------+
>  | |           | |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |           | +------------------------>+              |<-> VFIO user
>  | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
>  |               |                         |              |
>  |  MDEV CORE    |                         +--------------+
>  |   MODULE      |
>  |   mdev.ko     |
>  | +-----------+ |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         |  nvidia.ko   |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | | Physical  | |
>  | |  device   | |  mdev_register_device() +--------------+
>  | | interface | |<------------------------+              |
>  | |           | |                         |  i915.ko     |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | |           | |
>  | |           | |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         | ccw_device.ko|<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | +-----------+ |
>  +---------------+
> 
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
> 
> /**
>   * struct mdev_driver - Mediated device's driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove: called when device removed
>   * @match: called when new device or driver is added for this bus.
> 	    Return 1 if given device can be handled by given driver and
> 	    zero otherwise.
>   * @driver:device driver structure
>   *
>   **/
> struct mdev_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
> 	 int  (*match)(struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
> 
> A mediated device's driver should use this interface to register with the
> core driver. With this, the mediated device driver is responsible for
> adding the mediated device to the VFIO group.
> 
> 2. Physical device driver interface
> This interface provides the vendor driver a set of APIs to manage physical
> device related work in its own driver. The APIs are:
> - supported_config: provide supported configuration list by the vendor
> 		    driver
> - create: to allocate basic resources in vendor driver for a mediated
> 	  device.
> - destroy: to free resources in vendor driver when mediated device is
> 	   destroyed.
> - start: to initiate mediated device initialization process from vendor
> 	 driver when VM boots and before QEMU starts.
> - shutdown: to teardown mediated device resources during VM teardown.
> - read : read emulation callback.
> - write: write emulation callback.
> - set_irqs: send interrupt configuration information that QEMU sets.
> - get_region_info: to provide region size and its flags for the mediated
> 		   device.
> - validate_map_request: to validate remap pfn request.
> 
> This registration interface should be used by vendor drivers to register
> each physical device to mdev core driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I73a5084574270b14541c529461ea2f03c292d510
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/Makefile            |   1 +
>  drivers/vfio/mdev/Kconfig        |  11 +
>  drivers/vfio/mdev/Makefile       |   5 +
>  drivers/vfio/mdev/mdev_core.c    | 595 +++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev_driver.c  | 138 +++++++++
>  drivers/vfio/mdev/mdev_private.h |  33 +++
>  drivers/vfio/mdev/mdev_sysfs.c   | 300 ++++++++++++++++++++
>  include/linux/mdev.h             | 232 +++++++++++++++
>  9 files changed, 1316 insertions(+)
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev_core.c
>  create mode 100644 drivers/vfio/mdev/mdev_driver.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>  create mode 100644 include/linux/mdev.h
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
>  
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>  source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..7c70753e54ab 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..951e2bb06a3f
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,11 @@
> +
> +config MDEV
> +    tristate "Mediated device driver framework"
> +    depends on VFIO
> +    default n
> +    help
> +        MDEV provides a framework to virtualize devices without SR-IOV cap
> +        See Documentation/mdev.txt for more details.

Documentation pointer still doesn't exist.  Perhaps this file would be
a more appropriate place than the commit log for some of the
information above.

Every time I review this I'm struggling to figure out why this isn't
VFIO_MDEV since it's really tied to vfio and difficult to evaluate it
as some sort of standalone mediated device interface.  I don't know
the answer, but it always strikes me as a discontinuity.

> +
> +        If you don't know what to do here, say N.
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..2c6d11f7bc24
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,5 @@
> +
> +mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
> +
> +obj-$(CONFIG_MDEV) += mdev.o
> +
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> new file mode 100644
> index 000000000000..3c45ed2ae1e9
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -0,0 +1,595 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/cdev.h>
> +#include <linux/sched.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +#define MDEV_CLASS_NAME		"mdev"
> +
> +static struct devices_list {
> +	struct list_head    dev_list;
> +	struct mutex        list_lock;
> +} parent_devices;
> +

I imagine this is following the example of struct vfio in vfio.c but
for this usage the following seems much easier:

static LIST_HEAD(parent_list);
static DEFINE_MUTEX(parent_list_lock);

Then you can also remove the initialization from mdev_init().

> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}
> +
> +static struct mdev_device *find_mdev_device(struct parent_device *parent,
> +					    uuid_le uuid, int instance)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +
> +	list_for_each_entry(p, &parent->mdev_list, next) {
> +		if ((uuid_le_cmp(p->uuid, uuid) == 0) &&
> +		    (p->instance == instance)) {
> +			mdev = p;

Locking here is still broken; the callers are create and destroy, which
can still race each other and themselves.
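
The pattern being pointed at can be sketched in plain userspace C (illustrative only; the names and the pthread mutex stand in for the kernel's list and mutex and are not part of the patch): the duplicate lookup and the insertion must sit inside one critical section so two racing creators cannot both pass the check.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the per-parent mdev list keyed by (uuid, instance). */
struct entry {
	char uuid[37];
	int instance;
	struct entry *next;
};

static struct entry *entries;
static pthread_mutex_t entries_lock = PTHREAD_MUTEX_INITIALIZER;

/* Must be called with entries_lock held, mirroring the comment style
 * used on find_parent_device() in the patch. */
static struct entry *find_entry(const char *uuid, int instance)
{
	struct entry *e;

	for (e = entries; e; e = e->next)
		if (!strcmp(e->uuid, uuid) && e->instance == instance)
			return e;
	return NULL;
}

/* The duplicate check and the list insertion happen under a single
 * lock acquisition, so a racing second creator sees the first
 * creator's entry and fails with -EEXIST. */
static int create_entry(const char *uuid, int instance)
{
	struct entry *e;
	int ret = 0;

	pthread_mutex_lock(&entries_lock);
	if (find_entry(uuid, instance)) {
		ret = -EEXIST;
		goto out;
	}
	e = calloc(1, sizeof(*e));
	if (!e) {
		ret = -ENOMEM;
		goto out;
	}
	strncpy(e->uuid, uuid, sizeof(e->uuid) - 1);
	e->instance = instance;
	e->next = entries;
	entries = e;
out:
	pthread_mutex_unlock(&entries_lock);
	return ret;
}
```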

> +			break;
> +		}
> +	}
> +	return mdev;
> +}
> +
> +/* Should be called holding parent_devices.list_lock */
> +static struct parent_device *find_parent_device(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	WARN_ON(!mutex_is_locked(&parent_devices.list_lock));
> +	list_for_each_entry(p, &parent_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			parent = p;
> +			break;
> +		}
> +	}
> +	return parent;
> +}
> +
> +static void mdev_release_parent(struct kref *kref)
> +{
> +	struct parent_device *parent = container_of(kref, struct parent_device,
> +						    ref);
> +	kfree(parent);
> +}
> +
> +static
> +inline struct parent_device *mdev_get_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_get(&parent->ref);
> +
> +	return parent;
> +}
> +
> +static inline void mdev_put_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_put(&parent->ref, mdev_release_parent);
> +}
> +
> +static struct parent_device *mdev_get_parent_by_dev(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(p, &parent_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			parent = mdev_get_parent(p);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return parent;
> +}
> +
> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->create) {

How would a parent_device without ops->create or ops->destroy be useful?
Perhaps mdev_register_device() should enforce required ops.  mdev.h
should at least document which ops are optional if they really are
optional.

> +		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
> +					mdev->instance, mdev_params);
> +		if (ret)
> +			goto create_ops_err;
> +	}
> +
> +	ret = mdev_add_attribute_group(&mdev->dev,
> +					parent->ops->mdev_attr_groups);

An error here seems to put us in a bad place: the device is created but
the attributes are broken.  Is it the caller's responsibility to
destroy?  Seems like we need a cleanup if this fails.

> +create_ops_err:
> +	mutex_unlock(&parent->ops_lock);

It seems like ops_lock isn't used so much as a lock as a serialization
mechanism.  Why?  Where is this serialization per parent device
documented?

> +	return ret;
> +}
> +
> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	/*
> +	 * If vendor driver doesn't return success that means vendor
> +	 * driver doesn't support hot-unplug
> +	 */
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->destroy) {
> +		ret = parent->ops->destroy(parent->dev, mdev->uuid,
> +					   mdev->instance);
> +		if (ret && !force) {

It seems this is not so much a 'force' as an 'ignore errors'; we never
actually force the mdev driver to destroy the device... which makes me
wonder if there are leaks there.

> +			ret = -EBUSY;
> +			goto destroy_ops_err;
> +		}
> +	}
> +	mdev_remove_attribute_group(&mdev->dev,
> +				    parent->ops->mdev_attr_groups);
> +destroy_ops_err:
> +	mutex_unlock(&parent->ops_lock);
> +
> +	return ret;
> +}
> +
> +static void mdev_release_device(struct kref *kref)
> +{
> +	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
> +	struct parent_device *parent = mdev->parent;
> +
> +	device_unregister(&mdev->dev);
> +	wake_up(&parent->release_done);
> +	mdev_put_parent(parent);
> +}
> +
> +struct mdev_device *mdev_get_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_get(&mdev->ref);
> +
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device);
> +
> +void mdev_put_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_put(&mdev->ref, mdev_release_device);
> +}
> +EXPORT_SYMBOL(mdev_put_device);
> +
> +/*
> + * Find first mediated device from given uuid and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +static struct mdev_device *mdev_get_first_device_by_uuid(uuid_le uuid)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(parent, &parent_devices.dev_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);

This lock ordering is something we'll need to keep in mind.

> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (uuid_le_cmp(p->uuid, uuid) == 0) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return mdev;
> +}
> +
> +/*
> + * Find mediated device from given iommu_group and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(parent, &parent_devices.dev_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (!p->group)
> +				continue;
> +
> +			if (iommu_group_id(p->group) == iommu_group_id(group)) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device_by_group);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +
> +	/* Check for duplicate */
> +	parent = find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +	list_add(&parent->next, &parent_devices.dev_list);
> +	mutex_unlock(&parent_devices.list_lock);

find_parent_device() matches based on parent->dev, but we're dropping
the list lock before we set up parent->dev.  There are other ways to
shorten the time this lock is held, but releasing it with an incomplete
entry in the list is not the way I would choose.
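
One way to sketch the ordering being described, in userspace C (hypothetical names; a pthread mutex stands in for the kernel mutex): fully initialize the object first, then publish it to the shared list under the lock, so a concurrent lookup can never observe a half-built entry. The duplicate check from mdev_register_device() would live in the same critical section.

```c
#include <pthread.h>
#include <stdlib.h>

/* Toy stand-in for parent_device: dev is the key that lookups match on. */
struct parent {
	const void *dev;
	struct parent *next;
};

static struct parent *parents;
static pthread_mutex_t parents_lock = PTHREAD_MUTEX_INITIALIZER;

/* Initialize every field before linking the entry into the shared
 * list.  A concurrent find_parent_device()-style walk can never see
 * an entry whose key is still unset. */
static int register_parent(const void *dev)
{
	struct parent *p = calloc(1, sizeof(*p));

	if (!p)
		return -1;
	p->dev = dev;	/* set the lookup key before publishing */

	pthread_mutex_lock(&parents_lock);
	p->next = parents;
	parents = p;
	pthread_mutex_unlock(&parents_lock);
	return 0;
}

static int parent_registered(const void *dev)
{
	struct parent *p;
	int found = 0;

	pthread_mutex_lock(&parents_lock);
	for (p = parents; p; p = p->next)
		if (p->dev == dev)
			found = 1;
	pthread_mutex_unlock(&parents_lock);
	return found;
}
```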

> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +	mutex_init(&parent->ops_lock);
> +	mutex_init(&parent->mdev_list_lock);
> +	INIT_LIST_HEAD(&parent->mdev_list);
> +	init_waitqueue_head(&parent->release_done);
> +
> +	ret = mdev_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_lock(&parent_devices.list_lock);
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_devices.list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_devices.list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a parent device
> + * @dev: device structure representing parent device.
> + *
> + * Remove device from list of registered parent devices. Give a chance to free
> + * existing mediated devices for given device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +	struct mdev_device *mdev, *n;
> +	int ret;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	parent = find_parent_device(dev);
> +
> +	if (!parent) {
> +		mutex_unlock(&parent_devices.list_lock);
> +		return;
> +	}
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	/*
> +	 * Remove parent from the list and remove create and destroy sysfs
> +	 * files so that no new mediated device could be created for this parent
> +	 */
> +	list_del(&parent->next);
> +	mdev_remove_sysfs_files(dev);
> +	mutex_unlock(&parent_devices.list_lock);
> +
> +	mutex_lock(&parent->ops_lock);
> +	mdev_remove_attribute_group(dev,
> +				    parent->ops->dev_attr_groups);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
> +		mdev_device_destroy_ops(mdev, true);
> +		list_del(&mdev->next);
> +		mdev_put_device(mdev);
> +	}
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	do {
> +		ret = wait_event_interruptible_timeout(parent->release_done,
> +				list_empty(&parent->mdev_list), HZ * 10);

But we do a list_del for each mdev in mdev_list above, so how could the
list not be empty here?  I think you're trying to wait for all the mdev
devices to be released, but I don't think this does that.  Isn't the
list empty regardless?
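
A userspace analogue of what the wait presumably wants to achieve (illustrative only; a counter plus condition variable standing in for the kref and waitqueue): wait on the count of outstanding references, not on a list the unregister path itself has already emptied.

```c
#include <pthread.h>

static int live_refs;
static pthread_mutex_t ref_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ref_cond = PTHREAD_COND_INITIALIZER;

static void ref_get(void)
{
	pthread_mutex_lock(&ref_lock);
	live_refs++;
	pthread_mutex_unlock(&ref_lock);
}

/* The release side signals waiters when the last reference drops --
 * the role mdev_release_device() plays with wake_up() in the patch. */
static void ref_put(void)
{
	pthread_mutex_lock(&ref_lock);
	if (--live_refs == 0)
		pthread_cond_broadcast(&ref_cond);
	pthread_mutex_unlock(&ref_lock);
}

static int ref_count(void)
{
	int n;

	pthread_mutex_lock(&ref_lock);
	n = live_refs;
	pthread_mutex_unlock(&ref_lock);
	return n;
}

/* Unregister blocks until every outstanding reference has been put,
 * rather than testing list_empty() on a list it already flushed. */
static void wait_for_all_released(void)
{
	pthread_mutex_lock(&ref_lock);
	while (live_refs > 0)
		pthread_cond_wait(&ref_cond, &ref_lock);
	pthread_mutex_unlock(&ref_lock);
}
```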

> +		if (ret == -ERESTARTSYS) {
> +			dev_warn(dev, "Mediated devices are in use, task"
> +				      " \"%s\" (%d) "
> +				      "blocked until all are released",
> +				      current->comm, task_pid_nr(current));
> +		}
> +	} while (ret <= 0);
> +
> +	mdev_put_parent(parent);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev-sysfs
> + */
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev);
> +}
> +
> +int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +		       char *mdev_params)
> +{
> +	int ret;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	/* Check for duplicate */
> +	mdev = find_mdev_device(parent, uuid, instance);

But this doesn't actually prevent duplicates, because we're not
holding any lock to guarantee that another racing process doesn't
create the same {uuid, instance} between where we check and the
list_add below.

> +	if (mdev) {
> +		ret = -EEXIST;
> +		goto create_err;
> +	}
> +
> +	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> +	if (!mdev) {
> +		ret = -ENOMEM;
> +		goto create_err;
> +	}
> +
> +	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> +	mdev->instance = instance;
> +	mdev->parent = parent;
> +	mutex_init(&mdev->ops_lock);
> +	kref_init(&mdev->ref);
> +
> +	mdev->dev.parent  = dev;
> +	mdev->dev.bus     = &mdev_bus_type;
> +	mdev->dev.release = mdev_device_release;
> +	dev_set_name(&mdev->dev, "%pUb-%d", uuid.b, instance);
> +
> +	ret = device_register(&mdev->dev);
> +	if (ret) {
> +		put_device(&mdev->dev);
> +		goto create_err;
> +	}
> +
> +	ret = mdev_device_create_ops(mdev, mdev_params);
> +	if (ret)
> +		goto create_failed;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_add(&mdev->next, &parent->mdev_list);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	dev_dbg(&mdev->dev, "MDEV: created\n");
> +
> +	return ret;
> +
> +create_failed:
> +	device_unregister(&mdev->dev);
> +
> +create_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance)
> +{
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	int ret;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}
> +
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (!mdev) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}

Likewise, without locking, multiple callers can get here with the same
mdev.

> +
> +	ret = mdev_device_destroy_ops(mdev, false);
> +	if (ret)
> +		goto destroy_err;
> +
> +	mdev_put_parent(parent);
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_del(&mdev->next);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +
> +destroy_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +void mdev_device_supported_config(struct device *dev, char *str)
> +{
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +
> +	if (parent) {
> +		mutex_lock(&parent->ops_lock);
> +		if (parent->ops->supported_config)
> +			parent->ops->supported_config(parent->dev, str);
> +		mutex_unlock(&parent->ops_lock);
> +		mdev_put_parent(parent);
> +	}
> +}
> +
> +int mdev_device_start(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->start)
> +		ret = parent->ops->start(mdev->uuid);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	if (ret)
> +		pr_err("mdev_start failed  %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
> +
> +	mdev_put_device(mdev);
> +
> +	return ret;
> +}
> +
> +int mdev_device_shutdown(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->shutdown)
> +		ret = parent->ops->shutdown(mdev->uuid);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	if (ret)
> +		pr_err("mdev_shutdown failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +}
> +
> +static struct class mdev_class = {
> +	.name		= MDEV_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= mdev_class_attrs,
> +};
> +
> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	mutex_init(&parent_devices.list_lock);
> +	INIT_LIST_HEAD(&parent_devices.dev_list);
> +
> +	ret = class_register(&mdev_class);
> +	if (ret) {
> +		pr_err("Failed to register mdev class\n");
> +		return ret;
> +	}
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		class_unregister(&mdev_class);
> +		return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	mdev_bus_unregister();
> +	class_unregister(&mdev_class);
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
> new file mode 100644
> index 000000000000..f1aed541111d
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_driver.c
> @@ -0,0 +1,138 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdev_attach_iommu(struct mdev_device *mdev)
> +{
> +	int ret;
> +	struct iommu_group *group;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	ret = iommu_group_add_device(group, &mdev->dev);
> +	if (ret) {
> +		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdev->group = group;
> +
> +	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return ret;
> +}
> +
> +static void mdev_detach_iommu(struct mdev_device *mdev)
> +{
> +	iommu_group_remove_device(&mdev->dev);

mdev->group = NULL; seems prudent

> +	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdev_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	ret = mdev_attach_iommu(mdev);
> +	if (ret) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return ret;
> +	}
> +
> +	if (drv && drv->probe)
> +		ret = drv->probe(dev);
> +
> +	return ret;
> +}
> +
> +static int mdev_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdev_detach_iommu(mdev);
> +
> +	return 0;
> +}
> +
> +static int mdev_match(struct device *dev, struct device_driver *drv)
> +{
> +	struct mdev_driver *mdrv = to_mdev_driver(drv);

nit, drv above, mdrv here

> +
> +	if (mdrv && mdrv->match)
> +		return mdrv->match(dev);
> +
> +	return 0;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.match		= mdev_match,
> +	.probe		= mdev_probe,
> +	.remove		= mdev_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/*
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: module owner of driver to be registered
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/*
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..991d7f796169
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,33 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +			char *mdev_params);
> +int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
> +void mdev_device_supported_config(struct device *dev, char *str);
> +int  mdev_device_start(uuid_le uuid);
> +int  mdev_device_shutdown(uuid_le uuid);

nit, stop is to start as shutdown is to startup.  IOW, should this be
mdev_device_stop()?

> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> new file mode 100644
> index 000000000000..48b66e40009e
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -0,0 +1,300 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +/* Static functions */
> +
> +#define UUID_CHAR_LENGTH	36
> +#define UUID_BYTE_LENGTH	16
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
> +
> +static inline bool is_uuid_sep(char sep)
> +{
> +	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
> +		return true;
> +	return false;
> +}
> +
> +static int uuid_parse(const char *str, uuid_le *uuid)
> +{
> +	int i;
> +
> +	if (strlen(str) < UUID_CHAR_LENGTH)
> +		return -EINVAL;
> +
> +	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
> +		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
> +			pr_err("%s err", __func__);
> +			return -EINVAL;
> +		}
> +
> +		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
> +		str += 2;
> +		if (is_uuid_sep(*str))
> +			str++;
> +	}
> +
> +	return 0;
> +}
> +
> +/* mdev sysfs Functions */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str, *ptr;
> +	ssize_t n;
> +
> +	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	ptr = str;
> +	mdev_device_supported_config(dev, str);
> +
> +	n = sprintf(buf, "%s\n", str);
> +	kfree(ptr);
> +
> +	return n;
> +}
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *instance_str, *mdev_params = NULL, *params = NULL;
> +	uuid_le uuid;
> +	uint32_t instance;
> +	int ret;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_create: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev instance not present %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	instance_str = strsep(&str, ":");
> +	if (!instance_str) {
> +		pr_err("mdev_create: Empty instance string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	ret = kstrtouint(instance_str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	if (str)
> +		params = mdev_params = kstrdup(str, GFP_KERNEL);
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_create: UUID parse error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	ret = mdev_device_create(dev, uuid, instance, mdev_params);
> +	if (ret)
> +		pr_err("mdev_create: Failed to create mdev device\n");
> +	else
> +		ret = count;
> +
> +create_error:
> +	kfree(params);
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	char *uuid_str, *str, *pstr;
> +	uuid_le uuid;
> +	unsigned int instance;
> +	int ret;
> +
> +	str = pstr = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	if (str == NULL) {
> +		pr_err("mdev_destroy: instance not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = kstrtouint(str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_destroy: instance parsing error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_destroy: UUID parse error  %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	ret = mdev_device_destroy(dev, uuid, instance);
> +	if (ret == 0)
> +		ret = count;
> +
> +destroy_error:
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
> +			 const char *buf, size_t count)
> +{
> +	char *uuid_str, *ptr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_start: UUID parse error  %s\n", buf);
> +		goto start_error;
> +	}
> +
> +	ret = mdev_device_start(uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +start_error:
> +	kfree(ptr);
> +	return ret;
> +}
> +
> +ssize_t mdev_shutdown_store(struct class *class, struct class_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *uuid_str, *ptr;
> +	uuid_le uuid;
> +	int ret;
> +
> +	ptr = uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	ret = uuid_parse(uuid_str, &uuid);
> +	if (ret) {
> +		pr_err("mdev_shutdown: UUID parse error %s\n", buf);
> +		goto shutdown_error;
> +	}
> +
> +	ret = mdev_device_shutdown(uuid);
> +	if (ret == 0)
> +		ret = count;
> +
> +shutdown_error:
> +	kfree(ptr);
> +	return ret;
> +}
> +
> +struct class_attribute mdev_class_attrs[] = {
> +	__ATTR_WO(mdev_start),
> +	__ATTR_WO(mdev_shutdown),
> +	__ATTR_NULL
> +};
> +
> +int mdev_create_sysfs_files(struct device *dev)
> +{
> +	int ret;
> +
> +	ret = sysfs_create_file(&dev->kobj,
> +				&dev_attr_mdev_supported_types.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_supported_types sysfs entry\n");
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_create sysfs entry\n");
> +		goto create_sysfs_failed;
> +	}
> +
> +	ret = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +	if (ret) {
> +		pr_err("Failed to create mdev_destroy sysfs entry\n");
> +		sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	} else
> +		return ret;
> +
> +create_sysfs_failed:
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	return ret;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +}
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..31b6f8572cfa
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,232 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +/* Common Data structures */
> +
> +struct pci_region_info {
> +	uint64_t start;
> +	uint64_t size;
> +	uint32_t flags;		/* VFIO region info flags */
> +};
> +
> +enum mdev_emul_space {
> +	EMUL_CONFIG_SPACE,	/* PCI configuration space */
> +	EMUL_IO,		/* I/O register space */
> +	EMUL_MMIO		/* Memory-mapped I/O space */
> +};


I'm still confused why this is needed, perhaps a description here would
be useful so I can stop asking.  Clearly config space is PCI only, so
it's strange to have it in the common code.  Everyone not on x86 will
say I/O space is also strange.  I can't keep it in my head why the
read/write offsets aren't sufficient for the driver to figure out what
type it is.


> +
> +struct parent_device;
> +
> +/*
> + * Mediated device
> + */
> +
> +struct mdev_device {
> +	struct device		dev;
> +	struct parent_device	*parent;
> +	struct iommu_group	*group;
> +	void			*iommu_data;
> +	uuid_le			uuid;
> +	uint32_t		instance;
> +
> +	/* internal only */
> +	struct kref		ref;
> +	struct mutex		ops_lock;
> +	struct list_head	next;
> +};
> +
> +
> +/**
> + * struct parent_ops - Structure to be registered for each parent device to
> + * register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Default attributes of the parent device.
> + * @mdev_attr_groups:	Default attributes of the mediated device.
> + * @supported_config:	Called to get information about supported types.
> + *			@dev : device structure of parent device.
> + *			@config: should return string listing supported config
> + *			Returns integer: success (0) or error (< 0)
> + * @create:		Called to allocate basic resources in parent device's
> + *			driver for a particular mediated device
> + *			@dev: parent device structure on which mediated device
> + *			      should be created
> + *			@uuid: UUID of the VM this mediated device is intended for
> + *			@instance: mediated instance in that VM
> + *			@mdev_params: extra parameters required by parent
> + *			device's driver.
> + *			Returns integer: success (0) or error (< 0)
> + * @destroy:		Called to free resources in parent device's driver for
> + *			a mediated device instance of that VM.
> + *			@dev: parent device structure to which this mediated
> + *			      device points to.
> + *			@uuid: VM's uuid for which the mediated device belongs
> + *			@instance: mdev instance in that VM
> + *			Returns integer: success (0) or error (< 0)
> + *			If VM is running and destroy() is called that means the
> + *			mdev is being hot-unplugged. Return error if VM is running
> + *			and driver doesn't support mediated device hotplug.
> + * @start:		Called to initiate the mediated device initialization
> + *			process in the parent device's driver when the VM
> + *			boots, before the VMM starts.
> + *			@uuid: VM's UUID which is booting.
> + *			Returns integer: success (0) or error (< 0)
> + * @shutdown:		Called to teardown mediated device related resources for
> + *			the VM
> + *			@uuid: VM's UUID which is shutting down.
> + *			Returns integer: success (0) or error (< 0)
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number of bytes to read
> + *			@address_space: specifies which address space the
> + *			request is intended for - PCI config space, I/O
> + *			register space or MMIO space.
> + *			@addr: address.
> + *			Returns number of bytes read on success, or an error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number of bytes to be written
> + *			@address_space: specifies which address space the
> + *			request is intended for - PCI config space, I/O
> + *			register space or MMIO space.
> + *			@addr: address.
> + *			Returns number of bytes written on success, or an error.
> + * @set_irqs:		Called to send interrupt configuration information
> + *			set by the VMM.
> + *			@mdev: mediated device structure
> + *			@flags, index, start, count and *data : same as that of
> + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> + * @get_region_info:	Called to get VFIO region size and flags of mediated
> + *			device.
> + *			@mdev: mediated device structure
> + *			@region_index: VFIO region index
> + *			@region_info: output, returns size and flags of
> + *				      requested region.
> + *			Returns integer: success (0) or error (< 0)
> + * @validate_map_request: Validate remap pfn request
> + *			@mdev: mediated device structure
> + *			@virtaddr: target user address to start at
> + *			@pfn: parent address of kernel memory, vendor driver
> + *			      can change if required.
> + *			@size: size of map area, vendor driver can change the
> + *			       size of map area if desired.
> + *			@prot: page protection flags for this mapping, vendor
> + *			       driver can change, if required.
> + *			Returns integer: success (0) or error (< 0)
> + *
> + * A parent device that supports mediated devices should be registered with
> + * the mdev module along with a parent_ops structure.
> + */
> +
> +struct parent_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +
> +	int	(*supported_config)(struct device *dev, char *config);
> +	int     (*create)(struct device *dev, uuid_le uuid,
> +			  uint32_t instance, char *mdev_params);
> +	int     (*destroy)(struct device *dev, uuid_le uuid,
> +			   uint32_t instance);
> +	int     (*start)(uuid_le uuid);
> +	int     (*shutdown)(uuid_le uuid);
> +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> +			enum mdev_emul_space address_space, loff_t pos);
> +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> +			 enum mdev_emul_space address_space, loff_t pos);
> +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> +			    unsigned int index, unsigned int start,
> +			    unsigned int count, void *data);
> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> +				 struct pci_region_info *region_info);

This can't be a PCI-specific region_info.  How do you intend to support things
like sparse mmap capabilities in the user REGION_INFO ioctl when such
things are not part of the mediated device API?  Seems like the driver
should just return a buffer.

> +	int	(*validate_map_request)(struct mdev_device *vdev,
> +					unsigned long virtaddr,
> +					unsigned long *pfn, unsigned long *size,
> +					pgprot_t *prot);
> +};
> +
> +/*
> + * Parent Device
> + */
> +struct parent_device {
> +	struct device		*dev;
> +	const struct parent_ops	*ops;
> +
> +	/* internal */
> +	struct kref		ref;
> +	struct mutex		ops_lock;
> +	struct list_head	next;
> +	struct list_head	mdev_list;
> +	struct mutex		mdev_list_lock;
> +	wait_queue_head_t	release_done;
> +};
> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @match: called when new device or driver is added for this bus. Return 1 if
> + *	   given device can be handled by given driver and zero otherwise.
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	int  (*match)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}
> +
> +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
> +{
> +	return dev_get_drvdata(&mdev->dev);
> +}
> +
> +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
> +{
> +	dev_set_drvdata(&mdev->dev, data);
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct parent_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +extern struct mdev_device *mdev_get_device(struct mdev_device *mdev);
> +extern void mdev_put_device(struct mdev_device *mdev);
> +
> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
> +
> +#endif /* MDEV_H */


* Re: [PATCH 2/3] VFIO driver for mediated PCI device
  2016-06-20 16:31   ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-21 22:48     ` Alex Williamson
  -1 siblings, 0 replies; 51+ messages in thread
From: Alex Williamson @ 2016-06-21 22:48 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: shuai.ruan, kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel,
	pbonzini, bjsdjshi, zhiyuan.lv

On Mon, 20 Jun 2016 22:01:47 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VFIO driver registers with MDEV core driver. MDEV core driver creates
> mediated device and calls probe routine of MPCI VFIO driver. This MPCI
> VFIO driver adds mediated device to VFIO core module.
> Main aim of this module is to manage all VFIO APIs for each mediated PCI
> device.
> Those are:
> - get region information from vendor driver.
> - trap and emulate PCI config space and BAR region.
> - Send interrupt configuration information to vendor driver.
> - mmap mappable region with invalidate mapping and fault on access to
>   remap pfn.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
> ---
>  drivers/vfio/mdev/Kconfig           |   7 +
>  drivers/vfio/mdev/Makefile          |   1 +
>  drivers/vfio/mdev/vfio_mpci.c       | 654 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/vfio_pci_private.h |   6 -
>  drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
>  include/linux/vfio.h                |   7 +
>  6 files changed, 670 insertions(+), 6 deletions(-)
>  create mode 100644 drivers/vfio/mdev/vfio_mpci.c
> 
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> index 951e2bb06a3f..8d9e78aaa80f 100644
> --- a/drivers/vfio/mdev/Kconfig
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -9,3 +9,10 @@ config MDEV
>  
>          If you don't know what to do here, say N.
>  
> +config VFIO_MPCI
> +    tristate "VFIO support for Mediated PCI devices"
> +    depends on VFIO && PCI && MDEV
> +    default n
> +    help
> +        VFIO based driver for mediated PCI devices.
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> index 2c6d11f7bc24..cd5e7625e1ec 100644
> --- a/drivers/vfio/mdev/Makefile
> +++ b/drivers/vfio/mdev/Makefile
> @@ -2,4 +2,5 @@
>  mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
>  
>  obj-$(CONFIG_MDEV) += mdev.o
> +obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
>  
> diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
> new file mode 100644
> index 000000000000..267879a05c39
> --- /dev/null
> +++ b/drivers/vfio/mdev/vfio_mpci.c
> @@ -0,0 +1,654 @@
> +/*
> + * VFIO based Mediated PCI device driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "NVIDIA Corporation"
> +#define DRIVER_DESC     "VFIO based Mediated PCI device driver"
> +
> +struct vfio_mdev {
> +	struct iommu_group *group;
> +	struct mdev_device *mdev;
> +	int		    refcnt;
> +	struct pci_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
> +	u8		    *vconfig;
> +	struct mutex	    vfio_mdev_lock;
> +};
> +
> +static int get_mdev_region_info(struct mdev_device *mdev,
> +				struct pci_region_info *vfio_region_info,
> +				int index)
> +{
> +	int ret = -EINVAL;
> +	struct parent_device *parent = mdev->parent;
> +
> +	if (parent && dev_is_pci(parent->dev) && parent->ops->get_region_info) {
> +		mutex_lock(&mdev->ops_lock);
> +		ret = parent->ops->get_region_info(mdev, index,
> +						    vfio_region_info);
> +		mutex_unlock(&mdev->ops_lock);

Why do we have two ops_lock, one on the parent_device and one on the
mdev_device?!  Is this one actually locking anything or also just
providing serialization?  Why do some things get serialized at the
parent level and some things at the device level?  Very confused by
ops_lock.

> +	}
> +	return ret;
> +}
> +
> +static void mdev_read_base(struct vfio_mdev *vmdev)
> +{
> +	int index, pos;
> +	u32 start_lo, start_hi;
> +	u32 mem_type;
> +
> +	pos = PCI_BASE_ADDRESS_0;
> +
> +	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
> +
> +		if (!vmdev->vfio_region_info[index].size)
> +			continue;
> +
> +		start_lo = (*(u32 *)(vmdev->vconfig + pos)) &
> +					PCI_BASE_ADDRESS_MEM_MASK;
> +		mem_type = (*(u32 *)(vmdev->vconfig + pos)) &
> +					PCI_BASE_ADDRESS_MEM_TYPE_MASK;
> +
> +		switch (mem_type) {
> +		case PCI_BASE_ADDRESS_MEM_TYPE_64:
> +			start_hi = (*(u32 *)(vmdev->vconfig + pos + 4));
> +			pos += 4;
> +			break;
> +		case PCI_BASE_ADDRESS_MEM_TYPE_32:
> +		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
> +			/* 1M mem BAR treated as 32-bit BAR */
> +		default:
> +			/* mem unknown type treated as 32-bit BAR */
> +			start_hi = 0;
> +			break;
> +		}
> +		pos += 4;
> +		vmdev->vfio_region_info[index].start = ((u64)start_hi << 32) |
> +							start_lo;
> +	}
> +}
> +
> +static int vfio_mpci_open(void *device_data)
> +{
> +	int ret = 0;
> +	struct vfio_mdev *vmdev = device_data;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +
> +	mutex_lock(&vmdev->vfio_mdev_lock);
> +	if (!vmdev->refcnt) {
> +		u8 *vconfig;
> +		int index;
> +		struct pci_region_info *cfg_reg;
> +
> +		for (index = VFIO_PCI_BAR0_REGION_INDEX;
> +		     index < VFIO_PCI_NUM_REGIONS; index++) {
> +			ret = get_mdev_region_info(vmdev->mdev,
> +						&vmdev->vfio_region_info[index],
> +						index);
> +			if (ret)
> +				goto open_error;
> +		}
> +		cfg_reg = &vmdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX];
> +		if (!cfg_reg->size) {
> +			ret = -EINVAL;
> +			goto open_error;
> +		}
> +
> +		vconfig = kzalloc(cfg_reg->size, GFP_KERNEL);
> +		if (!vconfig) {
> +			ret = -ENOMEM;
> +			goto open_error;
> +		}
> +
> +		vmdev->vconfig = vconfig;
> +	}
> +
> +	vmdev->refcnt++;
> +open_error:
> +
> +	mutex_unlock(&vmdev->vfio_mdev_lock);
> +	if (ret)
> +		module_put(THIS_MODULE);
> +
> +	return ret;
> +}
> +
> +static void vfio_mpci_close(void *device_data)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +
> +	mutex_lock(&vmdev->vfio_mdev_lock);
> +	vmdev->refcnt--;
> +	if (!vmdev->refcnt) {
> +		memset(&vmdev->vfio_region_info, 0,
> +			sizeof(vmdev->vfio_region_info));
> +		kfree(vmdev->vconfig);
> +	}
> +	mutex_unlock(&vmdev->vfio_mdev_lock);
> +	module_put(THIS_MODULE);
> +}
> +
> +static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
> +{
> +	/* Don't support MSIX for now */
> +	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
> +		return -1;
> +
> +	return 1;

Too much hard coding here, the mediated driver should define this.

> +}
> +
> +static long vfio_mpci_unlocked_ioctl(void *device_data,
> +				     unsigned int cmd, unsigned long arg)
> +{
> +	int ret = 0;
> +	struct vfio_mdev *vmdev = device_data;
> +	unsigned long minsz;
> +
> +	switch (cmd) {
> +	case VFIO_DEVICE_GET_INFO:
> +	{
> +		struct vfio_device_info info;
> +
> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		info.flags = VFIO_DEVICE_FLAGS_PCI;
> +		info.num_regions = VFIO_PCI_NUM_REGIONS;
> +		info.num_irqs = VFIO_PCI_NUM_IRQS;
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_DEVICE_GET_REGION_INFO:
> +	{
> +		struct vfio_region_info info;
> +
> +		minsz = offsetofend(struct vfio_region_info, offset);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		switch (info.index) {
> +		case VFIO_PCI_CONFIG_REGION_INDEX:
> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> +			info.size = vmdev->vfio_region_info[info.index].size;
> +			if (!info.size) {
> +				info.flags = 0;
> +				break;
> +			}
> +
> +			info.flags = vmdev->vfio_region_info[info.index].flags;
> +			break;
> +		case VFIO_PCI_VGA_REGION_INDEX:
> +		case VFIO_PCI_ROM_REGION_INDEX:
> +		default:
> +			return -EINVAL;
> +		}
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_DEVICE_GET_IRQ_INFO:
> +	{
> +		struct vfio_irq_info info;
> +
> +		minsz = offsetofend(struct vfio_irq_info, count);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
> +			return -EINVAL;
> +
> +		switch (info.index) {
> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
> +		case VFIO_PCI_REQ_IRQ_INDEX:
> +			break;
> +			/* pass thru to return error */
> +		case VFIO_PCI_MSIX_IRQ_INDEX:
> +		default:
> +			return -EINVAL;
> +		}
> +
> +		info.count = VFIO_PCI_NUM_IRQS;

???  This is set again 2 lines below

> +		info.flags = VFIO_IRQ_INFO_EVENTFD;
> +		info.count = mdev_get_irq_count(vmdev, info.index);
> +
> +		if (info.count == -1)
> +			return -EINVAL;
> +
> +		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
> +			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
> +					VFIO_IRQ_INFO_AUTOMASKED);
> +		else
> +			info.flags |= VFIO_IRQ_INFO_NORESIZE;
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_DEVICE_SET_IRQS:
> +	{
> +		struct vfio_irq_set hdr;
> +		struct mdev_device *mdev = vmdev->mdev;
> +		struct parent_device *parent = vmdev->mdev->parent;
> +		u8 *data = NULL, *ptr = NULL;
> +
> +		minsz = offsetofend(struct vfio_irq_set, count);
> +
> +		if (copy_from_user(&hdr, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
> +		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
> +		    VFIO_IRQ_SET_ACTION_TYPE_MASK))
> +			return -EINVAL;
> +
> +		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
> +			size_t size;
> +			int max = mdev_get_irq_count(vmdev, hdr.index);
> +
> +			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
> +				size = sizeof(uint8_t);
> +			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
> +				size = sizeof(int32_t);
> +			else
> +				return -EINVAL;
> +
> +			if (hdr.argsz - minsz < hdr.count * size ||
> +			    hdr.start >= max || hdr.start + hdr.count > max)
> +				return -EINVAL;
> +
> +			ptr = data = memdup_user((void __user *)(arg + minsz),
> +						 hdr.count * size);
> +			if (IS_ERR(data))
> +				return PTR_ERR(data);
> +		}
> +
> +		if (parent && parent->ops->set_irqs) {
> +			mutex_lock(&mdev->ops_lock);
> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
> +						    hdr.start, hdr.count, data);
> +			mutex_unlock(&mdev->ops_lock);

Device level serialization on set_irqs... interesting.

> +		}
> +
> +		kfree(ptr);
> +		return ret;
> +	}
> +	}
> +	return -ENOTTY;
> +}
> +
> +ssize_t mdev_dev_config_rw(struct vfio_mdev *vmdev, char __user *buf,
> +			   size_t count, loff_t *ppos, bool iswrite)
> +{
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct parent_device *parent = mdev->parent;
> +	int size = vmdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
> +	int ret = 0;
> +	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (pos < 0 || pos >= size ||
> +	    pos + count > size) {
> +		pr_err("%s pos 0x%llx out of range\n", __func__, pos);
> +		ret = -EFAULT;
> +		goto config_rw_exit;
> +	}
> +
> +	if (iswrite) {
> +		char *usr_data, *ptr;
> +
> +		ptr = usr_data = memdup_user(buf, count);
> +		if (IS_ERR(usr_data)) {
> +			ret = PTR_ERR(usr_data);
> +			goto config_rw_exit;
> +		}
> +
> +		ret = parent->ops->write(mdev, usr_data, count,
> +					  EMUL_CONFIG_SPACE, pos);

No serialization on this ops, thank goodness, but why?

This read/write interface still seems strange to me...

> +
> +		memcpy((void *)(vmdev->vconfig + pos), (void *)usr_data, count);
> +		kfree(ptr);
> +	} else {
> +		char *ret_data, *ptr;
> +
> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);
> +		if (!ret_data) {
> +			ret = -ENOMEM;
> +			goto config_rw_exit;
> +		}
> +
> +		ret = parent->ops->read(mdev, ret_data, count,
> +					EMUL_CONFIG_SPACE, pos);
> +
> +		if (ret > 0) {
> +			if (copy_to_user(buf, ret_data, ret))
> +				ret = -EFAULT;
> +			else
> +				memcpy((void *)(vmdev->vconfig + pos),
> +					(void *)ret_data, count);
> +		}
> +		kfree(ptr);

So vconfig caches all of config space for the mdev, but we only ever
use it to read the BAR address via mdev_read_base()... why?  I hope the
mdev driver doesn't freak out if the user reads the mmio region before
writing a base address (remember the vfio API aspect of the interface
doesn't necessarily follow the VM PCI programming API)

> +	}
> +config_rw_exit:
> +
> +	if (ret > 0)
> +		*ppos += ret;
> +
> +	return ret;
> +}
> +
> +ssize_t mdev_dev_bar_rw(struct vfio_mdev *vmdev, char __user *buf,
> +			size_t count, loff_t *ppos, bool iswrite)
> +{
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct parent_device *parent = mdev->parent;
> +	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
> +	loff_t pos;
> +	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	int ret = 0;
> +
> +	if (!vmdev->vfio_region_info[bar_index].start)
> +		mdev_read_base(vmdev);
> +
> +	if (offset >= vmdev->vfio_region_info[bar_index].size) {
> +		ret = -EINVAL;
> +		goto bar_rw_exit;
> +	}
> +
> +	count = min(count,
> +		    (size_t)(vmdev->vfio_region_info[bar_index].size - offset));
> +
> +	pos = vmdev->vfio_region_info[bar_index].start + offset;

In the case of a mpci dev, @start is the vconfig BAR value, so it's
user (guest) writable, and the mediated driver is supposed to
understand that?  I suppose it saw the config write too, if there was
one, but the mediated driver gives us region info based on region index.
We have the region index here.  Why wouldn't we do reads and writes
based on region index and offset and eliminate vconfig?  Seems like
that would consolidate a lot of this, we don't care what we're reading
and writing, just pass it through.  Mediated pci drivers would simply
* Re: [Qemu-devel] [PATCH 2/3] VFIO driver for mediated PCI device
@ 2016-06-21 22:48     ` Alex Williamson
  0 siblings, 0 replies; 51+ messages in thread
From: Alex Williamson @ 2016-06-21 22:48 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

On Mon, 20 Jun 2016 22:01:47 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VFIO driver registers with MDEV core driver. MDEV core driver creates
> mediated device and calls probe routine of MPCI VFIO driver. This MPCI
> VFIO driver adds mediated device to VFIO core module.
> Main aim of this module is to manage all VFIO APIs for each mediated PCI
> device.
> Those are:
> - get region information from vendor driver.
> - trap and emulate PCI config space and BAR region.
> - Send interrupt configuration information to vendor driver.
> - mmap mappable region with invalidate mapping and fault on access to
>   remap pfn.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
> ---
>  drivers/vfio/mdev/Kconfig           |   7 +
>  drivers/vfio/mdev/Makefile          |   1 +
>  drivers/vfio/mdev/vfio_mpci.c       | 654 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/vfio_pci_private.h |   6 -
>  drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
>  include/linux/vfio.h                |   7 +
>  6 files changed, 670 insertions(+), 6 deletions(-)
>  create mode 100644 drivers/vfio/mdev/vfio_mpci.c
> 
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> index 951e2bb06a3f..8d9e78aaa80f 100644
> --- a/drivers/vfio/mdev/Kconfig
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -9,3 +9,10 @@ config MDEV
>  
>          If you don't know what do here, say N.
>  
> +config VFIO_MPCI
> +    tristate "VFIO support for Mediated PCI devices"
> +    depends on VFIO && PCI && MDEV
> +    default n
> +    help
> +        VFIO based driver for mediated PCI devices.
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> index 2c6d11f7bc24..cd5e7625e1ec 100644
> --- a/drivers/vfio/mdev/Makefile
> +++ b/drivers/vfio/mdev/Makefile
> @@ -2,4 +2,5 @@
>  mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
>  
>  obj-$(CONFIG_MDEV) += mdev.o
> +obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
>  
> diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
> new file mode 100644
> index 000000000000..267879a05c39
> --- /dev/null
> +++ b/drivers/vfio/mdev/vfio_mpci.c
> @@ -0,0 +1,654 @@
> +/*
> + * VFIO based Mediated PCI device driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "NVIDIA Corporation"
> +#define DRIVER_DESC     "VFIO based Mediated PCI device driver"
> +
> +struct vfio_mdev {
> +	struct iommu_group *group;
> +	struct mdev_device *mdev;
> +	int		    refcnt;
> +	struct pci_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
> +	u8		    *vconfig;
> +	struct mutex	    vfio_mdev_lock;
> +};
> +
> +static int get_mdev_region_info(struct mdev_device *mdev,
> +				struct pci_region_info *vfio_region_info,
> +				int index)
> +{
> +	int ret = -EINVAL;
> +	struct parent_device *parent = mdev->parent;
> +
> +	if (parent && dev_is_pci(parent->dev) && parent->ops->get_region_info) {
> +		mutex_lock(&mdev->ops_lock);
> +		ret = parent->ops->get_region_info(mdev, index,
> +						    vfio_region_info);
> +		mutex_unlock(&mdev->ops_lock);

Why do we have two ops_lock, one on the parent_device and one on the
mdev_device?!  Is this one actually locking anything or also just
providing serialization?  Why do some things get serialized at the
parent level and some things at the device level?  Very confused by
ops_lock.

> +	}
> +	return ret;
> +}
> +
> +static void mdev_read_base(struct vfio_mdev *vmdev)
> +{
> +	int index, pos;
> +	u32 start_lo, start_hi;
> +	u32 mem_type;
> +
> +	pos = PCI_BASE_ADDRESS_0;
> +
> +	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
> +
> +		if (!vmdev->vfio_region_info[index].size)
> +			continue;
> +
> +		start_lo = (*(u32 *)(vmdev->vconfig + pos)) &
> +					PCI_BASE_ADDRESS_MEM_MASK;
> +		mem_type = (*(u32 *)(vmdev->vconfig + pos)) &
> +					PCI_BASE_ADDRESS_MEM_TYPE_MASK;
> +
> +		switch (mem_type) {
> +		case PCI_BASE_ADDRESS_MEM_TYPE_64:
> +			start_hi = (*(u32 *)(vmdev->vconfig + pos + 4));
> +			pos += 4;
> +			break;
> +		case PCI_BASE_ADDRESS_MEM_TYPE_32:
> +		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
> +			/* 1M mem BAR treated as 32-bit BAR */
> +		default:
> +			/* mem unknown type treated as 32-bit BAR */
> +			start_hi = 0;
> +			break;
> +		}
> +		pos += 4;
> +		vmdev->vfio_region_info[index].start = ((u64)start_hi << 32) |
> +							start_lo;
> +	}
> +}
> +
> +static int vfio_mpci_open(void *device_data)
> +{
> +	int ret = 0;
> +	struct vfio_mdev *vmdev = device_data;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +
> +	mutex_lock(&vmdev->vfio_mdev_lock);
> +	if (!vmdev->refcnt) {
> +		u8 *vconfig;
> +		int index;
> +		struct pci_region_info *cfg_reg;
> +
> +		for (index = VFIO_PCI_BAR0_REGION_INDEX;
> +		     index < VFIO_PCI_NUM_REGIONS; index++) {
> +			ret = get_mdev_region_info(vmdev->mdev,
> +						&vmdev->vfio_region_info[index],
> +						index);
> +			if (ret)
> +				goto open_error;
> +		}
> +		cfg_reg = &vmdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX];
> +		if (!cfg_reg->size)
> +			goto open_error;
> +
> +		vconfig = kzalloc(cfg_reg->size, GFP_KERNEL);
> +		if (IS_ERR(vconfig)) {
> +			ret = PTR_ERR(vconfig);
> +			goto open_error;
> +		}
> +
> +		vmdev->vconfig = vconfig;
> +	}
> +
> +	vmdev->refcnt++;
> +open_error:
> +
> +	mutex_unlock(&vmdev->vfio_mdev_lock);
> +	if (ret)
> +		module_put(THIS_MODULE);
> +
> +	return ret;
> +}
> +
> +static void vfio_mpci_close(void *device_data)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +
> +	mutex_lock(&vmdev->vfio_mdev_lock);
> +	vmdev->refcnt--;
> +	if (!vmdev->refcnt) {
> +		memset(&vmdev->vfio_region_info, 0,
> +			sizeof(vmdev->vfio_region_info));
> +		kfree(vmdev->vconfig);
> +	}
> +	mutex_unlock(&vmdev->vfio_mdev_lock);
> +	module_put(THIS_MODULE);
> +}
> +
> +static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
> +{
> +	/* Don't support MSIX for now */
> +	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
> +		return -1;
> +
> +	return 1;

Too much hard coding here, the mediated driver should define this.

> +}
> +
> +static long vfio_mpci_unlocked_ioctl(void *device_data,
> +				     unsigned int cmd, unsigned long arg)
> +{
> +	int ret = 0;
> +	struct vfio_mdev *vmdev = device_data;
> +	unsigned long minsz;
> +
> +	switch (cmd) {
> +	case VFIO_DEVICE_GET_INFO:
> +	{
> +		struct vfio_device_info info;
> +
> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		info.flags = VFIO_DEVICE_FLAGS_PCI;
> +		info.num_regions = VFIO_PCI_NUM_REGIONS;
> +		info.num_irqs = VFIO_PCI_NUM_IRQS;
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_DEVICE_GET_REGION_INFO:
> +	{
> +		struct vfio_region_info info;
> +
> +		minsz = offsetofend(struct vfio_region_info, offset);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		switch (info.index) {
> +		case VFIO_PCI_CONFIG_REGION_INDEX:
> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> +			info.size = vmdev->vfio_region_info[info.index].size;
> +			if (!info.size) {
> +				info.flags = 0;
> +				break;
> +			}
> +
> +			info.flags = vmdev->vfio_region_info[info.index].flags;
> +			break;
> +		case VFIO_PCI_VGA_REGION_INDEX:
> +		case VFIO_PCI_ROM_REGION_INDEX:
> +		default:
> +			return -EINVAL;
> +		}
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_DEVICE_GET_IRQ_INFO:
> +	{
> +		struct vfio_irq_info info;
> +
> +		minsz = offsetofend(struct vfio_irq_info, count);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
> +			return -EINVAL;
> +
> +		switch (info.index) {
> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
> +		case VFIO_PCI_REQ_IRQ_INDEX:
> +			break;
> +			/* pass thru to return error */
> +		case VFIO_PCI_MSIX_IRQ_INDEX:
> +		default:
> +			return -EINVAL;
> +		}
> +
> +		info.count = VFIO_PCI_NUM_IRQS;

???  This is set again 2 lines below

> +		info.flags = VFIO_IRQ_INFO_EVENTFD;
> +		info.count = mdev_get_irq_count(vmdev, info.index);
> +
> +		if (info.count == -1)
> +			return -EINVAL;
> +
> +		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
> +			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
> +					VFIO_IRQ_INFO_AUTOMASKED);
> +		else
> +			info.flags |= VFIO_IRQ_INFO_NORESIZE;
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_DEVICE_SET_IRQS:
> +	{
> +		struct vfio_irq_set hdr;
> +		struct mdev_device *mdev = vmdev->mdev;
> +		struct parent_device *parent = vmdev->mdev->parent;
> +		u8 *data = NULL, *ptr = NULL;
> +
> +		minsz = offsetofend(struct vfio_irq_set, count);
> +
> +		if (copy_from_user(&hdr, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
> +		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
> +		    VFIO_IRQ_SET_ACTION_TYPE_MASK))
> +			return -EINVAL;
> +
> +		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
> +			size_t size;
> +			int max = mdev_get_irq_count(vmdev, hdr.index);
> +
> +			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
> +				size = sizeof(uint8_t);
> +			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
> +				size = sizeof(int32_t);
> +			else
> +				return -EINVAL;
> +
> +			if (hdr.argsz - minsz < hdr.count * size ||
> +			    hdr.start >= max || hdr.start + hdr.count > max)
> +				return -EINVAL;
> +
> +			ptr = data = memdup_user((void __user *)(arg + minsz),
> +						 hdr.count * size);
> +			if (IS_ERR(data))
> +				return PTR_ERR(data);
> +		}
> +
> +		if (parent && parent->ops->set_irqs) {
> +			mutex_lock(&mdev->ops_lock);
> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
> +						    hdr.start, hdr.count, data);
> +			mutex_unlock(&mdev->ops_lock);

Device level serialization on set_irqs... interesting.

> +		}
> +
> +		kfree(ptr);
> +		return ret;
> +	}
> +	}
> +	return -ENOTTY;
> +}
> +
> +ssize_t mdev_dev_config_rw(struct vfio_mdev *vmdev, char __user *buf,
> +			   size_t count, loff_t *ppos, bool iswrite)
> +{
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct parent_device *parent = mdev->parent;
> +	int size = vmdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
> +	int ret = 0;
> +	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (pos < 0 || pos >= size ||
> +	    pos + count > size) {
> +		pr_err("%s pos 0x%llx out of range\n", __func__, pos);
> +		ret = -EFAULT;
> +		goto config_rw_exit;
> +	}
> +
> +	if (iswrite) {
> +		char *usr_data, *ptr;
> +
> +		ptr = usr_data = memdup_user(buf, count);
> +		if (IS_ERR(usr_data)) {
> +			ret = PTR_ERR(usr_data);
> +			goto config_rw_exit;
> +		}
> +
> +		ret = parent->ops->write(mdev, usr_data, count,
> +					  EMUL_CONFIG_SPACE, pos);

No serialization on this ops, thank goodness, but why?

This read/write interface still seems strange to me...

> +
> +		memcpy((void *)(vmdev->vconfig + pos), (void *)usr_data, count);
> +		kfree(ptr);
> +	} else {
> +		char *ret_data, *ptr;
> +
> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);
> +
> +		if (IS_ERR(ret_data)) {
> +			ret = PTR_ERR(ret_data);
> +			goto config_rw_exit;
> +		}
> +
> +		ret = parent->ops->read(mdev, ret_data, count,
> +					EMUL_CONFIG_SPACE, pos);
> +
> +		if (ret > 0) {
> +			if (copy_to_user(buf, ret_data, ret))
> +				ret = -EFAULT;
> +			else
> +				memcpy((void *)(vmdev->vconfig + pos),
> +					(void *)ret_data, count);
> +		}
> +		kfree(ptr);

So vconfig caches all of config space for the mdev, but we only ever
use it to read the BAR address via mdev_read_base()... why?  I hope the
mdev driver doesn't freak out if the user reads the mmio region before
writing a base address (remember the vfio API aspect of the interface
doesn't necessarily follow the VM PCI programming API)

> +	}
> +config_rw_exit:
> +
> +	if (ret > 0)
> +		*ppos += ret;
> +
> +	return ret;
> +}
> +
> +ssize_t mdev_dev_bar_rw(struct vfio_mdev *vmdev, char __user *buf,
> +			size_t count, loff_t *ppos, bool iswrite)
> +{
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct parent_device *parent = mdev->parent;
> +	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
> +	loff_t pos;
> +	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	int ret = 0;
> +
> +	if (!vmdev->vfio_region_info[bar_index].start)
> +		mdev_read_base(vmdev);
> +
> +	if (offset >= vmdev->vfio_region_info[bar_index].size) {
> +		ret = -EINVAL;
> +		goto bar_rw_exit;
> +	}
> +
> +	count = min(count,
> +		    (size_t)(vmdev->vfio_region_info[bar_index].size - offset));
> +
> +	pos = vmdev->vfio_region_info[bar_index].start + offset;

In the case of a mpci dev, @start is the vconfig BAR value, so it's
user (guest) writable, and the mediated driver is supposed to
understand that?  I suppose it saw the config write too, if there was
one, but the mediated driver gives us region info based on region index.
We have the region index here.  Why wouldn't we do reads and writes
based on region index and offset and eliminate vconfig?  Seems like
that would consolidate a lot of this, we don't care what we're reading
and writing, just pass it through.  Mediated pci drivers would simply
need to match indexes to those already defined for vfio-pci.

> +
> +	if (iswrite) {
> +		char *usr_data, *ptr;
> +
> +		ptr = usr_data = memdup_user(buf, count);
> +		if (IS_ERR(usr_data)) {
> +			ret = PTR_ERR(usr_data);
> +			goto bar_rw_exit;
> +		}
> +
> +		ret = parent->ops->write(mdev, usr_data, count, EMUL_MMIO, pos);
> +
> +		kfree(ptr);
> +	} else {
> +		char *ret_data, *ptr;
> +
> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);
> +
> +		if (!ret_data) {
> +			ret = -ENOMEM;
> +			goto bar_rw_exit;
> +		}
> +
> +		ret = parent->ops->read(mdev, ret_data, count, EMUL_MMIO, pos);
> +
> +		if (ret > 0) {
> +			if (copy_to_user(buf, ret_data, ret))
> +				ret = -EFAULT;
> +		}
> +		kfree(ptr);
> +	}
> +
> +bar_rw_exit:
> +
> +	if (ret > 0)
> +		*ppos += ret;
> +
> +	return ret;
> +}
> +
> +
> +static ssize_t mdev_dev_rw(void *device_data, char __user *buf,
> +			   size_t count, loff_t *ppos, bool iswrite)
> +{
> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	struct vfio_mdev *vmdev = device_data;
> +
> +	if (index >= VFIO_PCI_NUM_REGIONS)
> +		return -EINVAL;
> +
> +	switch (index) {
> +	case VFIO_PCI_CONFIG_REGION_INDEX:
> +		return mdev_dev_config_rw(vmdev, buf, count, ppos, iswrite);
> +
> +	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> +		return mdev_dev_bar_rw(vmdev, buf, count, ppos, iswrite);
> +
> +	case VFIO_PCI_ROM_REGION_INDEX:
> +	case VFIO_PCI_VGA_REGION_INDEX:
> +		break;
> +	}
> +
> +	return -EINVAL;
> +}
> +
> +
> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
> +			      size_t count, loff_t *ppos)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	if (!count)
> +		return 0;
> +
> +	if (IS_ERR_OR_NULL(buf))
> +		return -EINVAL;
> +
> +	if (parent && parent->ops->read) {
> +		mutex_lock(&mdev->ops_lock);
> +		ret = mdev_dev_rw(device_data, buf, count, ppos, false);
> +		mutex_unlock(&mdev->ops_lock);
> +	}

Argh, we do serialize reads and writes per device, why?!

> +
> +	return ret;
> +}
> +
> +static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
> +			       size_t count, loff_t *ppos)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	if (!count)
> +		return 0;
> +
> +	if (IS_ERR_OR_NULL(buf))
> +		return -EINVAL;
> +
> +	if (parent && parent->ops->write) {
> +		mutex_lock(&mdev->ops_lock);
> +		ret = mdev_dev_rw(device_data, (char __user *)buf, count,
> +				  ppos, true);
> +		mutex_unlock(&mdev->ops_lock);
> +	}
> +
> +	return ret;
> +}
> +
> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	int ret;
> +	struct vfio_mdev *vmdev = vma->vm_private_data;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	u64 virtaddr = (u64)vmf->virtual_address;
> +	u64 offset, phyaddr;
> +	unsigned long req_size, pgoff;
> +	pgprot_t pg_prot;
> +
> +	if (!vmdev && !vmdev->mdev)
> +		return -EINVAL;
> +
> +	mdev = vmdev->mdev;
> +	parent  = mdev->parent;
> +
> +	offset   = virtaddr - vma->vm_start;
> +	phyaddr  = (vma->vm_pgoff << PAGE_SHIFT) + offset;
> +	pgoff    = phyaddr >> PAGE_SHIFT;
> +	req_size = vma->vm_end - virtaddr;
> +	pg_prot  = vma->vm_page_prot;
> +
> +	if (parent && parent->ops->validate_map_request) {
> +		mutex_lock(&mdev->ops_lock);
> +		ret = parent->ops->validate_map_request(mdev, virtaddr,
> +							 &pgoff, &req_size,
> +							 &pg_prot);
> +		mutex_unlock(&mdev->ops_lock);
> +		if (ret)
> +			return ret;
> +
> +		if (!req_size)
> +			return -EINVAL;
> +	}
> +
> +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
> +
> +	return ret | VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct mdev_dev_mmio_ops = {
> +	.fault = mdev_dev_mmio_fault,
> +};
> +
> +
> +static int vfio_mpci_mmap(void *device_data, struct vm_area_struct *vma)
> +{
> +	unsigned int index;
> +	struct vfio_mdev *vmdev = device_data;
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct pci_dev *pdev;
> +	unsigned long pgoff;
> +	loff_t offset;
> +
> +	if (!mdev->parent || !dev_is_pci(mdev->parent->dev))
> +		return -EINVAL;
> +
> +	pdev = to_pci_dev(mdev->parent->dev);
> +
> +	offset = vma->vm_pgoff << PAGE_SHIFT;
> +
> +	index = VFIO_PCI_OFFSET_TO_INDEX(offset);
> +
> +	if (index >= VFIO_PCI_ROM_REGION_INDEX)
> +		return -EINVAL;
> +
> +	pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +
> +	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> +
> +	vma->vm_private_data = vmdev;
> +	vma->vm_ops = &mdev_dev_mmio_ops;
> +
> +	return 0;
> +}
> +
> +static const struct vfio_device_ops vfio_mpci_dev_ops = {
> +	.name		= "vfio-mpci",
> +	.open		= vfio_mpci_open,
> +	.release	= vfio_mpci_close,
> +	.ioctl		= vfio_mpci_unlocked_ioctl,
> +	.read		= vfio_mpci_read,
> +	.write		= vfio_mpci_write,
> +	.mmap		= vfio_mpci_mmap,
> +};
> +
> +int vfio_mpci_probe(struct device *dev)
> +{
> +	struct vfio_mdev *vmdev;
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	if (!mdev)
> +		return -EINVAL;

How could that happen?

> +
> +	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
> +	if (IS_ERR(vmdev))
> +		return PTR_ERR(vmdev);
> +
> +	vmdev->mdev = mdev_get_device(mdev);
> +	vmdev->group = mdev->group;
> +	mutex_init(&vmdev->vfio_mdev_lock);
> +
> +	ret = vfio_add_group_dev(dev, &vfio_mpci_dev_ops, vmdev);
> +	if (ret)
> +		kfree(vmdev);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +}
> +
> +void vfio_mpci_remove(struct device *dev)
> +{
> +	struct vfio_mdev *vmdev;
> +
> +	vmdev = vfio_del_group_dev(dev);
> +	kfree(vmdev);
> +}
> +
> +int vfio_mpci_match(struct device *dev)
> +{
> +	if (dev_is_pci(dev->parent))
> +		return 1;
> +
> +	return 0;
> +}
> +
> +struct mdev_driver vfio_mpci_driver = {
> +	.name	= "vfio_mpci",
> +	.probe	= vfio_mpci_probe,
> +	.remove	= vfio_mpci_remove,
> +	.match	= vfio_mpci_match,
> +};
> +
> +static int __init vfio_mpci_init(void)
> +{
> +	return mdev_register_driver(&vfio_mpci_driver, THIS_MODULE);
> +}
> +
> +static void __exit vfio_mpci_exit(void)
> +{
> +	mdev_unregister_driver(&vfio_mpci_driver);
> +}
> +
> +module_init(vfio_mpci_init)
> +module_exit(vfio_mpci_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 8a7d546d18a0..04a450908ffb 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -19,12 +19,6 @@
>  #ifndef VFIO_PCI_PRIVATE_H
>  #define VFIO_PCI_PRIVATE_H
>  
> -#define VFIO_PCI_OFFSET_SHIFT   40
> -
> -#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
> -#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> -#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> -
>  /* Special capability IDs predefined access */
>  #define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
>  #define PCI_CAP_ID_INVALID_VIRT		0xFE	/* default virt access */
> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> index 5ffd1d9ad4bd..5b912be9d9c3 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -18,6 +18,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/io.h>
>  #include <linux/vgaarb.h>
> +#include <linux/vfio.h>
>  
>  #include "vfio_pci_private.h"
>  
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 0ecae0b1cd34..431b824b0d3e 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -18,6 +18,13 @@
>  #include <linux/poll.h>
>  #include <uapi/linux/vfio.h>
>  
> +#define VFIO_PCI_OFFSET_SHIFT   40
> +
> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> +
> +
>  /**
>   * struct vfio_device_ops - VFIO bus driver device callbacks
>   *

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/3] VFIO Type1 IOMMU: Add support for mediated devices
  2016-06-20 16:31   ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-22  3:46     ` Alex Williamson
  -1 siblings, 0 replies; 51+ messages in thread
From: Alex Williamson @ 2016-06-22  3:46 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: shuai.ruan, kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel,
	pbonzini, bjsdjshi, zhiyuan.lv

On Mon, 20 Jun 2016 22:01:48 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VFIO Type1 IOMMU driver is designed for the devices which are IOMMU
> capable. Mediated device only uses IOMMU TYPE1 API, the underlying
> hardware can be managed by an IOMMU domain.
> 
> This change exports functions to pin and unpin pages for mediated devices.
> It maintains data of pinned pages for mediated domain. This data is used to
> verify unpinning request and to unpin remaining pages from detach_group()
> if there are any.
> 
> Aim of this change is:
> - To use most of the code of IOMMU driver for mediated devices
> - To support direct assigned device and mediated device by single module
> 
> Updated the change to keep mediated domain structure out of domain_list.
> 
> Tested by assigning below combinations of devices to a single VM:
> - GPU pass through only
> - vGPU device only
> - One GPU pass through and one vGPU device
> - two GPU pass through
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
> ---
>  drivers/vfio/vfio_iommu_type1.c | 444 +++++++++++++++++++++++++++++++++++++---
>  include/linux/vfio.h            |   6 +
>  2 files changed, 418 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 75b24e93cedb..f17dd104fe27 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -36,6 +36,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/vfio.h>
>  #include <linux/workqueue.h>
> +#include <linux/mdev.h>
>  
>  #define DRIVER_VERSION  "0.2"
>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
> @@ -55,6 +56,7 @@ MODULE_PARM_DESC(disable_hugepages,
>  
>  struct vfio_iommu {
>  	struct list_head	domain_list;
> +	struct vfio_domain	*mediated_domain;

I'm not really a fan of how this is so often used to special case the
code...

>  	struct mutex		lock;
>  	struct rb_root		dma_list;
>  	bool			v2;
> @@ -67,6 +69,13 @@ struct vfio_domain {
>  	struct list_head	group_list;
>  	int			prot;		/* IOMMU_CACHE */
>  	bool			fgsp;		/* Fine-grained super pages */
> +
> +	/* Domain for mediated device which is without physical IOMMU */
> +	bool			mediated_device;

But sometimes we use this to special case the code and other times we
use domain_list being empty.  I thought the argument against pulling
code out to a shared file was that this approach could be made
maintainable.

> +
> +	struct mm_struct	*mm;
> +	struct rb_root		pfn_list;	/* pinned Host pfn list */
> +	struct mutex		pfn_list_lock;	/* mutex for pfn_list */

Seems like we could reduce overhead for the existing use cases by just
adding a pointer here and making these last 3 entries part of the
structure that gets pointed to.  Existence of the pointer would replace
@mediated_device.
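
A minimal userspace sketch of that layout (all type names here are made-up
stand-ins, not the kernel's): the mediated-only fields move into their own
struct and the pointer's existence replaces the bool.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch of the suggested layout: the three mediated-only fields live
 * in their own struct, and the pointer doubles as the flag that
 * previously lived in @mediated_device.  Types are simplified.
 */
struct mdev_domain_state {
	void *mm;		/* struct mm_struct * in the real code */
	void *pfn_list;		/* pinned host pfn rb-tree root */
	/* pfn_list_lock would live here as well */
};

struct vfio_domain_sketch {
	int prot;
	struct mdev_domain_state *mdev;	/* NULL for ordinary IOMMU domains */
};

static int is_mediated(const struct vfio_domain_sketch *d)
{
	return d->mdev != NULL;	/* replaces: if (domain->mediated_device) */
}
```

Ordinary domains pay only one pointer of overhead, and the special-case
test stays a single NULL check.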

>  };
>  
>  struct vfio_dma {
> @@ -79,10 +88,26 @@ struct vfio_dma {
>  
>  struct vfio_group {
>  	struct iommu_group	*iommu_group;
> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)

Where does CONFIG_MDEV_MODULE come from?

Plus, all the #ifdefs... <cringe>

> +	struct mdev_device	*mdev;

This gets set on attach_group where we use the iommu_group to look up
the mdev, so why can't we do that on the other paths that make use of
this?  I think this is just holding a reference.

> +#endif
>  	struct list_head	next;
>  };
>  
>  /*
> + * Guest RAM pinning working set or DMA target
> + */
> +struct vfio_pfn {
> +	struct rb_node		node;
> +	unsigned long		vaddr;		/* virtual addr */
> +	dma_addr_t		iova;		/* IOVA */
> +	unsigned long		npage;		/* number of pages */
> +	unsigned long		pfn;		/* Host pfn */
> +	size_t			prot;
> +	atomic_t		ref_count;
> +};
> +
> +/*
>   * This code handles mapping and unmapping of user data buffers
>   * into DMA'ble space using the IOMMU
>   */
> @@ -130,6 +155,64 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>  	rb_erase(&old->node, &iommu->dma_list);
>  }
>  
> +/*
> + * Helper Functions for host pfn list
> + */
> +
> +static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
> +				      unsigned long pfn)
> +{
> +	struct rb_node *node;
> +	struct vfio_pfn *vpfn, *ret = NULL;
> +
> +	mutex_lock(&domain->pfn_list_lock);
> +	node = domain->pfn_list.rb_node;
> +
> +	while (node) {
> +		vpfn = rb_entry(node, struct vfio_pfn, node);
> +
> +		if (pfn < vpfn->pfn)
> +			node = node->rb_left;
> +		else if (pfn > vpfn->pfn)
> +			node = node->rb_right;
> +		else {
> +			ret = vpfn;
> +			break;
> +		}
> +	}
> +
> +	mutex_unlock(&domain->pfn_list_lock);
> +	return ret;
> +}
> +
> +static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
> +{
> +	struct rb_node **link, *parent = NULL;
> +	struct vfio_pfn *vpfn;
> +
> +	mutex_lock(&domain->pfn_list_lock);
> +	link = &domain->pfn_list.rb_node;
> +	while (*link) {
> +		parent = *link;
> +		vpfn = rb_entry(parent, struct vfio_pfn, node);
> +
> +		if (new->pfn < vpfn->pfn)
> +			link = &(*link)->rb_left;
> +		else
> +			link = &(*link)->rb_right;
> +	}
> +
> +	rb_link_node(&new->node, parent, link);
> +	rb_insert_color(&new->node, &domain->pfn_list);
> +	mutex_unlock(&domain->pfn_list_lock);
> +}
> +
> +/* call by holding domain->pfn_list_lock */
> +static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
> +{
> +	rb_erase(&old->node, &domain->pfn_list);
> +}

Hmm, all the other pfn list interfaces lock themselves, yet this one
requires the caller to hold the lock.  I think that should be called
out by naming it something like __vfio_unlink_pfn() rather than
relying on a comment.
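
A toy illustration of that naming convention (an int flag stands in for
the mutex; the function names are invented for the sketch): the
plain-named wrapper takes the lock, the double-underscore variant
assumes it is already held.

```c
#include <assert.h>

/*
 * __helper assumes the lock is held; the plain-named function acquires
 * it.  An int flag stands in for the mutex, purely for illustration.
 */
static int pfn_list_locked;
static int pfn_list_len = 3;

static void __unlink_pfn_sketch(void)
{
	assert(pfn_list_locked);	/* lockdep-style check of the contract */
	pfn_list_len--;
}

static void unlink_pfn_sketch(void)
{
	pfn_list_locked = 1;		/* mutex_lock(&domain->pfn_list_lock) */
	__unlink_pfn_sketch();
	pfn_list_locked = 0;		/* mutex_unlock(&domain->pfn_list_lock) */
}
```

The assert plays the role of lockdep: a caller that skips the lock
trips it immediately instead of corrupting the list silently.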

> +
>  struct vwork {
>  	struct mm_struct	*mm;
>  	long			npage;
> @@ -228,20 +311,29 @@ static int put_pfn(unsigned long pfn, int prot)
>  	return 0;
>  }
>  
> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> +			 int prot, unsigned long *pfn)
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct mm_struct *local_mm = mm;
>  	int ret = -EFAULT;
>  
> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> +	if (!local_mm && !current->mm)
> +		return -ENODEV;
> +
> +	if (!local_mm)
> +		local_mm = current->mm;

The above would be much more concise if we just initialized local_mm
as: mm ? mm : current->mm
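
A standalone sketch of that shape (struct and function names are
invented for illustration): the two NULL checks collapse into the
initialization, and a NULL result maps to the -ENODEV return.

```c
#include <assert.h>
#include <stddef.h>

struct mm_struct_sketch { int id; };	/* stand-in for struct mm_struct */

/*
 * Caller-supplied mm wins; otherwise fall back to the current task's
 * mm.  NULL comes back only when neither exists (the -ENODEV case).
 */
static struct mm_struct_sketch *select_mm(struct mm_struct_sketch *mm,
					  struct mm_struct_sketch *current_mm)
{
	struct mm_struct_sketch *local_mm = mm ? mm : current_mm;

	return local_mm;
}
```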

> +
> +	down_read(&local_mm->mmap_sem);
> +	if (get_user_pages_remote(NULL, local_mm, vaddr, 1,
> +				!!(prot & IOMMU_WRITE), 0, page, NULL) == 1) {

Um, the comment for get_user_pages_remote says:

"See also get_user_pages_fast, for performance critical applications."

So what penalty are we imposing on the existing behavior of type1
here?  Previously we only needed to acquire mmap_sem if
get_user_pages_fast() didn't work, so the existing use case seems to be
compromised.
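
The existing fast-then-slow shape can be sketched in userspace as
follows (names and the even/odd "fast path" rule are fabricated for the
example; a counter stands in for mmap_sem): the point is that the
common case never touches the lock at all.

```c
#include <assert.h>

struct lookup_stats { int lock_taken; };

/* Pretend even addresses are resolvable without the lock. */
static int fast_lookup(long vaddr, long *pfn)
{
	if ((vaddr & 1) == 0) {
		*pfn = vaddr >> 1;
		return 1;
	}
	return 0;
}

/*
 * Shape of the existing type1 code: try the lockless fast path first
 * (cf. get_user_pages_fast()) and pay for the lock only on fallback.
 */
static int get_pfn_sketch(long vaddr, long *pfn, struct lookup_stats *st)
{
	if (fast_lookup(vaddr, pfn))
		return 0;

	st->lock_taken++;		/* cf. down_read(&mm->mmap_sem) */
	*pfn = vaddr >> 1;		/* slow walk under the lock */
	return 0;			/* up_read() elided in this sketch */
}
```

Unconditionally taking the lock up front, as the patch does, turns
every lookup into the slow path.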

>  		*pfn = page_to_pfn(page[0]);
> -		return 0;
> +		ret = 0;
> +		goto done_pfn;
>  	}
>  
> -	down_read(&current->mm->mmap_sem);
> -
> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
>  
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> @@ -249,7 +341,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>  			ret = 0;
>  	}
>  
> -	up_read(&current->mm->mmap_sem);
> +done_pfn:
> +	up_read(&local_mm->mmap_sem);
>  
>  	return ret;
>  }
> @@ -259,18 +352,19 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>   * first page and all consecutive pages with the same locking.
>   */
> -static long vfio_pin_pages(unsigned long vaddr, long npage,
> -			   int prot, unsigned long *pfn_base)
> +static long vfio_pin_pages_internal(struct vfio_domain *domain,
> +				    unsigned long vaddr, long npage,
> +				    int prot, unsigned long *pfn_base)
>  {
>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  	bool lock_cap = capable(CAP_IPC_LOCK);
>  	long ret, i;
>  	bool rsvd;
>  
> -	if (!current->mm)
> +	if (!domain)
>  		return -ENODEV;

This test doesn't make much sense to me.  The existing use case error
is again being deferred, and in the callers either domain can't be
NULL or it's an exported function where we should be validating the
parameters before calling this function.

>  
> -	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> +	ret = vaddr_get_pfn(domain->mm, vaddr, prot, pfn_base);
>  	if (ret)
>  		return ret;
>  
> @@ -293,7 +387,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
>  		unsigned long pfn = 0;
>  
> -		ret = vaddr_get_pfn(vaddr, prot, &pfn);
> +		ret = vaddr_get_pfn(domain->mm, vaddr, prot, &pfn);
>  		if (ret)
>  			break;
>  
> @@ -318,20 +412,165 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	return i;
>  }
>  
> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> -			     int prot, bool do_accounting)
> +static long vfio_unpin_pages_internal(struct vfio_domain *domain,
> +				      unsigned long pfn, long npage, int prot,
> +				      bool do_accounting)
>  {
>  	unsigned long unlocked = 0;
>  	long i;
>  
> +	if (!domain)
> +		return -ENODEV;
> +

Again, seems like validation of parameters should happen at the caller
in this case.

>  	for (i = 0; i < npage; i++)
>  		unlocked += put_pfn(pfn++, prot);
>  
>  	if (do_accounting)
>  		vfio_lock_acct(-unlocked);
> +	return unlocked;
> +}
> +
> +/*
> + * Pin a set of guest PFNs and return their associated host PFNs for API
> + * supported domain only.
> + * @vaddr [in]: array of guest PFNs

vfio is a userspace driver; never assume the userspace is a VM.  It's
also not really a vaddr since it's a frame number.  Please work on the
names.

> + * @npage [in]: count of array elements
> + * @prot [in] : protection flags
> + * @pfn_base[out] : array of host PFNs

phys_pfn maybe.

> + */
> +long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
> +		   int prot, dma_addr_t *pfn_base)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL;

Unnecessary initialization.

> +	int i = 0, ret = 0;

Same for these.

> +	long retpage;
> +	unsigned long remote_vaddr = 0;

And this.

> +	dma_addr_t *pfn = pfn_base;
> +	struct vfio_dma *dma;
> +
> +	if (!iommu || !vaddr || !pfn_base)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	if (!iommu->mediated_domain) {
> +		ret = -EINVAL;
> +		goto pin_done;
> +	}
> +
> +	domain = iommu->mediated_domain;

You're already validating domain here, which makes the test in
vfio_pin_pages_internal() really seem unnecessary.

> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p, *lpfn;
> +		unsigned long tpfn;
> +		dma_addr_t iova;
> +		long pg_cnt = 1;
> +
> +		iova = vaddr[i] << PAGE_SHIFT;
> +
> +		dma = vfio_find_dma(iommu, iova, 0 /*  size */);
> +		if (!dma) {
> +			ret = -EINVAL;
> +			goto pin_done;

All error paths need to unwind; if we return an error there should be
no state change, otherwise we're leaking pages.
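
For reference, the unwind idiom being asked for looks roughly like this
(a self-contained toy: pin_one/pin_range and the fail_at error
injection are invented for the sketch, counters stand in for pinned
pages): on any failure, everything done so far is released before the
error is returned.

```c
#include <assert.h>

/* Simulate pinning one page; fail_at injects an error mid-loop. */
static int pin_one(int i, int fail_at, int *pinned)
{
	if (i == fail_at)
		return -1;
	(*pinned)++;
	return 0;
}

/*
 * On any failure, release whatever was pinned so far, so an error
 * return leaves no state change behind.
 */
static int pin_range(int npage, int fail_at, int *pinned)
{
	int i, ret;

	for (i = 0; i < npage; i++) {
		ret = pin_one(i, fail_at, pinned);
		if (ret)
			goto unwind;
	}
	return npage;

unwind:
	while (i--)
		(*pinned)--;	/* undo the pages pinned before the failure */
	return ret;
}
```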

> +		}
> +
> +		remote_vaddr = dma->vaddr + iova - dma->iova;
> +
> +		retpage = vfio_pin_pages_internal(domain, remote_vaddr,
> +						  pg_cnt, prot, &tpfn);
> +		if (retpage <= 0) {
> +			WARN_ON(!retpage);
> +			ret = (int)retpage;
> +			goto pin_done;

unwind

> +		}
> +
> +		pfn[i] = tpfn;
> +
> +		/* search if pfn exist */
> +		p = vfio_find_pfn(domain, tpfn);
> +		if (p) {
> +			atomic_inc(&p->ref_count);
> +			continue;
> +		}
> +
> +		/* add to pfn_list */
> +		lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
> +		if (!lpfn) {
> +			ret = -ENOMEM;
> +			goto pin_done;

unwind

> +		}
> +		lpfn->vaddr = remote_vaddr;
> +		lpfn->iova = iova;
> +		lpfn->pfn = pfn[i];
> +		lpfn->npage = 1;

Why do we need this variable if this can only track 1 page?

> +		lpfn->prot = prot;
> +		atomic_inc(&lpfn->ref_count);

atomic_set(); we want to set the ref count to 1, not increment it.
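
The distinction, sketched with C11 atomics standing in for the kernel's
atomic_t (function names invented for the example): a freshly allocated
object has exactly one reference, and the code should say so with a
store rather than an increment that only works because the allocation
was zeroed.

```c
#include <assert.h>
#include <stdatomic.h>

/* A new object has exactly one reference: state it with a store. */
static void refcount_init(atomic_int *ref)
{
	atomic_store(ref, 1);	/* cf. atomic_set(&lpfn->ref_count, 1) */
}

static void refcount_inc(atomic_int *ref)
{
	atomic_fetch_add(ref, 1);
}

/* Returns nonzero when the last reference was dropped. */
static int refcount_dec_and_test(atomic_int *ref)
{
	return atomic_fetch_sub(ref, 1) - 1 == 0;
}
```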

> +		vfio_link_pfn(domain, lpfn);
> +	}
> +
> +	ret = i;
> +
> +pin_done:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(vfio_pin_pages);
> +
> +static int vfio_unpin_pfn(struct vfio_domain *domain,
> +			  struct vfio_pfn *vpfn, bool do_accounting)
> +{
> +	int ret;
> +
> +	ret = vfio_unpin_pages_internal(domain, vpfn->pfn, vpfn->npage,
> +					vpfn->prot, do_accounting);
> +
> +	if (ret > 0 && atomic_dec_and_test(&vpfn->ref_count)) {
> +		vfio_unlink_pfn(domain, vpfn);
> +		kfree(vpfn);
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Unpin set of host PFNs for API supported domain only.
> + * @pfn	[in] : array of host PFNs to be unpinned.
> + * @npage [in] :count of elements in array, that is number of pages.
> + * @prot [in] : protection flags

prot is unused; it's also saved in our pfn list if we did need it.  In
fact, should we compare prot after our vfio_find_pfn above to make sure
our existing pinned page has the right settings?

> + */
> +long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
> +		     int prot)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL;
> +	long unlocked = 0;
> +	int i;
> +
> +	if (!iommu || !pfn)
> +		return -EINVAL;
> +
> +	if (!iommu->mediated_domain)
> +		return -EINVAL;
> +
> +	domain = iommu->mediated_domain;

Again, domain is already validated here.

> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +
> +		/* verify if pfn exist in pfn_list */
> +		p = vfio_find_pfn(domain, *(pfn + i));

Why are we using array indexing above and array math here?  Were these
functions written by different people?

> +		if (!p)
> +			continue;

Hmm, this seems like more of a bad thing than a continue.

> +
> +		mutex_lock(&domain->pfn_list_lock);
> +		unlocked += vfio_unpin_pfn(domain, p, true);
> +		mutex_unlock(&domain->pfn_list_lock);

[huge red flag] the entire vfio_unpin_pfn path is called under
pfn_list_lock, but vfio_pin_pages only uses it sparsely.  Maybe someone
else did write this function.  I'll assume all locking here needs a
revisit.

> +	}
>  
>  	return unlocked;
>  }
> +EXPORT_SYMBOL(vfio_unpin_pages);
>  
>  static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  {
> @@ -341,6 +580,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  
>  	if (!dma->size)
>  		return;
> +
> +	if (list_empty(&iommu->domain_list))
> +		return;

Huh?  This would be a serious consistency error if this happened for
the existing use case.

> +
>  	/*
>  	 * We use the IOMMU to track the physical addresses, otherwise we'd
>  	 * need a much more complicated tracking system.  Unfortunately that
> @@ -382,9 +625,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  		if (WARN_ON(!unmapped))
>  			break;
>  
> -		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
> -					     unmapped >> PAGE_SHIFT,
> -					     dma->prot, false);
> +		unlocked += vfio_unpin_pages_internal(domain,
> +						phys >> PAGE_SHIFT,
> +						unmapped >> PAGE_SHIFT,
> +						dma->prot, false);
>  		iova += unmapped;
>  
>  		cond_resched();
> @@ -517,6 +761,9 @@ static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
>  	long i;
>  	int ret;
>  
> +	if (domain->mediated_device)
> +		return -EINVAL;


Which is it going to be: mediated_device to flag these special domains,
or an empty domain_list?  Let's not use both.

> +
>  	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
>  		ret = iommu_map(domain->domain, iova,
>  				(phys_addr_t)pfn << PAGE_SHIFT,
> @@ -537,6 +784,9 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
>  	struct vfio_domain *d;
>  	int ret;
>  
> +	if (list_empty(&iommu->domain_list))
> +		return 0;
> +
>  	list_for_each_entry(d, &iommu->domain_list, next) {
>  		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
>  				npage << PAGE_SHIFT, prot | d->prot);
> @@ -569,6 +819,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	uint64_t mask;
>  	struct vfio_dma *dma;
>  	unsigned long pfn;
> +	struct vfio_domain *domain = NULL;
>  
>  	/* Verify that none of our __u64 fields overflow */
>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> @@ -611,10 +862,21 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	/* Insert zero-sized and grow as we map chunks of it */
>  	vfio_link_dma(iommu, dma);
>  
> +	/*
> +	 * Skip pin and map if domain list is empty
> +	 */
> +	if (list_empty(&iommu->domain_list)) {
> +		dma->size = size;
> +		goto map_done;
> +	}

Again, this would be a serious consistency error for the existing use
case.  Let's use indicators that are explicit.

> +
> +	domain = list_first_entry(&iommu->domain_list,
> +				  struct vfio_domain, next);
> +
>  	while (size) {
>  		/* Pin a contiguous chunk of memory */
> -		npage = vfio_pin_pages(vaddr + dma->size,
> -				       size >> PAGE_SHIFT, prot, &pfn);
> +		npage = vfio_pin_pages_internal(domain, vaddr + dma->size,
> +						size >> PAGE_SHIFT, prot, &pfn);
>  		if (npage <= 0) {
>  			WARN_ON(!npage);
>  			ret = (int)npage;
> @@ -624,7 +886,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  		/* Map it! */
>  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
>  		if (ret) {
> -			vfio_unpin_pages(pfn, npage, prot, true);
> +			vfio_unpin_pages_internal(domain, pfn, npage,
> +						  prot, true);
>  			break;
>  		}
>  
> @@ -635,6 +898,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	if (ret)
>  		vfio_remove_dma(iommu, dma);
>  
> +map_done:
>  	mutex_unlock(&iommu->lock);
>  	return ret;
>  }
> @@ -658,6 +922,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  	struct rb_node *n;
>  	int ret;
>  
> +	if (domain->mediated_device)
> +		return 0;

Though "mediated_device" names the user, not really a property of the
domain we're trying to support, which is more like track_and_pin_only.
> +
>  	/* Arbitrarily pick the first domain in the list for lookups */
>  	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
>  	n = rb_first(&iommu->dma_list);
> @@ -716,6 +983,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>  	struct page *pages;
>  	int ret, order = get_order(PAGE_SIZE * 2);
>  
> +	if (domain->mediated_device)
> +		return;
> +
>  	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
>  	if (!pages)
>  		return;
> @@ -734,11 +1004,25 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>  	__free_pages(pages, order);
>  }
>  
> +static struct vfio_group *is_iommu_group_present(struct vfio_domain *domain,
> +				   struct iommu_group *iommu_group)

is_foo is a yes/no answer, so the return should be bool.  This is more
of a find_foo_from_bar.
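
The find-style shape, in a standalone sketch (a plain singly linked
list stands in for list_head; all names are illustrative): the caller
gets both the yes/no answer (via the NULL test) and the object itself.

```c
#include <assert.h>
#include <stddef.h>

struct iommu_group_sketch { int id; };

struct group_node {
	struct iommu_group_sketch *iommu_group;
	struct group_node *next;	/* stand-in for list_head */
};

/* find-style helper: returns the matching group, or NULL if absent. */
static struct group_node *
find_iommu_group(struct group_node *head, struct iommu_group_sketch *grp)
{
	struct group_node *g;

	for (g = head; g; g = g->next)
		if (g->iommu_group == grp)
			return g;

	return NULL;
}
```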

> +{
> +	struct vfio_group *g;
> +
> +	list_for_each_entry(g, &domain->group_list, next) {
> +		if (g->iommu_group != iommu_group)
> +			continue;
> +		return g;

Hmmm

if (g->iommu_group == iommu_group)
	return g;

> +	}
> +
> +	return NULL;
> +}
> +
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					 struct iommu_group *iommu_group)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
> -	struct vfio_group *group, *g;
> +	struct vfio_group *group;
>  	struct vfio_domain *domain, *d;
>  	struct bus_type *bus = NULL;
>  	int ret;
> @@ -746,14 +1030,21 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	mutex_lock(&iommu->lock);
>  
>  	list_for_each_entry(d, &iommu->domain_list, next) {
> -		list_for_each_entry(g, &d->group_list, next) {
> -			if (g->iommu_group != iommu_group)
> -				continue;
> +		if (is_iommu_group_present(d, iommu_group)) {
> +			mutex_unlock(&iommu->lock);
> +			return -EINVAL;
> +		}
> +	}
>  
> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> +	if (iommu->mediated_domain) {
> +		if (is_iommu_group_present(iommu->mediated_domain,
> +					   iommu_group)) {
>  			mutex_unlock(&iommu->lock);
>  			return -EINVAL;
>  		}
>  	}
> +#endif
>  
>  	group = kzalloc(sizeof(*group), GFP_KERNEL);
>  	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> @@ -769,6 +1060,36 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_free;
>  
> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> +	if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
> +		struct mdev_device *mdev = NULL;

Unnecessary initialization.

> +
> +		mdev = mdev_get_device_by_group(iommu_group);
> +		if (!mdev)
> +			goto out_free;
> +
> +		mdev->iommu_data = iommu;

This looks rather sketchy to me, we don't have a mediated driver in
this series, but presumably the driver blindly calls vfio_pin_pages
passing mdev->iommu_data and hoping that it's either NULL to generate
an error or relevant to this iommu backend.  How would we add a second
mediated driver iommu backend?  We're currently assuming the user
configured this backend.  Should vfio_pin_pages instead have a struct
device* parameter from which we would lookup the iommu_group and get to
the vfio_domain?  That's a bit heavyweight, but we need something
along those lines.
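
The rough shape of that suggestion, as a toy sketch (the array-based
registry, the struct names, and the id scheme are all fabricated; real
code would go through iommu_group_get() and a group-to-domain lookup):
the pin API takes the device itself rather than trusting an opaque
iommu_data pointer stashed on the mdev.

```c
#include <assert.h>
#include <stddef.h>

struct device_sketch { int group_id; };
struct domain_sketch { int id; };

/* Toy registry mapping group ids to domains (two groups). */
static struct domain_sketch domains[2] = { { 100 }, { 200 } };

/*
 * Real code would do iommu_group_get(dev) and then map the group to
 * its vfio_domain; the bounds-checked array lookup is a stand-in.
 */
static struct domain_sketch *domain_from_device(const struct device_sketch *dev)
{
	if (dev->group_id < 0 || dev->group_id > 1)
		return NULL;
	return &domains[dev->group_id];
}

/* The API takes the device, not a caller-supplied iommu_data pointer. */
static int pin_pages_sketch(const struct device_sketch *dev)
{
	struct domain_sketch *d = domain_from_device(dev);

	return d ? d->id : -1;	/* -1 models rejecting an unknown device */
}
```

A second mediated-driver backend then just registers its own domains;
callers can no longer reach the wrong backend by handing in a stale
pointer.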


> +		group->mdev = mdev;
> +
> +		if (iommu->mediated_domain) {
> +			list_add(&group->next,
> +				 &iommu->mediated_domain->group_list);
> +			kfree(domain);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +		}
> +		domain->mediated_device = true;
> +		domain->mm = current->mm;
> +		INIT_LIST_HEAD(&domain->group_list);
> +		list_add(&group->next, &domain->group_list);
> +		domain->pfn_list = RB_ROOT;
> +		mutex_init(&domain->pfn_list_lock);
> +		iommu->mediated_domain = domain;
> +		mutex_unlock(&iommu->lock);
> +		return 0;
> +	}
> +#endif
> +
>  	domain->domain = iommu_domain_alloc(bus);
>  	if (!domain->domain) {
>  		ret = -EIO;
> @@ -859,6 +1180,20 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
>  		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
>  }
>  
> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> +static void vfio_iommu_unpin_api_domain(struct vfio_domain *domain)
> +{
> +	struct rb_node *node;
> +
> +	mutex_lock(&domain->pfn_list_lock);
> +	while ((node = rb_first(&domain->pfn_list))) {
> +		vfio_unpin_pfn(domain,
> +				rb_entry(node, struct vfio_pfn, node), false);
> +	}
> +	mutex_unlock(&domain->pfn_list_lock);
> +}
> +#endif
> +
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
>  					  struct iommu_group *iommu_group)
>  {
> @@ -868,31 +1203,55 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  
>  	mutex_lock(&iommu->lock);
>  
> -	list_for_each_entry(domain, &iommu->domain_list, next) {
> -		list_for_each_entry(group, &domain->group_list, next) {
> -			if (group->iommu_group != iommu_group)
> -				continue;
> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> +	if (iommu->mediated_domain) {
> +		domain = iommu->mediated_domain;
> +		group = is_iommu_group_present(domain, iommu_group);
> +		if (group) {
> +			if (group->mdev) {
> +				group->mdev->iommu_data = NULL;
> +				mdev_put_device(group->mdev);
> +			}
> +			list_del(&group->next);
> +			kfree(group);
> +
> +			if (list_empty(&domain->group_list)) {
> +				vfio_iommu_unpin_api_domain(domain);
> +
> +				if (list_empty(&iommu->domain_list))
> +					vfio_iommu_unmap_unpin_all(iommu);
> +
> +				kfree(domain);
> +				iommu->mediated_domain = NULL;
> +			}
> +		}
> +	}
> +#endif
>  
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		group = is_iommu_group_present(domain, iommu_group);
> +		if (group) {
>  			iommu_detach_group(domain->domain, iommu_group);
>  			list_del(&group->next);
>  			kfree(group);
>  			/*
>  			 * Group ownership provides privilege, if the group
>  			 * list is empty, the domain goes away.  If it's the
> -			 * last domain, then all the mappings go away too.
> +			 * last domain with an iommu and the API-only domain doesn't
> +			 * exist, then all the mappings go away too.
>  			 */
>  			if (list_empty(&domain->group_list)) {
> -				if (list_is_singular(&iommu->domain_list))
> +				if (list_is_singular(&iommu->domain_list) &&
> +				    !iommu->mediated_domain)
>  					vfio_iommu_unmap_unpin_all(iommu);
>  				iommu_domain_free(domain->domain);
>  				list_del(&domain->next);
>  				kfree(domain);
>  			}
> -			goto done;
> +			break;
>  		}
>  	}
>  
> -done:
>  	mutex_unlock(&iommu->lock);
>  }
>  
> @@ -930,8 +1289,28 @@ static void vfio_iommu_type1_release(void *iommu_data)
>  	struct vfio_domain *domain, *domain_tmp;
>  	struct vfio_group *group, *group_tmp;
>  
> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> +	if (iommu->mediated_domain) {
> +		domain = iommu->mediated_domain;
> +		list_for_each_entry_safe(group, group_tmp,
> +					 &domain->group_list, next) {
> +			if (group->mdev) {
> +				group->mdev->iommu_data = NULL;
> +				mdev_put_device(group->mdev);
> +			}
> +			list_del(&group->next);
> +			kfree(group);
> +		}
> +		vfio_iommu_unpin_api_domain(domain);
> +		kfree(domain);
> +		iommu->mediated_domain = NULL;
> +	}
> +#endif

I'm not really seeing how this is all that much more maintainable than
what was proposed previously, has this aspect been worked on since last
I reviewed this patch?

>  	vfio_iommu_unmap_unpin_all(iommu);
>  
> +	if (list_empty(&iommu->domain_list))
> +		goto release_exit;
> +
>  	list_for_each_entry_safe(domain, domain_tmp,
>  				 &iommu->domain_list, next) {
>  		list_for_each_entry_safe(group, group_tmp,
> @@ -945,6 +1324,7 @@ static void vfio_iommu_type1_release(void *iommu_data)
>  		kfree(domain);
>  	}
>  
> +release_exit:
>  	kfree(iommu);
>  }
>  
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 431b824b0d3e..0a907bb33426 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -134,6 +134,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>  }
>  #endif /* CONFIG_EEH */
>  
> +extern long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
> +			   int prot, dma_addr_t *pfn_base);
> +
> +extern long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
> +			     int prot);
> +
>  /*
>   * IRQfd - generic
>   */


* Re: [Qemu-devel] [PATCH 3/3] VFIO Type1 IOMMU: Add support for mediated devices
@ 2016-06-22  3:46     ` Alex Williamson
  0 siblings, 0 replies; 51+ messages in thread
From: Alex Williamson @ 2016-06-22  3:46 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

On Mon, 20 Jun 2016 22:01:48 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VFIO Type1 IOMMU driver is designed for the devices which are IOMMU
> capable. Mediated device only uses IOMMU TYPE1 API, the underlying
> hardware can be managed by an IOMMU domain.
> 
> This change exports functions to pin and unpin pages for mediated devices.
> It maintains data of pinned pages for mediated domain. This data is used to
> verify unpinning request and to unpin remaining pages from detach_group()
> if there are any.
> 
> Aim of this change is:
> - To use most of the code of IOMMU driver for mediated devices
> - To support direct assigned device and mediated device by single module
> 
> Updated the change to keep mediated domain structure out of domain_list.
> 
> Tested by assigning below combinations of devices to a single VM:
> - GPU pass through only
> - vGPU device only
> - One GPU pass through and one vGPU device
> - two GPU pass through
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
> ---
>  drivers/vfio/vfio_iommu_type1.c | 444 +++++++++++++++++++++++++++++++++++++---
>  include/linux/vfio.h            |   6 +
>  2 files changed, 418 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 75b24e93cedb..f17dd104fe27 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -36,6 +36,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/vfio.h>
>  #include <linux/workqueue.h>
> +#include <linux/mdev.h>
>  
>  #define DRIVER_VERSION  "0.2"
>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
> @@ -55,6 +56,7 @@ MODULE_PARM_DESC(disable_hugepages,
>  
>  struct vfio_iommu {
>  	struct list_head	domain_list;
> +	struct vfio_domain	*mediated_domain;

I'm not really a fan of how this is so often used to special case the
code...

>  	struct mutex		lock;
>  	struct rb_root		dma_list;
>  	bool			v2;
> @@ -67,6 +69,13 @@ struct vfio_domain {
>  	struct list_head	group_list;
>  	int			prot;		/* IOMMU_CACHE */
>  	bool			fgsp;		/* Fine-grained super pages */
> +
> +	/* Domain for mediated device which is without physical IOMMU */
> +	bool			mediated_device;

But sometimes we use this to special case the code and other times we
use domain_list being empty.  I thought the argument against pulling
code out to a shared file was that this approach could be made
maintainable.

> +
> +	struct mm_struct	*mm;
> +	struct rb_root		pfn_list;	/* pinned Host pfn list */
> +	struct mutex		pfn_list_lock;	/* mutex for pfn_list */

Seems like we could reduce overhead for the existing use cases by just
adding a pointer here and making these last 3 entries part of the
structure that gets pointed to.  Existence of the pointer would replace
@mediated_device.

>  };
>  
>  struct vfio_dma {
> @@ -79,10 +88,26 @@ struct vfio_dma {
>  
>  struct vfio_group {
>  	struct iommu_group	*iommu_group;
> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)

Where does CONFIG_MDEV_MODULE come from?

Plus, all the #ifdefs... <cringe>

> +	struct mdev_device	*mdev;

This gets set on attach_group where we use the iommu_group to lookup
the mdev, so why can't we do that on the other paths that make use of
this?  I think this is just holding a reference.

> +#endif
>  	struct list_head	next;
>  };
>  
>  /*
> + * Guest RAM pinning working set or DMA target
> + */
> +struct vfio_pfn {
> +	struct rb_node		node;
> +	unsigned long		vaddr;		/* virtual addr */
> +	dma_addr_t		iova;		/* IOVA */
> +	unsigned long		npage;		/* number of pages */
> +	unsigned long		pfn;		/* Host pfn */
> +	size_t			prot;
> +	atomic_t		ref_count;
> +};
> +
> +/*
>   * This code handles mapping and unmapping of user data buffers
>   * into DMA'ble space using the IOMMU
>   */
> @@ -130,6 +155,64 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>  	rb_erase(&old->node, &iommu->dma_list);
>  }
>  
> +/*
> + * Helper Functions for host pfn list
> + */
> +
> +static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
> +				      unsigned long pfn)
> +{
> +	struct rb_node *node;
> +	struct vfio_pfn *vpfn, *ret = NULL;
> +
> +	mutex_lock(&domain->pfn_list_lock);
> +	node = domain->pfn_list.rb_node;
> +
> +	while (node) {
> +		vpfn = rb_entry(node, struct vfio_pfn, node);
> +
> +		if (pfn < vpfn->pfn)
> +			node = node->rb_left;
> +		else if (pfn > vpfn->pfn)
> +			node = node->rb_right;
> +		else {
> +			ret = vpfn;
> +			break;
> +		}
> +	}
> +
> +	mutex_unlock(&domain->pfn_list_lock);
> +	return ret;
> +}
> +
> +static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
> +{
> +	struct rb_node **link, *parent = NULL;
> +	struct vfio_pfn *vpfn;
> +
> +	mutex_lock(&domain->pfn_list_lock);
> +	link = &domain->pfn_list.rb_node;
> +	while (*link) {
> +		parent = *link;
> +		vpfn = rb_entry(parent, struct vfio_pfn, node);
> +
> +		if (new->pfn < vpfn->pfn)
> +			link = &(*link)->rb_left;
> +		else
> +			link = &(*link)->rb_right;
> +	}
> +
> +	rb_link_node(&new->node, parent, link);
> +	rb_insert_color(&new->node, &domain->pfn_list);
> +	mutex_unlock(&domain->pfn_list_lock);
> +}
> +
> +/* call by holding domain->pfn_list_lock */
> +static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
> +{
> +	rb_erase(&old->node, &domain->pfn_list);
> +}

Hmm, all the other pfn list interfaces lock themselves yet this one
requires a lock.  I think that should be called out by naming it
something like __vfio_unlink_pfn() rather than simply a comment.

> +
>  struct vwork {
>  	struct mm_struct	*mm;
>  	long			npage;
> @@ -228,20 +311,29 @@ static int put_pfn(unsigned long pfn, int prot)
>  	return 0;
>  }
>  
> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> +			 int prot, unsigned long *pfn)
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct mm_struct *local_mm = mm;
>  	int ret = -EFAULT;
>  
> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> +	if (!local_mm && !current->mm)
> +		return -ENODEV;
> +
> +	if (!local_mm)
> +		local_mm = current->mm;

The above would be much more concise if we just initialized local_mm
as: mm ? mm : current->mm

> +
> +	down_read(&local_mm->mmap_sem);
> +	if (get_user_pages_remote(NULL, local_mm, vaddr, 1,
> +				!!(prot & IOMMU_WRITE), 0, page, NULL) == 1) {

Um, the comment for get_user_pages_remote says:

"See also get_user_pages_fast, for performance critical applications."

So what penalty are we imposing on the existing behavior of type1
here?  Previously we only needed to acquire mmap_sem if
get_user_pages_fast() didn't work, so the existing use case seems to be
compromised.

>  		*pfn = page_to_pfn(page[0]);
> -		return 0;
> +		ret = 0;
> +		goto done_pfn;
>  	}
>  
> -	down_read(&current->mm->mmap_sem);
> -
> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
>  
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> @@ -249,7 +341,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>  			ret = 0;
>  	}
>  
> -	up_read(&current->mm->mmap_sem);
> +done_pfn:
> +	up_read(&local_mm->mmap_sem);
>  
>  	return ret;
>  }
> @@ -259,18 +352,19 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>   * first page and all consecutive pages with the same locking.
>   */
> -static long vfio_pin_pages(unsigned long vaddr, long npage,
> -			   int prot, unsigned long *pfn_base)
> +static long vfio_pin_pages_internal(struct vfio_domain *domain,
> +				    unsigned long vaddr, long npage,
> +				    int prot, unsigned long *pfn_base)
>  {
>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  	bool lock_cap = capable(CAP_IPC_LOCK);
>  	long ret, i;
>  	bool rsvd;
>  
> -	if (!current->mm)
> +	if (!domain)
>  		return -ENODEV;

This test doesn't make much sense to me.  The existing use case's error
is again being deferred, and at the call sites either domain can't be
NULL or this is an exported function where we should be validating the
parameters before calling this function.

>  
> -	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> +	ret = vaddr_get_pfn(domain->mm, vaddr, prot, pfn_base);
>  	if (ret)
>  		return ret;
>  
> @@ -293,7 +387,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
>  		unsigned long pfn = 0;
>  
> -		ret = vaddr_get_pfn(vaddr, prot, &pfn);
> +		ret = vaddr_get_pfn(domain->mm, vaddr, prot, &pfn);
>  		if (ret)
>  			break;
>  
> @@ -318,20 +412,165 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	return i;
>  }
>  
> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> -			     int prot, bool do_accounting)
> +static long vfio_unpin_pages_internal(struct vfio_domain *domain,
> +				      unsigned long pfn, long npage, int prot,
> +				      bool do_accounting)
>  {
>  	unsigned long unlocked = 0;
>  	long i;
>  
> +	if (!domain)
> +		return -ENODEV;
> +

Again, seems like validation of parameters should happen at the caller
in this case.

>  	for (i = 0; i < npage; i++)
>  		unlocked += put_pfn(pfn++, prot);
>  
>  	if (do_accounting)
>  		vfio_lock_acct(-unlocked);
> +	return unlocked;
> +}
> +
> +/*
> + * Pin a set of guest PFNs and return their associated host PFNs for API
> + * supported domain only.
> + * @vaddr [in]: array of guest PFNs

vfio is a userspace driver, never assume the userspace is a VM.  It's
also not really a vaddr since it's a frame number.  Please work on the
names.

> + * @npage [in]: count of array elements
> + * @prot [in] : protection flags
> + * @pfn_base[out] : array of host PFNs

phys_pfn maybe.

> + */
> +long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
> +		   int prot, dma_addr_t *pfn_base)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL;

Unnecessary initialization.

> +	int i = 0, ret = 0;

Same for these.

> +	long retpage;
> +	unsigned long remote_vaddr = 0;

And this.

> +	dma_addr_t *pfn = pfn_base;
> +	struct vfio_dma *dma;
> +
> +	if (!iommu || !vaddr || !pfn_base)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	if (!iommu->mediated_domain) {
> +		ret = -EINVAL;
> +		goto pin_done;
> +	}
> +
> +	domain = iommu->mediated_domain;

You're already validating domain here, which makes the test in
vfio_pin_pages_internal() really seem unnecessary.

> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p, *lpfn;
> +		unsigned long tpfn;
> +		dma_addr_t iova;
> +		long pg_cnt = 1;
> +
> +		iova = vaddr[i] << PAGE_SHIFT;
> +
> +		dma = vfio_find_dma(iommu, iova, 0 /*  size */);
> +		if (!dma) {
> +			ret = -EINVAL;
> +			goto pin_done;

All error paths need to unwind, if we return error there should be no
state change, otherwise we're leaking pages.
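The unwind pattern being asked for can be sketched as a minimal self-contained C fragment (pin_one/unpin_one are stand-in stubs, not the kernel helpers): if pinning page i fails, release pages [0, i) before returning, so the error leaves no state behind.

```c
#include <stddef.h>

enum { NPAGE = 8 };
static int pinned[NPAGE];

/* Fail pinning at page 5 to exercise the unwind path. */
static int pin_one(int i)
{
    if (i == 5)
        return -1;
    pinned[i] = 1;
    return 0;
}

static void unpin_one(int i) { pinned[i] = 0; }

/* Pin npage pages; on error, unwind everything already pinned. */
static int pin_all(int npage)
{
    int i, ret;

    for (i = 0; i < npage; i++) {
        ret = pin_one(i);
        if (ret)
            goto unwind;
    }
    return npage;

unwind:
    while (--i >= 0)
        unpin_one(i);
    return ret;
}
```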

> +		}
> +
> +		remote_vaddr = dma->vaddr + iova - dma->iova;
> +
> +		retpage = vfio_pin_pages_internal(domain, remote_vaddr,
> +						  pg_cnt, prot, &tpfn);
> +		if (retpage <= 0) {
> +			WARN_ON(!retpage);
> +			ret = (int)retpage;
> +			goto pin_done;

unwind

> +		}
> +
> +		pfn[i] = tpfn;
> +
> +		/* search if pfn exists */
> +		p = vfio_find_pfn(domain, tpfn);
> +		if (p) {
> +			atomic_inc(&p->ref_count);
> +			continue;
> +		}
> +
> +		/* add to pfn_list */
> +		lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
> +		if (!lpfn) {
> +			ret = -ENOMEM;
> +			goto pin_done;

unwind

> +		}
> +		lpfn->vaddr = remote_vaddr;
> +		lpfn->iova = iova;
> +		lpfn->pfn = pfn[i];
> +		lpfn->npage = 1;

Why do we need this variable if this can only track 1 page?

> +		lpfn->prot = prot;
> +		atomic_inc(&lpfn->ref_count);

atomic_set(), we want to set the ref count to 1, not increment.
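The distinction can be shown with a small C11 sketch (user-space atomics standing in for the kernel's atomic_t API, and pfn_entry is a stand-in struct): initializing a fresh refcount with a store makes the "exactly one reference" intent explicit instead of relying on the allocator having zeroed the field.

```c
#include <stdatomic.h>

/* A fresh (zeroed) entry should have its refcount *set* to 1, making
 * the "exactly one reference" intent explicit rather than relying on
 * kzalloc() having zeroed the field before an atomic_inc(). */
struct pfn_entry { atomic_int ref_count; };

static void pfn_entry_init(struct pfn_entry *p)
{
    atomic_store(&p->ref_count, 1);  /* kernel: atomic_set(&ref_count, 1) */
}

/* Taking another reference is the only place increment belongs. */
static int pfn_entry_get(struct pfn_entry *p)
{
    return atomic_fetch_add(&p->ref_count, 1) + 1;
}
```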

> +		vfio_link_pfn(domain, lpfn);
> +	}
> +
> +	ret = i;
> +
> +pin_done:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(vfio_pin_pages);
> +
> +static int vfio_unpin_pfn(struct vfio_domain *domain,
> +			  struct vfio_pfn *vpfn, bool do_accounting)
> +{
> +	int ret;
> +
> +	ret = vfio_unpin_pages_internal(domain, vpfn->pfn, vpfn->npage,
> +					vpfn->prot, do_accounting);
> +
> +	if (ret > 0 && atomic_dec_and_test(&vpfn->ref_count)) {
> +		vfio_unlink_pfn(domain, vpfn);
> +		kfree(vpfn);
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Unpin set of host PFNs for API supported domain only.
> + * @pfn	[in] : array of host PFNs to be unpinned.
> + * @npage [in] :count of elements in array, that is number of pages.
> + * @prot [in] : protection flags

prot is unused; it's also saved in our pfn list if we did need it.  In
fact, should we compare prot after our vfio_find_pfn above to make sure
our existing pinned page has the right settings?

> + */
> +long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
> +		     int prot)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL;
> +	long unlocked = 0;
> +	int i;
> +
> +	if (!iommu || !pfn)
> +		return -EINVAL;
> +
> +	if (!iommu->mediated_domain)
> +		return -EINVAL;
> +
> +	domain = iommu->mediated_domain;

Again, domain is already validated here.

> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +
> +		/* verify if pfn exists in pfn_list */
> +		p = vfio_find_pfn(domain, *(pfn + i));

Why are we using array indexing above and array math here?  Were these
functions written by different people?

> +		if (!p)
> +			continue;

Hmm, this seems like more of a bad thing than a continue.

> +
> +		mutex_lock(&domain->pfn_list_lock);
> +		unlocked += vfio_unpin_pfn(domain, p, true);
> +		mutex_unlock(&domain->pfn_list_lock);

[huge red flag] the entire vfio_unpin_pfn path is called under
pfn_list_lock, but vfio_pin_pages only uses it sparsely.  Maybe someone
else did write this function.  I'll assume all locking here needs a
revisit.

> +	}
>  
>  	return unlocked;
>  }
> +EXPORT_SYMBOL(vfio_unpin_pages);
>  
>  static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  {
> @@ -341,6 +580,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  
>  	if (!dma->size)
>  		return;
> +
> +	if (list_empty(&iommu->domain_list))
> +		return;

Huh?  This would be a serious consistency error if this happened for
the existing use case.

> +
>  	/*
>  	 * We use the IOMMU to track the physical addresses, otherwise we'd
>  	 * need a much more complicated tracking system.  Unfortunately that
> @@ -382,9 +625,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  		if (WARN_ON(!unmapped))
>  			break;
>  
> -		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
> -					     unmapped >> PAGE_SHIFT,
> -					     dma->prot, false);
> +		unlocked += vfio_unpin_pages_internal(domain,
> +						phys >> PAGE_SHIFT,
> +						unmapped >> PAGE_SHIFT,
> +						dma->prot, false);
>  		iova += unmapped;
>  
>  		cond_resched();
> @@ -517,6 +761,9 @@ static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
>  	long i;
>  	int ret;
>  
> +	if (domain->mediated_device)
> +		return -EINVAL;


Which is it going to be, mediated_device to flag these special domains
or an empty domain_list?  Let's not use both.

> +
>  	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
>  		ret = iommu_map(domain->domain, iova,
>  				(phys_addr_t)pfn << PAGE_SHIFT,
> @@ -537,6 +784,9 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
>  	struct vfio_domain *d;
>  	int ret;
>  
> +	if (list_empty(&iommu->domain_list))
> +		return 0;
> +
>  	list_for_each_entry(d, &iommu->domain_list, next) {
>  		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
>  				npage << PAGE_SHIFT, prot | d->prot);
> @@ -569,6 +819,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	uint64_t mask;
>  	struct vfio_dma *dma;
>  	unsigned long pfn;
> +	struct vfio_domain *domain = NULL;
>  
>  	/* Verify that none of our __u64 fields overflow */
>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> @@ -611,10 +862,21 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	/* Insert zero-sized and grow as we map chunks of it */
>  	vfio_link_dma(iommu, dma);
>  
> +	/*
> +	 * Skip pin and map if the domain list is empty
> +	 */
> +	if (list_empty(&iommu->domain_list)) {
> +		dma->size = size;
> +		goto map_done;
> +	}

Again, this would be a serious consistency error for the existing use
case.  Let's use indicators that are explicit.

> +
> +	domain = list_first_entry(&iommu->domain_list,
> +				  struct vfio_domain, next);
> +
>  	while (size) {
>  		/* Pin a contiguous chunk of memory */
> -		npage = vfio_pin_pages(vaddr + dma->size,
> -				       size >> PAGE_SHIFT, prot, &pfn);
> +		npage = vfio_pin_pages_internal(domain, vaddr + dma->size,
> +						size >> PAGE_SHIFT, prot, &pfn);
>  		if (npage <= 0) {
>  			WARN_ON(!npage);
>  			ret = (int)npage;
> @@ -624,7 +886,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  		/* Map it! */
>  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
>  		if (ret) {
> -			vfio_unpin_pages(pfn, npage, prot, true);
> +			vfio_unpin_pages_internal(domain, pfn, npage,
> +						  prot, true);
>  			break;
>  		}
>  
> @@ -635,6 +898,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	if (ret)
>  		vfio_remove_dma(iommu, dma);
>  
> +map_done:
>  	mutex_unlock(&iommu->lock);
>  	return ret;
>  }
> @@ -658,6 +922,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  	struct rb_node *n;
>  	int ret;
>  
> +	if (domain->mediated_device)
> +		return 0;

Though "mediated_device" is the user, not really the property of the
domain we're trying to support, which is more like track_and_pin_only.

> +
>  	/* Arbitrarily pick the first domain in the list for lookups */
>  	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
>  	n = rb_first(&iommu->dma_list);
> @@ -716,6 +983,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>  	struct page *pages;
>  	int ret, order = get_order(PAGE_SIZE * 2);
>  
> +	if (domain->mediated_device)
> +		return;
> +
>  	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
>  	if (!pages)
>  		return;
> @@ -734,11 +1004,25 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>  	__free_pages(pages, order);
>  }
>  
> +static struct vfio_group *is_iommu_group_present(struct vfio_domain *domain,
> +				   struct iommu_group *iommu_group)

is_foo is a yes/no answer, the return should be bool.  This is more of
a find_foo_from_bar

> +{
> +	struct vfio_group *g;
> +
> +	list_for_each_entry(g, &domain->group_list, next) {
> +		if (g->iommu_group != iommu_group)
> +			continue;
> +		return g;

Hmmm

if (g->iommu_group == iommu_group)
	return g;

> +	}
> +
> +	return NULL;
> +}
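A portable sketch of the suggested shape (plain C with a stand-in struct, not the kernel's types): a find-style helper returns the matching entry or NULL, which both fixes the is_*/bool naming mismatch and drops the inverted continue inside the loop.

```c
#include <stddef.h>

/* Stand-in type; only the shape of the helper matters here. */
struct group { int id; struct group *next; };

/* A lookup helper named find_*, returning the match (or NULL), instead
 * of an is_* name with a bool-looking contract. */
static struct group *find_group(struct group *head, int id)
{
    struct group *g;

    for (g = head; g; g = g->next)
        if (g->id == id)
            return g;

    return NULL;
}
```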
> +
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					 struct iommu_group *iommu_group)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
> -	struct vfio_group *group, *g;
> +	struct vfio_group *group;
>  	struct vfio_domain *domain, *d;
>  	struct bus_type *bus = NULL;
>  	int ret;
> @@ -746,14 +1030,21 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	mutex_lock(&iommu->lock);
>  
>  	list_for_each_entry(d, &iommu->domain_list, next) {
> -		list_for_each_entry(g, &d->group_list, next) {
> -			if (g->iommu_group != iommu_group)
> -				continue;
> +		if (is_iommu_group_present(d, iommu_group)) {
> +			mutex_unlock(&iommu->lock);
> +			return -EINVAL;
> +		}
> +	}
>  
> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> +	if (iommu->mediated_domain) {
> +		if (is_iommu_group_present(iommu->mediated_domain,
> +					   iommu_group)) {
>  			mutex_unlock(&iommu->lock);
>  			return -EINVAL;
>  		}
>  	}
> +#endif
>  
>  	group = kzalloc(sizeof(*group), GFP_KERNEL);
>  	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> @@ -769,6 +1060,36 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_free;
>  
> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> +	if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
> +		struct mdev_device *mdev = NULL;

Unnecessary initialization.

> +
> +		mdev = mdev_get_device_by_group(iommu_group);
> +		if (!mdev)
> +			goto out_free;
> +
> +		mdev->iommu_data = iommu;

This looks rather sketchy to me, we don't have a mediated driver in
this series, but presumably the driver blindly calls vfio_pin_pages
passing mdev->iommu_data and hoping that it's either NULL to generate
an error or relevant to this iommu backend.  How would we add a second
mediated driver iommu backend?  We're currently assuming the user
configured this backend.  Should vfio_pin_pages instead have a struct
device* parameter from which we would lookup the iommu_group and get to
the vfio_domain?  That's a bit heavy weight, but we need something
along those lines.


> +		group->mdev = mdev;
> +
> +		if (iommu->mediated_domain) {
> +			list_add(&group->next,
> +				 &iommu->mediated_domain->group_list);
> +			kfree(domain);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +		}
> +		domain->mediated_device = true;
> +		domain->mm = current->mm;
> +		INIT_LIST_HEAD(&domain->group_list);
> +		list_add(&group->next, &domain->group_list);
> +		domain->pfn_list = RB_ROOT;
> +		mutex_init(&domain->pfn_list_lock);
> +		iommu->mediated_domain = domain;
> +		mutex_unlock(&iommu->lock);
> +		return 0;
> +	}
> +#endif
> +
>  	domain->domain = iommu_domain_alloc(bus);
>  	if (!domain->domain) {
>  		ret = -EIO;
> @@ -859,6 +1180,20 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
>  		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
>  }
>  
> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> +static void vfio_iommu_unpin_api_domain(struct vfio_domain *domain)
> +{
> +	struct rb_node *node;
> +
> +	mutex_lock(&domain->pfn_list_lock);
> +	while ((node = rb_first(&domain->pfn_list))) {
> +		vfio_unpin_pfn(domain,
> +				rb_entry(node, struct vfio_pfn, node), false);
> +	}
> +	mutex_unlock(&domain->pfn_list_lock);
> +}
> +#endif
> +
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
>  					  struct iommu_group *iommu_group)
>  {
> @@ -868,31 +1203,55 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  
>  	mutex_lock(&iommu->lock);
>  
> -	list_for_each_entry(domain, &iommu->domain_list, next) {
> -		list_for_each_entry(group, &domain->group_list, next) {
> -			if (group->iommu_group != iommu_group)
> -				continue;
> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> +	if (iommu->mediated_domain) {
> +		domain = iommu->mediated_domain;
> +		group = is_iommu_group_present(domain, iommu_group);
> +		if (group) {
> +			if (group->mdev) {
> +				group->mdev->iommu_data = NULL;
> +				mdev_put_device(group->mdev);
> +			}
> +			list_del(&group->next);
> +			kfree(group);
> +
> +			if (list_empty(&domain->group_list)) {
> +				vfio_iommu_unpin_api_domain(domain);
> +
> +				if (list_empty(&iommu->domain_list))
> +					vfio_iommu_unmap_unpin_all(iommu);
> +
> +				kfree(domain);
> +				iommu->mediated_domain = NULL;
> +			}
> +		}
> +	}
> +#endif
>  
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		group = is_iommu_group_present(domain, iommu_group);
> +		if (group) {
>  			iommu_detach_group(domain->domain, iommu_group);
>  			list_del(&group->next);
>  			kfree(group);
>  			/*
>  			 * Group ownership provides privilege, if the group
>  			 * list is empty, the domain goes away.  If it's the
> -			 * last domain, then all the mappings go away too.
> +			 * last domain with an iommu and the API-only domain doesn't
> +			 * exist, then all the mappings go away too.
>  			 */
>  			if (list_empty(&domain->group_list)) {
> -				if (list_is_singular(&iommu->domain_list))
> +				if (list_is_singular(&iommu->domain_list) &&
> +				    !iommu->mediated_domain)
>  					vfio_iommu_unmap_unpin_all(iommu);
>  				iommu_domain_free(domain->domain);
>  				list_del(&domain->next);
>  				kfree(domain);
>  			}
> -			goto done;
> +			break;
>  		}
>  	}
>  
> -done:
>  	mutex_unlock(&iommu->lock);
>  }
>  
> @@ -930,8 +1289,28 @@ static void vfio_iommu_type1_release(void *iommu_data)
>  	struct vfio_domain *domain, *domain_tmp;
>  	struct vfio_group *group, *group_tmp;
>  
> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> +	if (iommu->mediated_domain) {
> +		domain = iommu->mediated_domain;
> +		list_for_each_entry_safe(group, group_tmp,
> +					 &domain->group_list, next) {
> +			if (group->mdev) {
> +				group->mdev->iommu_data = NULL;
> +				mdev_put_device(group->mdev);
> +			}
> +			list_del(&group->next);
> +			kfree(group);
> +		}
> +		vfio_iommu_unpin_api_domain(domain);
> +		kfree(domain);
> +		iommu->mediated_domain = NULL;
> +	}
> +#endif

I'm not really seeing how this is all that much more maintainable than
what was proposed previously.  Has this aspect been worked on since I
last reviewed this patch?

>  	vfio_iommu_unmap_unpin_all(iommu);
>  
> +	if (list_empty(&iommu->domain_list))
> +		goto release_exit;
> +
>  	list_for_each_entry_safe(domain, domain_tmp,
>  				 &iommu->domain_list, next) {
>  		list_for_each_entry_safe(group, group_tmp,
> @@ -945,6 +1324,7 @@ static void vfio_iommu_type1_release(void *iommu_data)
>  		kfree(domain);
>  	}
>  
> +release_exit:
>  	kfree(iommu);
>  }
>  
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 431b824b0d3e..0a907bb33426 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -134,6 +134,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>  }
>  #endif /* CONFIG_EEH */
>  
> +extern long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
> +			   int prot, dma_addr_t *pfn_base);
> +
> +extern long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
> +			     int prot);
> +
>  /*
>   * IRQfd - generic
>   */


* Re: [PATCH 1/3] Mediated device Core driver
  2016-06-21 21:30     ` [Qemu-devel] " Alex Williamson
@ 2016-06-24 17:54       ` Kirti Wankhede
  -1 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-24 17:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

Alex,

Thanks for taking a closer look. I'll incorporate all the nits you suggested.

On 6/22/2016 3:00 AM, Alex Williamson wrote:
> On Mon, 20 Jun 2016 22:01:46 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>
...
>> +
>> +config MDEV
>> +    tristate "Mediated device driver framework"
>> +    depends on VFIO
>> +    default n
>> +    help
>> +        MDEV provides a framework to virtualize device without SR-IOV cap
>> +        See Documentation/mdev.txt for more details.
> 
> Documentation pointer still doesn't exist.  Perhaps this file would be
> a more appropriate place than the commit log for some of the
> information above.
> 

Sure, I'll add these details to documentation.

> Every time I review this I'm struggling to figure out why this isn't
> VFIO_MDEV since it's really tied to vfio and difficult to evaluate it
> as some sort of standalone mediated device interface.  I don't know
> the answer, but it always strikes me as a discontinuity.
> 

Ok. I'll change to VFIO_MDEV

>> +
>> +static struct mdev_device *find_mdev_device(struct parent_device *parent,
>> +					    uuid_le uuid, int instance)
>> +{
>> +	struct mdev_device *mdev = NULL, *p;
>> +
>> +	list_for_each_entry(p, &parent->mdev_list, next) {
>> +		if ((uuid_le_cmp(p->uuid, uuid) == 0) &&
>> +		    (p->instance == instance)) {
>> +			mdev = p;
> 
> Locking here is still broken, the callers are create and destroy, which
> can still race each other and themselves.
>

Fixed it.

>> +
>> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
>> +{
>> +	struct parent_device *parent = mdev->parent;
>> +	int ret;
>> +
>> +	mutex_lock(&parent->ops_lock);
>> +	if (parent->ops->create) {
> 
> How would a parent_device without ops->create or ops->destroy useful?
> Perhaps mdev_register_driver() should enforce required ops.  mdev.h
> should at least document which ops are optional if they really are
> optional.

Makes sense; I'll add a check in mdev_register_driver() to mandate
create and destroy in ops. I'll also update the comments in mdev.h to
say which ops are mandatory and which are optional.
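Such a registration-time check could look like the sketch below (the struct layout and error value here are stand-ins for illustration, not the series' actual definitions): reject a parent whose ops table lacks the mandatory callbacks, so call sites no longer need to test for them.

```c
#include <stddef.h>

struct parent_ops {
    int (*create)(void);    /* mandatory */
    int (*destroy)(void);   /* mandatory */
    int (*start)(void);     /* optional */
};

/* Validate mandatory ops once, at registration, instead of checking
 * for them at every call site. */
static int mdev_register_check(const struct parent_ops *ops)
{
    if (!ops || !ops->create || !ops->destroy)
        return -22; /* -EINVAL */
    return 0;
}

/* trivial callback used in the usage example */
static int ok_cb(void) { return 0; }
```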

> 
>> +		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
>> +					mdev->instance, mdev_params);
>> +		if (ret)
>> +			goto create_ops_err;
>> +	}
>> +
>> +	ret = mdev_add_attribute_group(&mdev->dev,
>> +					parent->ops->mdev_attr_groups);
> 
> An error here seems to put us in a bad place, the device is created but
> the attributes are broken, is it the caller's responsibility to
> destroy?  Seems like we need a cleanup if this fails.
> 

Right, adding cleanup here.

>> +create_ops_err:
>> +	mutex_unlock(&parent->ops_lock);
> 
> It seems like ops_lock isn't used so much as a lock as a serialization
> mechanism.  Why?  Where is this serialization per parent device
> documented?
>

parent->ops_lock serializes the parent device callbacks to the vendor
driver, i.e. supported_config(), create() and destroy().
mdev->ops_lock serializes the mediated-device-related callbacks to the
vendor driver, i.e. start(), stop(), read(), write(), set_irqs(),
get_region_info() and validate_map_request().
It's not documented yet; I'll add comments to mdev.h about these locks.


>> +	return ret;
>> +}
>> +
>> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
>> +{
>> +	struct parent_device *parent = mdev->parent;
>> +	int ret = 0;
>> +
>> +	/*
>> +	 * If vendor driver doesn't return success that means vendor
>> +	 * driver doesn't support hot-unplug
>> +	 */
>> +	mutex_lock(&parent->ops_lock);
>> +	if (parent->ops->destroy) {
>> +		ret = parent->ops->destroy(parent->dev, mdev->uuid,
>> +					   mdev->instance);
>> +		if (ret && !force) {
> 
> It seems this is not so much a 'force' but an ignore errors, we never
> actually force the mdev driver to destroy the device... which makes me
> wonder if there are leaks there.
> 

Consider a case where a VM is running, or is in the teardown path, and
the parent device is unbound from the vendor driver; the vendor driver
would then call mdev_unregister_device() from its remove() callback.
Even if parent->ops->destroy() returns an error (which could also mean
hot-unplug is not supported), we still have to destroy the mdev device,
since remove() doesn't honor a returned error. In that case it's a
forced removal.

>> +
>> +/*
>> + * mdev_unregister_device : Unregister a parent device
>> + * @dev: device structure representing parent device.
>> + *
>> + * Remove device from list of registered parent devices. Give a chance to free
>> + * existing mediated devices for given device.
>> + */
>> +
>> +void mdev_unregister_device(struct device *dev)
>> +{
>> +	struct parent_device *parent;
>> +	struct mdev_device *mdev, *n;
>> +	int ret;
>> +
>> +	mutex_lock(&parent_devices.list_lock);
>> +	parent = find_parent_device(dev);
>> +
>> +	if (!parent) {
>> +		mutex_unlock(&parent_devices.list_lock);
>> +		return;
>> +	}
>> +	dev_info(dev, "MDEV: Unregistering\n");
>> +
>> +	/*
>> +	 * Remove parent from the list and remove create and destroy sysfs
>> +	 * files so that no new mediated device could be created for this parent
>> +	 */
>> +	list_del(&parent->next);
>> +	mdev_remove_sysfs_files(dev);
>> +	mutex_unlock(&parent_devices.list_lock);
>> +
>> +	mutex_lock(&parent->ops_lock);
>> +	mdev_remove_attribute_group(dev,
>> +				    parent->ops->dev_attr_groups);
>> +	mutex_unlock(&parent->ops_lock);
>> +
>> +	mutex_lock(&parent->mdev_list_lock);
>> +	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
>> +		mdev_device_destroy_ops(mdev, true);
>> +		list_del(&mdev->next);
>> +		mdev_put_device(mdev);
>> +	}
>> +	mutex_unlock(&parent->mdev_list_lock);
>> +
>> +	do {
>> +		ret = wait_event_interruptible_timeout(parent->release_done,
>> +				list_empty(&parent->mdev_list), HZ * 10);
> 
> But we do a list_del for each mdev in mdev_list above, how could the
> list not be empty here?  I think you're trying to wait for all the mdev
> devices to be released, but I don't think this does that.  Isn't the
> list empty regardless?
>

Right, I do want to wait for all the mdev devices to be released. I'm
moving list_del(&mdev->next) from the for loop above into
mdev_release_device() so that the mdev is removed from the list on the
last mdev_put_device().


>> +		if (ret == -ERESTARTSYS) {
>> +			dev_warn(dev, "Mediated devices are in use, task"
>> +				      " \"%s\" (%d) "
>> +				      "blocked until all are released",
>> +				      current->comm, task_pid_nr(current));
>> +		}
>> +	} while (ret <= 0);
>> +
>> +	mdev_put_parent(parent);
>> +}
>> +EXPORT_SYMBOL(mdev_unregister_device);
>> +
>> +


>> +int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
>> +		       char *mdev_params)
>> +{
>> +	int ret;
>> +	struct mdev_device *mdev;
>> +	struct parent_device *parent;
>> +
>> +	parent = mdev_get_parent_by_dev(dev);
>> +	if (!parent)
>> +		return -EINVAL;
>> +
>> +	/* Check for duplicate */
>> +	mdev = find_mdev_device(parent, uuid, instance);
> 
> But this doesn't actually prevent duplicates because we we're not
> holding any lock the guarantee that another racing process doesn't
> create the same {uuid,instance} between where we check and the below
> list_add.
> 

Oops, I missed this race condition. I'm moving
mutex_lock(&parent->mdev_list_lock) to before find_mdev_device() in
mdev_device_create() and mdev_device_destroy().
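That fix can be sketched in user space as follows (a pthread mutex stands in for the kernel mutex, the struct is a stand-in, and -17 hard-codes -EEXIST for illustration): the duplicate lookup and the insertion share one critical section, so two racing creates of the same instance can't both pass the check.

```c
#include <pthread.h>
#include <stddef.h>

/* Stand-in for struct mdev_device; instance doubles as its identity. */
struct mdev { int instance; struct mdev *next; };

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct mdev *mdev_list;

static struct mdev *find_locked(int instance)
{
    struct mdev *m;

    for (m = mdev_list; m; m = m->next)
        if (m->instance == instance)
            return m;
    return NULL;
}

/* Take the list lock BEFORE the duplicate lookup, so check-and-insert
 * is a single critical section. */
static int mdev_create(struct mdev *m, int instance)
{
    int ret = 0;

    pthread_mutex_lock(&list_lock);
    if (find_locked(instance)) {
        ret = -17;                    /* -EEXIST */
        goto out;
    }
    m->instance = instance;
    m->next = mdev_list;
    mdev_list = m;
out:
    pthread_mutex_unlock(&list_lock);
    return ret;
}
```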


>> +
>> +int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
>> +			char *mdev_params);
>> +int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
>> +void mdev_device_supported_config(struct device *dev, char *str);
>> +int  mdev_device_start(uuid_le uuid);
>> +int  mdev_device_shutdown(uuid_le uuid);
> 
> nit, stop is start as startup is to shutdown.  IOW, should this be
> mdev_device_stop()?
>

Ok. Renaming mdev_device_shutdown() to mdev_device_stop().


>> +
>> +struct pci_region_info {
>> +	uint64_t start;
>> +	uint64_t size;
>> +	uint32_t flags;		/* VFIO region info flags */
>> +};
>> +
>> +enum mdev_emul_space {
>> +	EMUL_CONFIG_SPACE,	/* PCI configuration space */
>> +	EMUL_IO,		/* I/O register space */
>> +	EMUL_MMIO		/* Memory-mapped I/O space */
>> +};
> 
> 
> I'm still confused why this is needed, perhaps a description here would
> be useful so I can stop asking.  Clearly config space is PCI only, so
> it's strange to have it in the common code.  Everyone not on x86 will
> say I/O space is also strange.  I can't keep it in my head why the
> read/write offsets aren't sufficient for the driver to figure out what
> type it is.
> 
>

Now that the VFIO_PCI_OFFSET_* macros are moved to vfio.h, where the
vendor driver can also use them, the above enum could be removed from
read/write. But again, these macros are only useful when the parent
device is a PCI device. How would a non-PCI parent device differentiate
I/O ports and MMIO?


>> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
>> +				 struct pci_region_info *region_info);
> 
> This can't be //pci_//region_info.  How do you intend to support things
> like sparse mmap capabilities in the user REGION_INFO ioctl when such
> things are not part of the mediated device API?  Seems like the driver
> should just return a buffer.
>

If not pci_region_info, can we use vfio_region_info here, even to fetch
sparse mmap capabilities from the vendor driver?

Thanks,
Kirti.




* Re: [Qemu-devel] [PATCH 1/3] Mediated device Core driver
@ 2016-06-24 17:54       ` Kirti Wankhede
  0 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-24 17:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

Alex,

Thanks for taking closer look. I'll incorporate all the nits you suggested.

On 6/22/2016 3:00 AM, Alex Williamson wrote:
> On Mon, 20 Jun 2016 22:01:46 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>
...
>> +
>> +config MDEV
>> +    tristate "Mediated device driver framework"
>> +    depends on VFIO
>> +    default n
>> +    help
>> +        MDEV provides a framework to virtualize device without SR-IOV cap
>> +        See Documentation/mdev.txt for more details.
> 
> Documentation pointer still doesn't exist.  Perhaps this file would be
> a more appropriate place than the commit log for some of the
> information above.
> 

Sure, I'll add these details to documentation.

> Every time I review this I'm struggling to figure out why this isn't
> VFIO_MDEV since it's really tied to vfio and difficult to evaluate it
> as some sort of standalone mediated device interface.  I don't know
> the answer, but it always strikes me as a discontinuity.
> 

Ok. I'll change to VFIO_MDEV

>> +
>> +static struct mdev_device *find_mdev_device(struct parent_device *parent,
>> +					    uuid_le uuid, int instance)
>> +{
>> +	struct mdev_device *mdev = NULL, *p;
>> +
>> +	list_for_each_entry(p, &parent->mdev_list, next) {
>> +		if ((uuid_le_cmp(p->uuid, uuid) == 0) &&
>> +		    (p->instance == instance)) {
>> +			mdev = p;
> 
> Locking here is still broken, the callers are create and destroy, which
> can still race each other and themselves.
>

Fixed it.

>> +
>> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
>> +{
>> +	struct parent_device *parent = mdev->parent;
>> +	int ret;
>> +
>> +	mutex_lock(&parent->ops_lock);
>> +	if (parent->ops->create) {
> 
> How would a parent_device without ops->create or ops->destroy useful?
> Perhaps mdev_register_driver() should enforce required ops.  mdev.h
> should at least document which ops are optional if they really are
> optional.

Makes sense, adding check in mdev_register_driver() to mandate create
and destroy in ops. I'll also update the comments in mdev.h for
mandatory and optional ops.

> 
>> +		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
>> +					mdev->instance, mdev_params);
>> +		if (ret)
>> +			goto create_ops_err;
>> +	}
>> +
>> +	ret = mdev_add_attribute_group(&mdev->dev,
>> +					parent->ops->mdev_attr_groups);
> 
> An error here seems to put us in a bad place, the device is created but
> the attributes are broken, is it the caller's responsibility to
> destroy?  Seems like we need a cleanup if this fails.
> 

Right, adding cleanup here.

>> +create_ops_err:
>> +	mutex_unlock(&parent->ops_lock);
> 
> It seems like ops_lock isn't used so much as a lock as a serialization
> mechanism.  Why?  Where is this serialization per parent device
> documented?
>

parent->ops_lock is to serialize parent device callbacks to vendor
driver, i.e supported_config(), create() and destroy().
mdev->ops_lock is to serialize mediated device related callbacks to
vendor driver, i.e. start(), stop(), read(), write(), set_irqs(),
get_region_info(), validate_map_request().
Its not documented, I'll add comments to mdev.h about these locks.


>> +	return ret;
>> +}
>> +
>> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
>> +{
>> +	struct parent_device *parent = mdev->parent;
>> +	int ret = 0;
>> +
>> +	/*
>> +	 * If vendor driver doesn't return success that means vendor
>> +	 * driver doesn't support hot-unplug
>> +	 */
>> +	mutex_lock(&parent->ops_lock);
>> +	if (parent->ops->destroy) {
>> +		ret = parent->ops->destroy(parent->dev, mdev->uuid,
>> +					   mdev->instance);
>> +		if (ret && !force) {
> 
> It seems this is not so much a 'force' but an ignore errors, we never
> actually force the mdev driver to destroy the device... which makes me
> wonder if there are leaks there.
> 

Consider a case where a VM is running or in the teardown path and the
parent device is unbound from the vendor driver; then the vendor driver
would call mdev_unregister_device() from its remove() callback. Even if
parent->ops->destroy() returns an error (which could also mean that
hot-unplug is not supported), we still have to destroy the mdev device,
since the remove() callback doesn't honor a returned error. In that
case it's a forced removal.

>> +
>> +/*
>> + * mdev_unregister_device : Unregister a parent device
>> + * @dev: device structure representing parent device.
>> + *
>> + * Remove device from list of registered parent devices. Give a chance to free
>> + * existing mediated devices for given device.
>> + */
>> +
>> +void mdev_unregister_device(struct device *dev)
>> +{
>> +	struct parent_device *parent;
>> +	struct mdev_device *mdev, *n;
>> +	int ret;
>> +
>> +	mutex_lock(&parent_devices.list_lock);
>> +	parent = find_parent_device(dev);
>> +
>> +	if (!parent) {
>> +		mutex_unlock(&parent_devices.list_lock);
>> +		return;
>> +	}
>> +	dev_info(dev, "MDEV: Unregistering\n");
>> +
>> +	/*
>> +	 * Remove parent from the list and remove create and destroy sysfs
>> +	 * files so that no new mediated device could be created for this parent
>> +	 */
>> +	list_del(&parent->next);
>> +	mdev_remove_sysfs_files(dev);
>> +	mutex_unlock(&parent_devices.list_lock);
>> +
>> +	mutex_lock(&parent->ops_lock);
>> +	mdev_remove_attribute_group(dev,
>> +				    parent->ops->dev_attr_groups);
>> +	mutex_unlock(&parent->ops_lock);
>> +
>> +	mutex_lock(&parent->mdev_list_lock);
>> +	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
>> +		mdev_device_destroy_ops(mdev, true);
>> +		list_del(&mdev->next);
>> +		mdev_put_device(mdev);
>> +	}
>> +	mutex_unlock(&parent->mdev_list_lock);
>> +
>> +	do {
>> +		ret = wait_event_interruptible_timeout(parent->release_done,
>> +				list_empty(&parent->mdev_list), HZ * 10);
> 
> But we do a list_del for each mdev in mdev_list above, how could the
> list not be empty here?  I think you're trying to wait for all the mdev
> devices to be released, but I don't think this does that.  Isn't the
> list empty regardless?
>

Right, I do want to wait for all the mdev devices to be released. I'm
moving list_del(&mdev->next) from the above for loop into
mdev_release_device(), so that the mdev is removed from the list on the
last mdev_put_device().


>> +		if (ret == -ERESTARTSYS) {
>> +			dev_warn(dev, "Mediated devices are in use, task"
>> +				      " \"%s\" (%d) "
>> +				      "blocked until all are released",
>> +				      current->comm, task_pid_nr(current));
>> +		}
>> +	} while (ret <= 0);
>> +
>> +	mdev_put_parent(parent);
>> +}
>> +EXPORT_SYMBOL(mdev_unregister_device);
>> +
>> +


>> +int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
>> +		       char *mdev_params)
>> +{
>> +	int ret;
>> +	struct mdev_device *mdev;
>> +	struct parent_device *parent;
>> +
>> +	parent = mdev_get_parent_by_dev(dev);
>> +	if (!parent)
>> +		return -EINVAL;
>> +
>> +	/* Check for duplicate */
>> +	mdev = find_mdev_device(parent, uuid, instance);
> 
> But this doesn't actually prevent duplicates because we're not
> holding any lock to guarantee that another racing process doesn't
> create the same {uuid,instance} between where we check and the below
> list_add.
> 

Oops, I missed this race condition. I'm moving
mutex_lock(&parent->mdev_list_lock) before find_mdev_device() in
mdev_device_create() and mdev_device_destroy(), so the duplicate check
and the list insertion happen under the same lock.


>> +
>> +int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
>> +			char *mdev_params);
>> +int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
>> +void mdev_device_supported_config(struct device *dev, char *str);
>> +int  mdev_device_start(uuid_le uuid);
>> +int  mdev_device_shutdown(uuid_le uuid);
> 
> nit, stop is to start as shutdown is to startup.  IOW, should this be
> mdev_device_stop()?
>

Ok. Renaming mdev_device_shutdown() to mdev_device_stop().


>> +
>> +struct pci_region_info {
>> +	uint64_t start;
>> +	uint64_t size;
>> +	uint32_t flags;		/* VFIO region info flags */
>> +};
>> +
>> +enum mdev_emul_space {
>> +	EMUL_CONFIG_SPACE,	/* PCI configuration space */
>> +	EMUL_IO,		/* I/O register space */
>> +	EMUL_MMIO		/* Memory-mapped I/O space */
>> +};
> 
> 
> I'm still confused why this is needed, perhaps a description here would
> be useful so I can stop asking.  Clearly config space is PCI only, so
> it's strange to have it in the common code.  Everyone not on x86 will
> say I/O space is also strange.  I can't keep it in my head why the
> read/write offsets aren't sufficient for the driver to figure out what
> type it is.
> 
>

Now that the VFIO_PCI_OFFSET_* macros are moved to vfio.h, which the
vendor driver can also use, the above enum could be removed from
read/write. But again, these macros are only useful when the parent
device is a PCI device. How would a non-PCI parent device differentiate
I/O ports from MMIO?


>> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
>> +				 struct pci_region_info *region_info);
> 
> This can't be //pci_//region_info.  How do you intend to support things
> like sparse mmap capabilities in the user REGION_INFO ioctl when such
> things are not part of the mediated device API?  Seems like the driver
> should just return a buffer.
>

If not pci_region_info, can we use vfio_region_info here, even to fetch
sparse mmap capabilities from the vendor driver?

Thanks,
Kirti.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/3] VFIO driver for mediated PCI device
  2016-06-21 22:48     ` [Qemu-devel] " Alex Williamson
@ 2016-06-24 18:34       ` Kirti Wankhede
  -1 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-24 18:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

Thanks Alex.


On 6/22/2016 4:18 AM, Alex Williamson wrote:
> On Mon, 20 Jun 2016 22:01:47 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> +
>> +static int get_mdev_region_info(struct mdev_device *mdev,
>> +				struct pci_region_info *vfio_region_info,
>> +				int index)
>> +{
>> +	int ret = -EINVAL;
>> +	struct parent_device *parent = mdev->parent;
>> +
>> +	if (parent && dev_is_pci(parent->dev) && parent->ops->get_region_info) {
>> +		mutex_lock(&mdev->ops_lock);
>> +		ret = parent->ops->get_region_info(mdev, index,
>> +						    vfio_region_info);
>> +		mutex_unlock(&mdev->ops_lock);
> 
> Why do we have two ops_lock, one on the parent_device and one on the
> mdev_device?!  Is this one actually locking anything or also just
> providing serialization?  Why do some things get serialized at the
> parent level and some things at the device level?  Very confused by
> ops_lock.
>

There are two sets of callbacks:
* parent device callbacks: supported_config, create, destroy, start, stop
* mdev device callbacks: read, write, set_irqs, get_region_info,
validate_map_request

parent->ops_lock serializes the per-parent-device callbacks.
mdev->ops_lock serializes the per-mdev-device callbacks.

I'll add the above as a comment in mdev.h.


>> +
>> +static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
>> +{
>> +	/* Don't support MSIX for now */
>> +	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
>> +		return -1;
>> +
>> +	return 1;
> 
> Too much hard coding here, the mediated driver should define this.
> 

I'm testing INTx and MSI; I don't have a way to test MSI-X for now, so
we thought we could add support for MSI-X later. Until then it's
hard-coded to 1.

>> +
>> +		if (parent && parent->ops->set_irqs) {
>> +			mutex_lock(&mdev->ops_lock);
>> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
>> +						    hdr.start, hdr.count, data);
>> +			mutex_unlock(&mdev->ops_lock);
> 
> Device level serialization on set_irqs... interesting.
> 

I hope the answer above clarifies this.


>> +		}
>> +
>> +		kfree(ptr);
>> +		return ret;
>> +	}
>> +	}
>> +	return -ENOTTY;
>> +}
>> +
>> +ssize_t mdev_dev_config_rw(struct vfio_mdev *vmdev, char __user *buf,
>> +			   size_t count, loff_t *ppos, bool iswrite)
>> +{
>> +	struct mdev_device *mdev = vmdev->mdev;
>> +	struct parent_device *parent = mdev->parent;
>> +	int size = vmdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
>> +	int ret = 0;
>> +	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +
>> +	if (pos < 0 || pos >= size ||
>> +	    pos + count > size) {
>> +		pr_err("%s pos 0x%llx out of range\n", __func__, pos);
>> +		ret = -EFAULT;
>> +		goto config_rw_exit;
>> +	}
>> +
>> +	if (iswrite) {
>> +		char *usr_data, *ptr;
>> +
>> +		ptr = usr_data = memdup_user(buf, count);
>> +		if (IS_ERR(usr_data)) {
>> +			ret = PTR_ERR(usr_data);
>> +			goto config_rw_exit;
>> +		}
>> +
>> +		ret = parent->ops->write(mdev, usr_data, count,
>> +					  EMUL_CONFIG_SPACE, pos);
> 
> No serialization on this ops, thank goodness, but why?
>

It's there at the caller of mdev_dev_rw().


> This read/write interface still seems strange to me...
> 

Replied to this in the 1st patch.

>> +
>> +		memcpy((void *)(vmdev->vconfig + pos), (void *)usr_data, count);
>> +		kfree(ptr);
>> +	} else {
>> +		char *ret_data, *ptr;
>> +
>> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);
>> +
>> +		if (IS_ERR(ret_data)) {
>> +			ret = PTR_ERR(ret_data);
>> +			goto config_rw_exit;
>> +		}
>> +
>> +		ret = parent->ops->read(mdev, ret_data, count,
>> +					EMUL_CONFIG_SPACE, pos);
>> +
>> +		if (ret > 0) {
>> +			if (copy_to_user(buf, ret_data, ret))
>> +				ret = -EFAULT;
>> +			else
>> +				memcpy((void *)(vmdev->vconfig + pos),
>> +					(void *)ret_data, count);
>> +		}
>> +		kfree(ptr);
> 
> So vconfig caches all of config space for the mdev, but we only ever
> use it to read the BAR address via mdev_read_base()... why?  I hope the
> mdev driver doesn't freak out if the user reads the mmio region before
> writing a base address (remember the vfio API aspect of the interface
> doesn't necessarily follow the VM PCI programming API)
> 

How could the user read the MMIO region from the guest before writing
the base address? Wouldn't that be a bug?
From our driver, if pos is not within the base address range, we return
an error for the read/write.


>> +	}
>> +config_rw_exit:
>> +
>> +	if (ret > 0)
>> +		*ppos += ret;
>> +
>> +	return ret;
>> +}
>> +
>> +ssize_t mdev_dev_bar_rw(struct vfio_mdev *vmdev, char __user *buf,
>> +			size_t count, loff_t *ppos, bool iswrite)
>> +{
>> +	struct mdev_device *mdev = vmdev->mdev;
>> +	struct parent_device *parent = mdev->parent;
>> +	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
>> +	loff_t pos;
>> +	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> +	int ret = 0;
>> +
>> +	if (!vmdev->vfio_region_info[bar_index].start)
>> +		mdev_read_base(vmdev);
>> +
>> +	if (offset >= vmdev->vfio_region_info[bar_index].size) {
>> +		ret = -EINVAL;
>> +		goto bar_rw_exit;
>> +	}
>> +
>> +	count = min(count,
>> +		    (size_t)(vmdev->vfio_region_info[bar_index].size - offset));
>> +
>> +	pos = vmdev->vfio_region_info[bar_index].start + offset;
> 
> In the case of a mpci dev, @start is the vconfig BAR value, so it's
> user (guest) writable, and the mediated driver is supposed to
> understand that?  I suppose it saw the config write too, if there was
> one, but the mediated driver gives us region info based on region index.
> We have the region index here.  Why wouldn't we do reads and writes
> based on region index and offset and eliminate vconfig?  Seems like
> that would consolidate a lot of this, we don't care what we're reading
> and writing, just pass it through.  Mediated pci drivers would simply
> need to match indexes to those already defined for vfio-pci.
> 

OK, looking at it; this will remove vconfig completely.

Thanks,
Kirti.

^ permalink raw reply	[flat|nested] 51+ messages in thread


* Re: [PATCH 1/3] Mediated device Core driver
  2016-06-24 17:54       ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-24 19:40         ` Alex Williamson
  -1 siblings, 0 replies; 51+ messages in thread
From: Alex Williamson @ 2016-06-24 19:40 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

On Fri, 24 Jun 2016 23:24:58 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Alex,
> 
> Thanks for taking closer look. I'll incorporate all the nits you suggested.
> 
> On 6/22/2016 3:00 AM, Alex Williamson wrote:
> > On Mon, 20 Jun 2016 22:01:46 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >  
> ...
> >> +
> >> +config MDEV
> >> +    tristate "Mediated device driver framework"
> >> +    depends on VFIO
> >> +    default n
> >> +    help
> >> +        MDEV provides a framework to virtualize device without SR-IOV cap
> >> +        See Documentation/mdev.txt for more details.  
> > 
> > Documentation pointer still doesn't exist.  Perhaps this file would be
> > a more appropriate place than the commit log for some of the
> > information above.
> >   
> 
> Sure, I'll add these details to documentation.
> 
> > Every time I review this I'm struggling to figure out why this isn't
> > VFIO_MDEV since it's really tied to vfio and difficult to evaluate it
> > as some sort of standalone mediated device interface.  I don't know
> > the answer, but it always strikes me as a discontinuity.
> >   
> 
> Ok. I'll change to VFIO_MDEV
> 
> >> +
> >> +static struct mdev_device *find_mdev_device(struct parent_device *parent,
> >> +					    uuid_le uuid, int instance)
> >> +{
> >> +	struct mdev_device *mdev = NULL, *p;
> >> +
> >> +	list_for_each_entry(p, &parent->mdev_list, next) {
> >> +		if ((uuid_le_cmp(p->uuid, uuid) == 0) &&
> >> +		    (p->instance == instance)) {
> >> +			mdev = p;  
> > 
> > Locking here is still broken, the callers are create and destroy, which
> > can still race each other and themselves.
> >  
> 
> Fixed it.
> 
> >> +
> >> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
> >> +{
> >> +	struct parent_device *parent = mdev->parent;
> >> +	int ret;
> >> +
> >> +	mutex_lock(&parent->ops_lock);
> >> +	if (parent->ops->create) {  
> > 
> > How would a parent_device without ops->create or ops->destroy be useful?
> > Perhaps mdev_register_driver() should enforce required ops.  mdev.h
> > should at least document which ops are optional if they really are
> > optional.  
> 
> Makes sense, adding check in mdev_register_driver() to mandate create
> and destroy in ops. I'll also update the comments in mdev.h for
> mandatory and optional ops.
> 
> >   
> >> +		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
> >> +					mdev->instance, mdev_params);
> >> +		if (ret)
> >> +			goto create_ops_err;
> >> +	}
> >> +
> >> +	ret = mdev_add_attribute_group(&mdev->dev,
> >> +					parent->ops->mdev_attr_groups);  
> > 
> > An error here seems to put us in a bad place, the device is created but
> > the attributes are broken, is it the caller's responsibility to
> > destroy?  Seems like we need a cleanup if this fails.
> >   
> 
> Right, adding cleanup here.
> 
> >> +create_ops_err:
> >> +	mutex_unlock(&parent->ops_lock);  
> > 
> > It seems like ops_lock isn't used so much as a lock as a serialization
> > mechanism.  Why?  Where is this serialization per parent device
> > documented?
> >  
> 
> parent->ops_lock serializes the parent device callbacks to the vendor
> driver, i.e. supported_config(), create() and destroy().
> mdev->ops_lock serializes the mediated-device-related callbacks to the
> vendor driver, i.e. start(), stop(), read(), write(), set_irqs(),
> get_region_info() and validate_map_request().
> It's not documented; I'll add comments to mdev.h about these locks.

Should it be the mediated driver core's responsibility to do this?  If
a given mediated driver wants to serialize on its own, it can do that,
but I don't see why we would impose that on every mediated driver.

> >> +	return ret;
> >> +}
> >> +
> >> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
> >> +{
> >> +	struct parent_device *parent = mdev->parent;
> >> +	int ret = 0;
> >> +
> >> +	/*
> >> +	 * If vendor driver doesn't return success that means vendor
> >> +	 * driver doesn't support hot-unplug
> >> +	 */
> >> +	mutex_lock(&parent->ops_lock);
> >> +	if (parent->ops->destroy) {
> >> +		ret = parent->ops->destroy(parent->dev, mdev->uuid,
> >> +					   mdev->instance);
> >> +		if (ret && !force) {  
> > 
> > It seems this is not so much a 'force' but an ignore errors, we never
> > actually force the mdev driver to destroy the device... which makes me
> > wonder if there are leaks there.
> >   
> 
> Consider a case where a VM is running or in the teardown path and the
> parent device is unbound from the vendor driver; then the vendor driver
> would call mdev_unregister_device() from its remove() callback. Even if
> parent->ops->destroy() returns an error (which could also mean that
> hot-unplug is not supported), we still have to destroy the mdev device,
> since the remove() callback doesn't honor a returned error. In that
> case it's a forced removal.
> 
> >> +
> >> +/*
> >> + * mdev_unregister_device : Unregister a parent device
> >> + * @dev: device structure representing parent device.
> >> + *
> >> + * Remove device from list of registered parent devices. Give a chance to free
> >> + * existing mediated devices for given device.
> >> + */
> >> +
> >> +void mdev_unregister_device(struct device *dev)
> >> +{
> >> +	struct parent_device *parent;
> >> +	struct mdev_device *mdev, *n;
> >> +	int ret;
> >> +
> >> +	mutex_lock(&parent_devices.list_lock);
> >> +	parent = find_parent_device(dev);
> >> +
> >> +	if (!parent) {
> >> +		mutex_unlock(&parent_devices.list_lock);
> >> +		return;
> >> +	}
> >> +	dev_info(dev, "MDEV: Unregistering\n");
> >> +
> >> +	/*
> >> +	 * Remove parent from the list and remove create and destroy sysfs
> >> +	 * files so that no new mediated device could be created for this parent
> >> +	 */
> >> +	list_del(&parent->next);
> >> +	mdev_remove_sysfs_files(dev);
> >> +	mutex_unlock(&parent_devices.list_lock);
> >> +
> >> +	mutex_lock(&parent->ops_lock);
> >> +	mdev_remove_attribute_group(dev,
> >> +				    parent->ops->dev_attr_groups);
> >> +	mutex_unlock(&parent->ops_lock);
> >> +
> >> +	mutex_lock(&parent->mdev_list_lock);
> >> +	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
> >> +		mdev_device_destroy_ops(mdev, true);
> >> +		list_del(&mdev->next);
> >> +		mdev_put_device(mdev);
> >> +	}
> >> +	mutex_unlock(&parent->mdev_list_lock);
> >> +
> >> +	do {
> >> +		ret = wait_event_interruptible_timeout(parent->release_done,
> >> +				list_empty(&parent->mdev_list), HZ * 10);  
> > 
> > But we do a list_del for each mdev in mdev_list above, how could the
> > list not be empty here?  I think you're trying to wait for all the mdev
> > devices to be released, but I don't think this does that.  Isn't the
> > list empty regardless?
> >  
> 
> Right, I do want to wait for all the mdev devices to be released. I'm
> moving list_del(&mdev->next) from the above for loop into
> mdev_release_device(), so that the mdev is removed from the list on the
> last mdev_put_device().
> 
> 
> >> +		if (ret == -ERESTARTSYS) {
> >> +			dev_warn(dev, "Mediated devices are in use, task"
> >> +				      " \"%s\" (%d) "
> >> +				      "blocked until all are released",
> >> +				      current->comm, task_pid_nr(current));
> >> +		}
> >> +	} while (ret <= 0);
> >> +
> >> +	mdev_put_parent(parent);
> >> +}
> >> +EXPORT_SYMBOL(mdev_unregister_device);
> >> +
> >> +  
> 
> 
> >> +int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> >> +		       char *mdev_params)
> >> +{
> >> +	int ret;
> >> +	struct mdev_device *mdev;
> >> +	struct parent_device *parent;
> >> +
> >> +	parent = mdev_get_parent_by_dev(dev);
> >> +	if (!parent)
> >> +		return -EINVAL;
> >> +
> >> +	/* Check for duplicate */
> >> +	mdev = find_mdev_device(parent, uuid, instance);  
> > 
> > But this doesn't actually prevent duplicates because we're not
> > holding any lock to guarantee that another racing process doesn't
> > create the same {uuid,instance} between where we check and the below
> > list_add.
> >   
> 
> Oops, I missed this race condition. I'm moving
> mutex_lock(&parent->mdev_list_lock) before find_mdev_device() in
> mdev_device_create() and mdev_device_destroy(), so the duplicate check
> and the list insertion happen under the same lock.
> 
> 
> >> +
> >> +int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> >> +			char *mdev_params);
> >> +int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
> >> +void mdev_device_supported_config(struct device *dev, char *str);
> >> +int  mdev_device_start(uuid_le uuid);
> >> +int  mdev_device_shutdown(uuid_le uuid);  
> > 
> > nit, stop is to start as shutdown is to startup.  IOW, should this be
> > mdev_device_stop()?
> >  
> 
> Ok. Renaming mdev_device_shutdown() to mdev_device_stop().
> 
> 
> >> +
> >> +struct pci_region_info {
> >> +	uint64_t start;
> >> +	uint64_t size;
> >> +	uint32_t flags;		/* VFIO region info flags */
> >> +};
> >> +
> >> +enum mdev_emul_space {
> >> +	EMUL_CONFIG_SPACE,	/* PCI configuration space */
> >> +	EMUL_IO,		/* I/O register space */
> >> +	EMUL_MMIO		/* Memory-mapped I/O space */
> >> +};  
> > 
> > 
> > I'm still confused why this is needed, perhaps a description here would
> > be useful so I can stop asking.  Clearly config space is PCI only, so
> > it's strange to have it in the common code.  Everyone not on x86 will
> > say I/O space is also strange.  I can't keep it in my head why the
> > read/write offsets aren't sufficient for the driver to figure out what
> > type it is.
> > 
> >  
> 
> Now that the VFIO_PCI_OFFSET_* macros are moved to vfio.h, which the
> vendor driver can also use, the above enum could be removed from
> read/write. But again, these macros are only useful when the parent
> device is a PCI device. How would a non-PCI parent device differentiate
> I/O ports from MMIO?

Moving VFIO_PCI_OFFSET_* to vfio.h already worries me, the vfio api
does not impose fixed offsets, it's simply an implementation detail of
vfio-pci.  We should be free to change that whenever we want and not
break userspace.  By moving it to vfio.h and potentially having
external mediated drivers depend on those offset macros, they now become
part of the kABI.  So more and more, I'd prefer that reads/writes/mmaps
get passed directly to the mediated driver, let them define which
offset is which, the core is just a passthrough.  For non-PCI devices,
like platform devices, the indexes are implementation specific, the
user really needs to know how to work with the specific device and how
it defines device mmio to region indexes.
 
> >> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> >> +				 struct pci_region_info *region_info);  
> > 
> > This can't be //pci_//region_info.  How do you intend to support things
> > like sparse mmap capabilities in the user REGION_INFO ioctl when such
> > things are not part of the mediated device API?  Seems like the driver
> > should just return a buffer.
> >  
> 
> If not pci_region_info, can we use vfio_region_info here, even to fetch
> sparse mmap capabilities from the vendor driver?

Sure, you can use vfio_region_info, then it's just a pointer to a
buffer allocated by the callee and the mediated core is just a
passthrough, which is probably how it should be.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] Mediated device Core driver
@ 2016-06-24 19:40         ` Alex Williamson
  0 siblings, 0 replies; 51+ messages in thread
From: Alex Williamson @ 2016-06-24 19:40 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

On Fri, 24 Jun 2016 23:24:58 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Alex,
> 
> Thanks for taking closer look. I'll incorporate all the nits you suggested.
> 
> On 6/22/2016 3:00 AM, Alex Williamson wrote:
> > On Mon, 20 Jun 2016 22:01:46 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >  
> ...
> >> +
> >> +config MDEV
> >> +    tristate "Mediated device driver framework"
> >> +    depends on VFIO
> >> +    default n
> >> +    help
> >> +        MDEV provides a framework to virtualize device without SR-IOV cap
> >> +        See Documentation/mdev.txt for more details.  
> > 
> > Documentation pointer still doesn't exist.  Perhaps this file would be
> > a more appropriate place than the commit log for some of the
> > information above.
> >   
> 
> Sure, I'll add these details to documentation.
> 
> > Every time I review this I'm struggling to figure out why this isn't
> > VFIO_MDEV since it's really tied to vfio and difficult to evaluate it
> > as some sort of standalone mediated device interface.  I don't know
> > the answer, but it always strikes me as a discontinuity.
> >   
> 
> Ok. I'll change to VFIO_MDEV
> 
> >> +
> >> +static struct mdev_device *find_mdev_device(struct parent_device *parent,
> >> +					    uuid_le uuid, int instance)
> >> +{
> >> +	struct mdev_device *mdev = NULL, *p;
> >> +
> >> +	list_for_each_entry(p, &parent->mdev_list, next) {
> >> +		if ((uuid_le_cmp(p->uuid, uuid) == 0) &&
> >> +		    (p->instance == instance)) {
> >> +			mdev = p;  
> > 
> > Locking here is still broken, the callers are create and destroy, which
> > can still race each other and themselves.
> >  
> 
> Fixed it.
> 
> >> +
> >> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
> >> +{
> >> +	struct parent_device *parent = mdev->parent;
> >> +	int ret;
> >> +
> >> +	mutex_lock(&parent->ops_lock);
> >> +	if (parent->ops->create) {  
> > 
> > How would a parent_device without ops->create or ops->destroy useful?
> > Perhaps mdev_register_driver() should enforce required ops.  mdev.h
> > should at least document which ops are optional if they really are
> > optional.  
> 
> Makes sense, adding check in mdev_register_driver() to mandate create
> and destroy in ops. I'll also update the comments in mdev.h for
> mandatory and optional ops.
> 
> >   
> >> +		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
> >> +					mdev->instance, mdev_params);
> >> +		if (ret)
> >> +			goto create_ops_err;
> >> +	}
> >> +
> >> +	ret = mdev_add_attribute_group(&mdev->dev,
> >> +					parent->ops->mdev_attr_groups);  
> > 
> > An error here seems to put us in a bad place, the device is created but
> > the attributes are broken, is it the caller's responsibility to
> > destroy?  Seems like we need a cleanup if this fails.
> >   
> 
> Right, adding cleanup here.
> 
> >> +create_ops_err:
> >> +	mutex_unlock(&parent->ops_lock);  
> > 
> > It seems like ops_lock isn't used so much as a lock as a serialization
> > mechanism.  Why?  Where is this serialization per parent device
> > documented?
> >  
> 
> parent->ops_lock is to serialize parent device callbacks to vendor
> driver, i.e supported_config(), create() and destroy().
> mdev->ops_lock is to serialize mediated device related callbacks to
> vendor driver, i.e. start(), stop(), read(), write(), set_irqs(),
> get_region_info(), validate_map_request().
> It's not documented; I'll add comments to mdev.h about these locks.

Should it be the mediated driver core's responsibility to do this?  If
a given mediated driver wants to serialize on its own, it can do
that, but I don't see why we would impose that on every mediated driver.

> >> +	return ret;
> >> +}
> >> +
> >> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
> >> +{
> >> +	struct parent_device *parent = mdev->parent;
> >> +	int ret = 0;
> >> +
> >> +	/*
> >> +	 * If vendor driver doesn't return success that means vendor
> >> +	 * driver doesn't support hot-unplug
> >> +	 */
> >> +	mutex_lock(&parent->ops_lock);
> >> +	if (parent->ops->destroy) {
> >> +		ret = parent->ops->destroy(parent->dev, mdev->uuid,
> >> +					   mdev->instance);
> >> +		if (ret && !force) {  
> > 
> > It seems this is not so much a 'force' but an ignore errors, we never
> > actually force the mdev driver to destroy the device... which makes me
> > wonder if there are leaks there.
> >   
> 
> Consider a case where the VM is running or in the teardown path and the
> parent device is unbound from the vendor driver; then the vendor driver
> would call mdev_unregister_device() from its remove() call. Even if
> parent->ops->destroy() returns an error, that could also mean that
> hot-unplug is not supported, but we have to destroy the mdev device. The
> remove() call doesn't honor the returned error. In that case it's a force
> removal.
> 
> >> +
> >> +/*
> >> + * mdev_unregister_device : Unregister a parent device
> >> + * @dev: device structure representing parent device.
> >> + *
> >> + * Remove device from list of registered parent devices. Give a chance to free
> >> + * existing mediated devices for given device.
> >> + */
> >> +
> >> +void mdev_unregister_device(struct device *dev)
> >> +{
> >> +	struct parent_device *parent;
> >> +	struct mdev_device *mdev, *n;
> >> +	int ret;
> >> +
> >> +	mutex_lock(&parent_devices.list_lock);
> >> +	parent = find_parent_device(dev);
> >> +
> >> +	if (!parent) {
> >> +		mutex_unlock(&parent_devices.list_lock);
> >> +		return;
> >> +	}
> >> +	dev_info(dev, "MDEV: Unregistering\n");
> >> +
> >> +	/*
> >> +	 * Remove parent from the list and remove create and destroy sysfs
> >> +	 * files so that no new mediated device could be created for this parent
> >> +	 */
> >> +	list_del(&parent->next);
> >> +	mdev_remove_sysfs_files(dev);
> >> +	mutex_unlock(&parent_devices.list_lock);
> >> +
> >> +	mutex_lock(&parent->ops_lock);
> >> +	mdev_remove_attribute_group(dev,
> >> +				    parent->ops->dev_attr_groups);
> >> +	mutex_unlock(&parent->ops_lock);
> >> +
> >> +	mutex_lock(&parent->mdev_list_lock);
> >> +	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
> >> +		mdev_device_destroy_ops(mdev, true);
> >> +		list_del(&mdev->next);
> >> +		mdev_put_device(mdev);
> >> +	}
> >> +	mutex_unlock(&parent->mdev_list_lock);
> >> +
> >> +	do {
> >> +		ret = wait_event_interruptible_timeout(parent->release_done,
> >> +				list_empty(&parent->mdev_list), HZ * 10);  
> > 
> > But we do a list_del for each mdev in mdev_list above, how could the
> > list not be empty here?  I think you're trying to wait for all the mdev
> > devices to be released, but I don't think this does that.  Isn't the
> > list empty regardless?
> >  
> 
> Right, I do want to wait for all the mdev devices to be released. Moving
> list_del(&mdev->next) from the above for loop to mdev_release_device()
> so that mdev will be removed from list on last mdev_put_device().
> 
> 
> >> +		if (ret == -ERESTARTSYS) {
> >> +			dev_warn(dev, "Mediated devices are in use, task"
> >> +				      " \"%s\" (%d) "
> >> +				      "blocked until all are released",
> >> +				      current->comm, task_pid_nr(current));
> >> +		}
> >> +	} while (ret <= 0);
> >> +
> >> +	mdev_put_parent(parent);
> >> +}
> >> +EXPORT_SYMBOL(mdev_unregister_device);
> >> +
> >> +  
> 
> 
> >> +int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> >> +		       char *mdev_params)
> >> +{
> >> +	int ret;
> >> +	struct mdev_device *mdev;
> >> +	struct parent_device *parent;
> >> +
> >> +	parent = mdev_get_parent_by_dev(dev);
> >> +	if (!parent)
> >> +		return -EINVAL;
> >> +
> >> +	/* Check for duplicate */
> >> +	mdev = find_mdev_device(parent, uuid, instance);  
> > 
> > But this doesn't actually prevent duplicates because we we're not
> > holding any lock the guarantee that another racing process doesn't
> > create the same {uuid,instance} between where we check and the below
> > list_add.
> >   
> 
> Oops I missed this race condition. Moving
> mutex_lock(&parent->mdev_list_lock);
> before find_mdev_device() in mdev_device_create() and
> mdev_device_destroy().
> 
> 
> >> +
> >> +int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> >> +			char *mdev_params);
> >> +int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
> >> +void mdev_device_supported_config(struct device *dev, char *str);
> >> +int  mdev_device_start(uuid_le uuid);
> >> +int  mdev_device_shutdown(uuid_le uuid);  
> > 
> > nit, stop is to start as shutdown is to startup.  IOW, should this be
> > mdev_device_stop()?
> >  
> 
> Ok. Renaming mdev_device_shutdown() to mdev_device_stop().
> 
> 
> >> +
> >> +struct pci_region_info {
> >> +	uint64_t start;
> >> +	uint64_t size;
> >> +	uint32_t flags;		/* VFIO region info flags */
> >> +};
> >> +
> >> +enum mdev_emul_space {
> >> +	EMUL_CONFIG_SPACE,	/* PCI configuration space */
> >> +	EMUL_IO,		/* I/O register space */
> >> +	EMUL_MMIO		/* Memory-mapped I/O space */
> >> +};  
> > 
> > 
> > I'm still confused why this is needed, perhaps a description here would
> > be useful so I can stop asking.  Clearly config space is PCI only, so
> > it's strange to have it in the common code.  Everyone not on x86 will
> > say I/O space is also strange.  I can't keep it in my head why the
> > read/write offsets aren't sufficient for the driver to figure out what
> > type it is.
> > 
> >  
> 
> Now that VFIO_PCI_OFFSET_* macros are moved to vfio.h which vendor
> driver can also use, above enum could be removed from read/write. But
> again these macros are useful when parent device is PCI device. How
> would non-pci parent device differentiate IO ports and MMIO?

Moving VFIO_PCI_OFFSET_* to vfio.h already worries me, the vfio api
does not impose fixed offsets, it's simply an implementation detail of
vfio-pci.  We should be free to change that whenever we want and not
break userspace.  By moving it to vfio.h and potentially having
external mediated drivers depend on those offset macros, they now become
part of the kABI.  So more and more, I'd prefer that reads/writes/mmaps
get passed directly to the mediated driver, let them define which
offset is which, the core is just a passthrough.  For non-PCI devices,
like platform devices, the indexes are implementation specific, the
user really needs to know how to work with the specific device and how
it defines device mmio to region indexes.
 
> >> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> >> +				 struct pci_region_info *region_info);  
> > 
> > This can't be //pci_//region_info.  How do you intend to support things
> > like sparse mmap capabilities in the user REGION_INFO ioctl when such
> > things are not part of the mediated device API?  Seems like the driver
> > should just return a buffer.
> >  
> 
> If not pci_region_info, can use vfio_region_info here, even to fetch
> sparce mmap capabilities from vendor driver?

Sure, you can use vfio_region_info, then it's just a pointer to a
buffer allocated by the callee and the mediated core is just a
passthrough, which is probably how it should be.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/3] VFIO driver for mediated PCI device
  2016-06-24 18:34       ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-24 19:45         ` Alex Williamson
  -1 siblings, 0 replies; 51+ messages in thread
From: Alex Williamson @ 2016-06-24 19:45 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

On Sat, 25 Jun 2016 00:04:27 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Thanks Alex.
> 
> 
> On 6/22/2016 4:18 AM, Alex Williamson wrote:
> > On Mon, 20 Jun 2016 22:01:47 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> +
> >> +static int get_mdev_region_info(struct mdev_device *mdev,
> >> +				struct pci_region_info *vfio_region_info,
> >> +				int index)
> >> +{
> >> +	int ret = -EINVAL;
> >> +	struct parent_device *parent = mdev->parent;
> >> +
> >> +	if (parent && dev_is_pci(parent->dev) && parent->ops->get_region_info) {
> >> +		mutex_lock(&mdev->ops_lock);
> >> +		ret = parent->ops->get_region_info(mdev, index,
> >> +						    vfio_region_info);
> >> +		mutex_unlock(&mdev->ops_lock);  
> > 
> > Why do we have two ops_lock, one on the parent_device and one on the
> > mdev_device?!  Is this one actually locking anything or also just
> > providing serialization?  Why do some things get serialized at the
> > parent level and some things at the device level?  Very confused by
> > ops_lock.
> >  
> 
> There are two sets of callback:
> * parent device callbacks: supported_config, create, destroy, start, stop
> * mdev device callbacks: read, write, set_irqs, get_region_info,
> validate_map_request
> 
> parent->ops_lock is to serialize per parent device callbacks.
> mdev->ops_lock is to serialize per mdev device callbacks.
> 
> I'll add above comment in mdev.h.

As mentioned in the other reply, I don't think serialization policy
should be imposed in the driver core.  If a given mediated driver needs
to serialize, it can do so internally.

> >> +
> >> +static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
> >> +{
> >> +	/* Don't support MSIX for now */
> >> +	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
> >> +		return -1;
> >> +
> >> +	return 1;  
> > 
> > Too much hard coding here, the mediated driver should define this.
> >   
> 
> I'm testing INTX and MSI; I don't have a way to test MSIX for now. So we
> thought we could add support for MSIX later. Till then it's hard coded to 1.

To me it screams that there needs to be an interface to the mediated
device here.  How do you even know that the mediated device intends to
support MSI?  What if it wants to emulate a VF and not support INTx?
This is basically just a big "TODO" flag that needs to be addressed
before a non-RFC.

> >> +
> >> +		if (parent && parent->ops->set_irqs) {
> >> +			mutex_lock(&mdev->ops_lock);
> >> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
> >> +						    hdr.start, hdr.count, data);
> >> +			mutex_unlock(&mdev->ops_lock);  
> > 
> > Device level serialization on set_irqs... interesting.
> >   
> 
> Hope answer above helps to clarify this.
> 
> 
> >> +		}
> >> +
> >> +		kfree(ptr);
> >> +		return ret;
> >> +	}
> >> +	}
> >> +	return -ENOTTY;
> >> +}
> >> +
> >> +ssize_t mdev_dev_config_rw(struct vfio_mdev *vmdev, char __user *buf,
> >> +			   size_t count, loff_t *ppos, bool iswrite)
> >> +{
> >> +	struct mdev_device *mdev = vmdev->mdev;
> >> +	struct parent_device *parent = mdev->parent;
> >> +	int size = vmdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
> >> +	int ret = 0;
> >> +	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> >> +
> >> +	if (pos < 0 || pos >= size ||
> >> +	    pos + count > size) {
> >> +		pr_err("%s pos 0x%llx out of range\n", __func__, pos);
> >> +		ret = -EFAULT;
> >> +		goto config_rw_exit;
> >> +	}
> >> +
> >> +	if (iswrite) {
> >> +		char *usr_data, *ptr;
> >> +
> >> +		ptr = usr_data = memdup_user(buf, count);
> >> +		if (IS_ERR(usr_data)) {
> >> +			ret = PTR_ERR(usr_data);
> >> +			goto config_rw_exit;
> >> +		}
> >> +
> >> +		ret = parent->ops->write(mdev, usr_data, count,
> >> +					  EMUL_CONFIG_SPACE, pos);  
> > 
> > No serialization on this ops, thank goodness, but why?
> >  
> 
> It's there at the caller of mdev_dev_rw().
> 
> 
> > This read/write interface still seems strange to me...
> >   
> 
> Replied on this in 1st Patch.
> 
> >> +
> >> +		memcpy((void *)(vmdev->vconfig + pos), (void *)usr_data, count);
> >> +		kfree(ptr);
> >> +	} else {
> >> +		char *ret_data, *ptr;
> >> +
> >> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);
> >> +
> >> +		if (IS_ERR(ret_data)) {
> >> +			ret = PTR_ERR(ret_data);
> >> +			goto config_rw_exit;
> >> +		}
> >> +
> >> +		ret = parent->ops->read(mdev, ret_data, count,
> >> +					EMUL_CONFIG_SPACE, pos);
> >> +
> >> +		if (ret > 0) {
> >> +			if (copy_to_user(buf, ret_data, ret))
> >> +				ret = -EFAULT;
> >> +			else
> >> +				memcpy((void *)(vmdev->vconfig + pos),
> >> +					(void *)ret_data, count);
> >> +		}
> >> +		kfree(ptr);  
> > 
> > So vconfig caches all of config space for the mdev, but we only ever
> > use it to read the BAR address via mdev_read_base()... why?  I hope the
> > mdev driver doesn't freak out if the user reads the mmio region before
> > writing a base address (remember the vfio API aspect of the interface
> > doesn't necessarily follow the VM PCI programming API)
> >   
> 
> How could the user read the mmio region from the guest before writing the
> base address? Wouldn't that be a bug?

The user is never to be trusted.  The possibility that the user is
either clueless or malicious needs to be accounted for to protect the
host kernel.

> From our driver, if pos is not within the base address range, then we
> return an error for the read/write.
> 
> 
> >> +	}
> >> +config_rw_exit:
> >> +
> >> +	if (ret > 0)
> >> +		*ppos += ret;
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +ssize_t mdev_dev_bar_rw(struct vfio_mdev *vmdev, char __user *buf,
> >> +			size_t count, loff_t *ppos, bool iswrite)
> >> +{
> >> +	struct mdev_device *mdev = vmdev->mdev;
> >> +	struct parent_device *parent = mdev->parent;
> >> +	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
> >> +	loff_t pos;
> >> +	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> >> +	int ret = 0;
> >> +
> >> +	if (!vmdev->vfio_region_info[bar_index].start)
> >> +		mdev_read_base(vmdev);
> >> +
> >> +	if (offset >= vmdev->vfio_region_info[bar_index].size) {
> >> +		ret = -EINVAL;
> >> +		goto bar_rw_exit;
> >> +	}
> >> +
> >> +	count = min(count,
> >> +		    (size_t)(vmdev->vfio_region_info[bar_index].size - offset));
> >> +
> >> +	pos = vmdev->vfio_region_info[bar_index].start + offset;  
> > 
> > In the case of a mpci dev, @start is the vconfig BAR value, so it's
> > user (guest) writable, and the mediated driver is supposed to
> > understand that?  I suppose it saw the config write too, if there was
> > one, but the mediated driver gives us region info based on region index.
> > We have the region index here.  Why wouldn't we do reads and writes
> > based on region index and offset and eliminate vconfig?  Seems like
> > that would consolidate a lot of this, we don't care what we're reading
> > and writing, just pass it through.  Mediated pci drivers would simply
> > need to match indexes to those already defined for vfio-pci.
> >   
> 
> Ok, looking at it. so this will remove vconfig completely.

Thanks,
Alex

^ permalink raw reply	[flat|nested] 51+ messages in thread


* Re: [PATCH 3/3] VFIO Type1 IOMMU: Add support for mediated devices
  2016-06-22  3:46     ` [Qemu-devel] " Alex Williamson
@ 2016-06-28 13:02       ` Kirti Wankhede
  -1 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-28 13:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: shuai.ruan, kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel,
	pbonzini, bjsdjshi, zhiyuan.lv



On 6/22/2016 9:16 AM, Alex Williamson wrote:
> On Mon, 20 Jun 2016 22:01:48 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>>  
>>  struct vfio_iommu {
>>  	struct list_head	domain_list;
>> +	struct vfio_domain	*mediated_domain;
> 
> I'm not really a fan of how this is so often used to special case the
> code...
> 
>>  	struct mutex		lock;
>>  	struct rb_root		dma_list;
>>  	bool			v2;
>> @@ -67,6 +69,13 @@ struct vfio_domain {
>>  	struct list_head	group_list;
>>  	int			prot;		/* IOMMU_CACHE */
>>  	bool			fgsp;		/* Fine-grained super pages */
>> +
>> +	/* Domain for mediated device which is without physical IOMMU */
>> +	bool			mediated_device;
> 
> But sometimes we use this to special case the code and other times we
> use domain_list being empty.  I thought the argument against pulling
> code out to a shared file was that this approach could be made
> maintainable.
> 

Functions that take struct vfio_domain *domain as an argument and are
intended to operate on that domain only check
if (domain->mediated_device), e.g. map_try_harder(), vfio_iommu_replay(),
vfio_test_domain_fgsp(). The checks in these functions could be removed,
but then it would be the callers' responsibility to make sure they don't
call these functions for the mediated_domain.
Whereas functions that take struct vfio_iommu *iommu as an argument and
traverse domain_list to find a domain, or operate on each domain in
domain_list, check if (list_empty(&iommu->domain_list)), e.g.
vfio_unmap_unpin(), vfio_iommu_map(), vfio_dma_do_map().


>> +
>> +	struct mm_struct	*mm;
>> +	struct rb_root		pfn_list;	/* pinned Host pfn list */
>> +	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
> 
> Seems like we could reduce overhead for the existing use cases by just
> adding a pointer here and making these last 3 entries part of the
> structure that gets pointed to.  Existence of the pointer would replace
> @mediated_device.
>

Ok.

>>  };
>>  
>>  struct vfio_dma {
>> @@ -79,10 +88,26 @@ struct vfio_dma {
>>  
>>  struct vfio_group {
>>  	struct iommu_group	*iommu_group;
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> 
> Where does CONFIG_MDEV_MODULE come from?
> 
> Plus, all the #ifdefs... <cringe>
> 

The config option MDEV is tristate, and when it is selected as a module
CONFIG_MDEV_MODULE is set in include/generated/autoconf.h.
The symbols mdev_bus_type, mdev_get_device_by_group() and mdev_put_device()
are only available when the MDEV option is selected as built-in or modular.
If the MDEV option is not selected, the vfio_iommu_type1 module should still
work for direct device assignment. Without these #ifdefs, the
vfio_iommu_type1 module fails to load with undefined symbols when MDEV
is not selected.

>> +	struct mdev_device	*mdev;
> 
> This gets set on attach_group where we use the iommu_group to lookup
> the mdev, so why can't we do that on the other paths that make use of
> this?  I think this is just holding a reference.
> 

mdev is retrieved in attach_group for 2 reasons:
1. to increase the ref count of the mdev, via mdev_get_device_by_group(),
when its iommu_group is attached. That reference should be dropped, by
mdev_put_device(), while detaching its iommu_group. This makes sure that
the mdev is not freed until its iommu_group is detached from the
container.

2. to save a reference to iommu_data that the vendor driver would use to
call vfio_pin_pages() and vfio_unpin_pages(). More details below.



>> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>> +			 int prot, unsigned long *pfn)
>>  {
>>  	struct page *page[1];
>>  	struct vm_area_struct *vma;
>> +	struct mm_struct *local_mm = mm;
>>  	int ret = -EFAULT;
>>  
>> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
>> +	if (!local_mm && !current->mm)
>> +		return -ENODEV;
>> +
>> +	if (!local_mm)
>> +		local_mm = current->mm;
> 
> The above would be much more concise if we just initialized local_mm
> as: mm ? mm : current->mm
> 
>> +
>> +	down_read(&local_mm->mmap_sem);
>> +	if (get_user_pages_remote(NULL, local_mm, vaddr, 1,
>> +				!!(prot & IOMMU_WRITE), 0, page, NULL) == 1) {
> 
> Um, the comment for get_user_pages_remote says:
> 
> "See also get_user_pages_fast, for performance critical applications."
> 
> So what penalty are we imposing on the existing behavior of type1
> here?  Previously we only needed to acquire mmap_sem if
> get_user_pages_fast() didn't work, so the existing use case seems to be
> compromised.
>

Yes.
get_user_pages_fast() pins pages from current->mm, but for a mediated
device the mm could be different from current->mm.

This penalty on the existing behavior could be avoided by:
if (!mm && current->mm)
    get_user_pages_fast(); //take fast path
else
    get_user_pages_remote(); // take slow path


>> +long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
>> +		     int prot)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +	struct vfio_domain *domain = NULL;
>> +	long unlocked = 0;
>> +	int i;
>> +
>> +	if (!iommu || !pfn)
>> +		return -EINVAL;
>> +
>> +	if (!iommu->mediated_domain)
>> +		return -EINVAL;
>> +
>> +	domain = iommu->mediated_domain;
> 
> Again, domain is already validated here.
> 
>> +
>> +	for (i = 0; i < npage; i++) {
>> +		struct vfio_pfn *p;
>> +
>> +		/* verify if pfn exist in pfn_list */
>> +		p = vfio_find_pfn(domain, *(pfn + i));
> 
> Why are we using array indexing above and array math here?  Were these
> functions written by different people?
> 

No, the input argument to vfio_unpin_pages() was always an array of pfns
to be unpinned.

>> +		if (!p)
>> +			continue;
> 
> Hmm, this seems like more of a bad thing than a continue.
> 

Callers of vfio_unpin_pages() are other modules. I feel it's better to do
a sanity check than crash.


>>  
>>  static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  {
>> @@ -341,6 +580,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  
>>  	if (!dma->size)
>>  		return;
>> +
>> +	if (list_empty(&iommu->domain_list))
>> +		return;
> 
> Huh?  This would be a serious consistency error if this happened for
> the existing use case.
>

This will not happen for the existing use case, i.e. direct device
assignment. This case only arises when only mediated devices are
assigned and there are no directly assigned devices.


>> @@ -569,6 +819,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	uint64_t mask;
>>  	struct vfio_dma *dma;
>>  	unsigned long pfn;
>> +	struct vfio_domain *domain = NULL;
>>  
>>  	/* Verify that none of our __u64 fields overflow */
>>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>> @@ -611,10 +862,21 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	/* Insert zero-sized and grow as we map chunks of it */
>>  	vfio_link_dma(iommu, dma);
>>  
>> +	/*
>> +	 * Skip pin and map if and domain list is empty
>> +	 */
>> +	if (list_empty(&iommu->domain_list)) {
>> +		dma->size = size;
>> +		goto map_done;
>> +	}
> 
> Again, this would be a serious consistency error for the existing use
> case.  Let's use indicators that are explicit.
>

Why? For the existing use case (i.e. direct device assignment) domain_list
will not be empty; domain_list will only be empty when only mediated
devices are assigned and no direct device is assigned.

>>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  					 struct iommu_group *iommu_group)
>>  {
>>  	struct vfio_iommu *iommu = iommu_data;
>> -	struct vfio_group *group, *g;
>> +	struct vfio_group *group;
>>  	struct vfio_domain *domain, *d;
>>  	struct bus_type *bus = NULL;
>>  	int ret;
>> @@ -746,14 +1030,21 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  	mutex_lock(&iommu->lock);
>>  
>>  	list_for_each_entry(d, &iommu->domain_list, next) {
>> -		list_for_each_entry(g, &d->group_list, next) {
>> -			if (g->iommu_group != iommu_group)
>> -				continue;
>> +		if (is_iommu_group_present(d, iommu_group)) {
>> +			mutex_unlock(&iommu->lock);
>> +			return -EINVAL;
>> +		}
>> +	}
>>  
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>> +	if (iommu->mediated_domain) {
>> +		if (is_iommu_group_present(iommu->mediated_domain,
>> +					   iommu_group)) {
>>  			mutex_unlock(&iommu->lock);
>>  			return -EINVAL;
>>  		}
>>  	}
>> +#endif
>>  
>>  	group = kzalloc(sizeof(*group), GFP_KERNEL);
>>  	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
>> @@ -769,6 +1060,36 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  	if (ret)
>>  		goto out_free;
>>  
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>> +	if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
>> +		struct mdev_device *mdev = NULL;
> 
> Unnecessary initialization.
> 
>> +
>> +		mdev = mdev_get_device_by_group(iommu_group);
>> +		if (!mdev)
>> +			goto out_free;
>> +
>> +		mdev->iommu_data = iommu;
> 
> This looks rather sketchy to me, we don't have a mediated driver in
> this series, but presumably the driver blindly calls vfio_pin_pages
> passing mdev->iommu_data and hoping that it's either NULL to generate
> an error or relevant to this iommu backend.  How would we add a second
> mediated driver iommu backend?  We're currently assuming the user
> configured this backend.  

If I understand correctly, your question is if two different mediated
devices are assigned to same container. In such case, the two mediated
devices will have different iommu_groups and will be added to
mediated_domain's group_list (iommu->mediated_domain->group_list).

> Should vfio_pin_pages instead have a struct
> device* parameter from which we would lookup the iommu_group and get to
> the vfio_domain?  That's a bit heavy weight, but we need something
> along those lines.
> 

There could be multiple mdev devices from same mediated vendor driver in
one container. In that case, that vendor driver need reference of
container or container->iommu_data to pin and unpin pages.
Similarly, there could be mutiple mdev devices from different mediated
vendor driver in one container, in that case both vendor driver need
reference to container or container->iommu_data to pin and unpin pages
in their driver.

>> +		mdev->iommu_data = iommu;
With the above line, a reference to container->iommu_data is kept in
mdev structure when the iommu_group is attached to a container so that
vendor drivers can find reference to pin and unpin pages.

If struct device* is passed as an argument to vfio_pin_pages, to find
reference to container of struct device *dev, have to find
vfio_device/vfio_group from dev that means traverse vfio.group_list for
each pin and unpin call. This list would be long when there are many
mdev devices in the system.

Is there any better way to find reference to container from struct device*?


>>  
>> @@ -930,8 +1289,28 @@ static void vfio_iommu_type1_release(void *iommu_data)
>>  	struct vfio_domain *domain, *domain_tmp;
>>  	struct vfio_group *group, *group_tmp;
>>  
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>> +	if (iommu->mediated_domain) {
>> +		domain = iommu->mediated_domain;
>> +		list_for_each_entry_safe(group, group_tmp,
>> +					 &domain->group_list, next) {
>> +			if (group->mdev) {
>> +				group->mdev->iommu_data = NULL;
>> +				mdev_put_device(group->mdev);
>> +			}
>> +			list_del(&group->next);
>> +			kfree(group);
>> +		}
>> +		vfio_iommu_unpin_api_domain(domain);
>> +		kfree(domain);
>> +		iommu->mediated_domain = NULL;
>> +	}
>> +#endif
> 
> I'm not really seeing how this is all that much more maintainable than
> what was proposed previously, has this aspect been worked on since last
> I reviewed this patch?
> 

There aren't many changes from v4 to v5 version of this patch.
Can you more specific on you concerns about maintainability? I'll
definitely address your concerns.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] VFIO Type1 IOMMU: Add support for mediated devices
@ 2016-06-28 13:02       ` Kirti Wankhede
  0 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-28 13:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi



On 6/22/2016 9:16 AM, Alex Williamson wrote:
> On Mon, 20 Jun 2016 22:01:48 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>>  
>>  struct vfio_iommu {
>>  	struct list_head	domain_list;
>> +	struct vfio_domain	*mediated_domain;
> 
> I'm not really a fan of how this is so often used to special case the
> code...
> 
>>  	struct mutex		lock;
>>  	struct rb_root		dma_list;
>>  	bool			v2;
>> @@ -67,6 +69,13 @@ struct vfio_domain {
>>  	struct list_head	group_list;
>>  	int			prot;		/* IOMMU_CACHE */
>>  	bool			fgsp;		/* Fine-grained super pages */
>> +
>> +	/* Domain for mediated device which is without physical IOMMU */
>> +	bool			mediated_device;
> 
> But sometimes we use this to special case the code and other times we
> use domain_list being empty.  I thought the argument against pulling
> code out to a shared file was that this approach could be made
> maintainable.
> 

Functions that take struct vfio_domain *domain as an argument and are
intended to operate on that domain only check if
(domain->mediated_device), e.g. map_try_harder(), vfio_iommu_replay()
and vfio_test_domain_fgsp(). The checks in these functions could be
removed, but then it would be the caller's responsibility to make sure
they are never called for the mediated_domain.
Functions that take struct vfio_iommu *iommu as an argument and traverse
domain_list, either to find a domain or to operate on each domain in the
list, instead check if (list_empty(&iommu->domain_list)), e.g.
vfio_unmap_unpin(), vfio_iommu_map() and vfio_dma_do_map().
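
To make the two styles concrete, they can be modeled in plain userspace
C. This is only an illustrative sketch — the structures and helper names
below are stand-ins, not the kernel definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the patch's structures (not the kernel code). */
struct vfio_domain {
	bool mediated_device;		/* per-domain special-case flag */
};

struct vfio_iommu {
	struct vfio_domain *domains;	/* stand-in for domain_list */
	size_t ndomains;		/* 0 models list_empty(&domain_list) */
};

/* Per-domain helpers (map_try_harder() etc.) test the domain's flag... */
static bool domain_needs_iommu_ops(const struct vfio_domain *d)
{
	return !d->mediated_device;
}

/* ...while per-iommu helpers (vfio_unmap_unpin() etc.) test list
 * emptiness instead. */
static bool iommu_has_hw_domains(const struct vfio_iommu *iommu)
{
	return iommu->ndomains != 0;
}
```

Both predicates answer the same underlying question through different
indicators, which is the synchronization burden being discussed here.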


>> +
>> +	struct mm_struct	*mm;
>> +	struct rb_root		pfn_list;	/* pinned Host pfn list */
>> +	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
> 
> Seems like we could reduce overhead for the existing use cases by just
> adding a pointer here and making these last 3 entries part of the
> structure that gets pointed to.  Existence of the pointer would replace
> @mediated_device.
>

Ok.
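
The restructuring suggested above could look something like the sketch
below (plain C with hypothetical names; the kernel types are replaced by
void pointers for illustration). The three mediated-only fields move
behind a pointer whose existence replaces @mediated_device:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mediated-only state, allocated only when a mediated device attaches. */
struct vfio_mediated_state {
	void *mm;		/* struct mm_struct * in the kernel */
	void *pfn_list;		/* struct rb_root in the kernel */
	void *pfn_list_lock;	/* struct mutex in the kernel */
};

struct vfio_domain {
	int prot;
	bool fgsp;
	struct vfio_mediated_state *mediated;	/* NULL for HW domains */
};

/* Existence of the pointer replaces the old bool mediated_device flag. */
static bool domain_is_mediated(const struct vfio_domain *d)
{
	return d->mediated != NULL;
}
```

Ordinary direct-assignment domains then pay only one pointer of
overhead instead of carrying the mm/pfn_list/mutex fields inline.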

>>  };
>>  
>>  struct vfio_dma {
>> @@ -79,10 +88,26 @@ struct vfio_dma {
>>  
>>  struct vfio_group {
>>  	struct iommu_group	*iommu_group;
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> 
> Where does CONFIG_MDEV_MODULE come from?
> 
> Plus, all the #ifdefs... <cringe>
> 

The MDEV config option is tristate; when it is selected as a module,
CONFIG_MDEV_MODULE is set in include/generated/autoconf.h.
The symbols mdev_bus_type, mdev_get_device_by_group() and mdev_put_device()
are only available when the MDEV option is selected as built-in or modular.
If the MDEV option is not selected, the vfio_iommu_type1 module should
still work for direct device assignment. Without these #ifdefs, the
vfio_iommu_type1 module fails to load with undefined symbols when MDEV
is not selected.

>> +	struct mdev_device	*mdev;
> 
> This gets set on attach_group where we use the iommu_group to lookup
> the mdev, so why can't we do that on the other paths that make use of
> this?  I think this is just holding a reference.
> 

The mdev is retrieved in attach_group for two reasons:
1. To take a reference on the mdev, via mdev_get_device_by_group(), when
its iommu_group is attached. That reference is dropped, via
mdev_put_device(), when its iommu_group is detached. This makes sure
that the mdev is not freed until its iommu_group is detached from the
container.

2. To save a reference to iommu_data, which the vendor driver uses to
call vfio_pin_pages() and vfio_unpin_pages(). More details below.



>> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>> +			 int prot, unsigned long *pfn)
>>  {
>>  	struct page *page[1];
>>  	struct vm_area_struct *vma;
>> +	struct mm_struct *local_mm = mm;
>>  	int ret = -EFAULT;
>>  
>> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
>> +	if (!local_mm && !current->mm)
>> +		return -ENODEV;
>> +
>> +	if (!local_mm)
>> +		local_mm = current->mm;
> 
> The above would be much more concise if we just initialized local_mm
> as: mm ? mm : current->mm
> 
>> +
>> +	down_read(&local_mm->mmap_sem);
>> +	if (get_user_pages_remote(NULL, local_mm, vaddr, 1,
>> +				!!(prot & IOMMU_WRITE), 0, page, NULL) == 1) {
> 
> Um, the comment for get_user_pages_remote says:
> 
> "See also get_user_pages_fast, for performance critical applications."
> 
> So what penalty are we imposing on the existing behavior of type1
> here?  Previously we only needed to acquire mmap_sem if
> get_user_pages_fast() didn't work, so the existing use case seems to be
> compromised.
>

Yes.
get_user_pages_fast() pins pages from current->mm, but for a mediated
device the mm could be different from current->mm.

The penalty to the existing behavior could be avoided by:
if (!mm && current->mm)
    get_user_pages_fast(); //take fast path
else
    get_user_pages_remote(); // take slow path
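
That dispatch could be modeled as below (userspace C; the enum, helper
name and void-pointer mm stand-ins are hypothetical, for illustration
only — the real code would call get_user_pages_fast() or
get_user_pages_remote()):

```c
#include <assert.h>
#include <stddef.h>

enum gup_path { GUP_FAST, GUP_REMOTE, GUP_NO_MM };

/* mm == NULL means "pin in the caller's own mm" (the existing type1
 * case); a non-NULL mm is the mediated-device case, which must take the
 * slow remote path. */
static enum gup_path pick_gup_path(const void *mm, const void *current_mm)
{
	if (!mm && current_mm)
		return GUP_FAST;	/* existing use case keeps the fast path */
	if (mm)
		return GUP_REMOTE;	/* mediated device: foreign mm, slow path */
	return GUP_NO_MM;		/* no mm at all: -ENODEV in the patch */
}
```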


>> +long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
>> +		     int prot)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +	struct vfio_domain *domain = NULL;
>> +	long unlocked = 0;
>> +	int i;
>> +
>> +	if (!iommu || !pfn)
>> +		return -EINVAL;
>> +
>> +	if (!iommu->mediated_domain)
>> +		return -EINVAL;
>> +
>> +	domain = iommu->mediated_domain;
> 
> Again, domain is already validated here.
> 
>> +
>> +	for (i = 0; i < npage; i++) {
>> +		struct vfio_pfn *p;
>> +
>> +		/* verify if pfn exist in pfn_list */
>> +		p = vfio_find_pfn(domain, *(pfn + i));
> 
> Why are we using array indexing above and array math here?  Were these
> functions written by different people?
> 

No, the input argument to vfio_unpin_pages() was always an array of pfns
to be unpinned.

>> +		if (!p)
>> +			continue;
> 
> Hmm, this seems like more of a bad thing than a continue.
> 

The callers of vfio_unpin_pages() are other modules. I feel it's better
to do a sanity check than to crash.


>>  
>>  static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  {
>> @@ -341,6 +580,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>  
>>  	if (!dma->size)
>>  		return;
>> +
>> +	if (list_empty(&iommu->domain_list))
>> +		return;
> 
> Huh?  This would be a serious consistency error if this happened for
> the existing use case.
>

This will not happen for the existing use case, i.e. direct device
assignment. This case is only hit when a mediated device is assigned
and there are no directly assigned devices.


>> @@ -569,6 +819,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	uint64_t mask;
>>  	struct vfio_dma *dma;
>>  	unsigned long pfn;
>> +	struct vfio_domain *domain = NULL;
>>  
>>  	/* Verify that none of our __u64 fields overflow */
>>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>> @@ -611,10 +862,21 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	/* Insert zero-sized and grow as we map chunks of it */
>>  	vfio_link_dma(iommu, dma);
>>  
>> +	/*
>> +	 * Skip pin and map if and domain list is empty
>> +	 */
>> +	if (list_empty(&iommu->domain_list)) {
>> +		dma->size = size;
>> +		goto map_done;
>> +	}
> 
> Again, this would be a serious consistency error for the existing use
> case.  Let's use indicators that are explicit.
>

Why? For the existing use case (i.e. direct device assignment)
domain_list will not be empty; it will only be empty when a mediated
device is assigned and no direct device is assigned.

>>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  					 struct iommu_group *iommu_group)
>>  {
>>  	struct vfio_iommu *iommu = iommu_data;
>> -	struct vfio_group *group, *g;
>> +	struct vfio_group *group;
>>  	struct vfio_domain *domain, *d;
>>  	struct bus_type *bus = NULL;
>>  	int ret;
>> @@ -746,14 +1030,21 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  	mutex_lock(&iommu->lock);
>>  
>>  	list_for_each_entry(d, &iommu->domain_list, next) {
>> -		list_for_each_entry(g, &d->group_list, next) {
>> -			if (g->iommu_group != iommu_group)
>> -				continue;
>> +		if (is_iommu_group_present(d, iommu_group)) {
>> +			mutex_unlock(&iommu->lock);
>> +			return -EINVAL;
>> +		}
>> +	}
>>  
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>> +	if (iommu->mediated_domain) {
>> +		if (is_iommu_group_present(iommu->mediated_domain,
>> +					   iommu_group)) {
>>  			mutex_unlock(&iommu->lock);
>>  			return -EINVAL;
>>  		}
>>  	}
>> +#endif
>>  
>>  	group = kzalloc(sizeof(*group), GFP_KERNEL);
>>  	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
>> @@ -769,6 +1060,36 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  	if (ret)
>>  		goto out_free;
>>  
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>> +	if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
>> +		struct mdev_device *mdev = NULL;
> 
> Unnecessary initialization.
> 
>> +
>> +		mdev = mdev_get_device_by_group(iommu_group);
>> +		if (!mdev)
>> +			goto out_free;
>> +
>> +		mdev->iommu_data = iommu;
> 
> This looks rather sketchy to me, we don't have a mediated driver in
> this series, but presumably the driver blindly calls vfio_pin_pages
> passing mdev->iommu_data and hoping that it's either NULL to generate
> an error or relevant to this iommu backend.  How would we add a second
> mediated driver iommu backend?  We're currently assuming the user
> configured this backend.  

If I understand correctly, your question is whether two different
mediated devices can be assigned to the same container. In that case,
the two mediated devices will have different iommu_groups and both will
be added to the mediated_domain's group_list
(iommu->mediated_domain->group_list).

> Should vfio_pin_pages instead have a struct
> device* parameter from which we would lookup the iommu_group and get to
> the vfio_domain?  That's a bit heavy weight, but we need something
> along those lines.
> 

There could be multiple mdev devices from the same vendor driver in one
container. In that case, that vendor driver needs a reference to the
container, or to container->iommu_data, to pin and unpin pages.
Similarly, there could be multiple mdev devices from different vendor
drivers in one container; in that case every vendor driver needs a
reference to the container or container->iommu_data to pin and unpin
pages in its driver.

>> +		mdev->iommu_data = iommu;
With the above line, a reference to the container's iommu_data is kept
in the mdev structure when the iommu_group is attached to a container,
so that vendor drivers have the reference they need to pin and unpin
pages.

If a struct device * is passed as an argument to vfio_pin_pages, then to
find the container of that device we would have to look up the
vfio_device/vfio_group from the dev, which means traversing
vfio.group_list on every pin and unpin call. This list would be long
when there are many mdev devices in the system.

Is there a better way to find the container from a struct device *?


>>  
>> @@ -930,8 +1289,28 @@ static void vfio_iommu_type1_release(void *iommu_data)
>>  	struct vfio_domain *domain, *domain_tmp;
>>  	struct vfio_group *group, *group_tmp;
>>  
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
>> +	if (iommu->mediated_domain) {
>> +		domain = iommu->mediated_domain;
>> +		list_for_each_entry_safe(group, group_tmp,
>> +					 &domain->group_list, next) {
>> +			if (group->mdev) {
>> +				group->mdev->iommu_data = NULL;
>> +				mdev_put_device(group->mdev);
>> +			}
>> +			list_del(&group->next);
>> +			kfree(group);
>> +		}
>> +		vfio_iommu_unpin_api_domain(domain);
>> +		kfree(domain);
>> +		iommu->mediated_domain = NULL;
>> +	}
>> +#endif
> 
> I'm not really seeing how this is all that much more maintainable than
> what was proposed previously, has this aspect been worked on since last
> I reviewed this patch?
> 

There aren't many changes from v4 to v5 of this patch.
Can you be more specific about your maintainability concerns? I'll
definitely address them.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/3] VFIO driver for mediated PCI device
  2016-06-24 19:45         ` [Qemu-devel] " Alex Williamson
@ 2016-06-28 18:45           ` Kirti Wankhede
  -1 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-28 18:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi



On 6/25/2016 1:15 AM, Alex Williamson wrote:
> On Sat, 25 Jun 2016 00:04:27 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 

>>>> +
>>>> +static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
>>>> +{
>>>> +	/* Don't support MSIX for now */
>>>> +	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
>>>> +		return -1;
>>>> +
>>>> +	return 1;  
>>>
>>> Too much hard coding here, the mediated driver should define this.
>>>   
>>
>> I'm testing INTX and MSI, I don't have a way to test MSIX for now. So we
>> thought we can add supported for MSIX later. Till then hard code it to 1.
> 
> To me it screams that there needs to be an interface to the mediated
> device here.  How do you even know that the mediated device intends to
> support MSI?  What if it wants to emulated a VF and not support INTx?
> This is basically just a big "TODO" flag that needs to be addressed
> before a non-RFC.
> 

The VFIO user space app reads the emulated PCI config space of the
mediated device. When the MSI capability (PCI_CAP_ID_MSI) is present in
the PCI capability list, it calls the VFIO_DEVICE_SET_IRQS ioctl with
irq_set->index set to VFIO_PCI_MSI_IRQ_INDEX.
Similarly, MSI-X is identified from the emulated config space of the
mediated device by checking whether the capability is present, and the
number of vectors is extracted from the PCI_MSI_FLAGS_QSIZE field.
The vfio_mpci module doesn't need to query this from the vendor driver
of the mediated device. Depending on which interrupts it wants to
support, the mediated driver should emulate the PCI config space
accordingly.

Thanks,
Kirti.




^ permalink raw reply	[flat|nested] 51+ messages in thread


* Re: [PATCH 3/3] VFIO Type1 IOMMU: Add support for mediated devices
  2016-06-28 13:02       ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-29  2:46         ` Alex Williamson
  -1 siblings, 0 replies; 51+ messages in thread
From: Alex Williamson @ 2016-06-29  2:46 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

On Tue, 28 Jun 2016 18:32:44 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/22/2016 9:16 AM, Alex Williamson wrote:
> > On Mon, 20 Jun 2016 22:01:48 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >>  
> >>  struct vfio_iommu {
> >>  	struct list_head	domain_list;
> >> +	struct vfio_domain	*mediated_domain;  
> > 
> > I'm not really a fan of how this is so often used to special case the
> > code...
> >   
> >>  	struct mutex		lock;
> >>  	struct rb_root		dma_list;
> >>  	bool			v2;
> >> @@ -67,6 +69,13 @@ struct vfio_domain {
> >>  	struct list_head	group_list;
> >>  	int			prot;		/* IOMMU_CACHE */
> >>  	bool			fgsp;		/* Fine-grained super pages */
> >> +
> >> +	/* Domain for mediated device which is without physical IOMMU */
> >> +	bool			mediated_device;  
> > 
> > But sometimes we use this to special case the code and other times we
> > use domain_list being empty.  I thought the argument against pulling
> > code out to a shared file was that this approach could be made
> > maintainable.
> >   
> 
> Functions where struct vfio_domain *domain is argument which are
> intended to perform for that domain only, checked if
> (domain->mediated_device), like map_try_harder(), vfio_iommu_replay(),
> vfio_test_domain_fgsp(). Checks in these functions can be removed but
> then it would be callers responsibility to make sure that they don't
> call these functions for mediated_domain.
> Whereas functions where struct vfio_iommu *iommu is argument and
> domain_list is traversed to find domain or perform for each domain in
> domain_list, checked if (list_empty(&iommu->domain_list)), like
> vfio_unmap_unpin(), vfio_iommu_map(), vfio_dma_do_map().

My point is that we have different test elements at different points in
the data structures and they all need to be kept in sync and the right
one used at the right place, which makes the code all that much more
complex versus the alternative approach of finding commonality,
extracting it into a shared file, and creating a mediated version of
the type1 iommu that doesn't try to overload dual functionality into a
single code block. 

> >> +
> >> +	struct mm_struct	*mm;
> >> +	struct rb_root		pfn_list;	/* pinned Host pfn list */
> >> +	struct mutex		pfn_list_lock;	/* mutex for pfn_list */  
> > 
> > Seems like we could reduce overhead for the existing use cases by just
> > adding a pointer here and making these last 3 entries part of the
> > structure that gets pointed to.  Existence of the pointer would replace
> > @mediated_device.
> >  
> 
> Ok.
> 
> >>  };
> >>  
> >>  struct vfio_dma {
> >> @@ -79,10 +88,26 @@ struct vfio_dma {
> >>  
> >>  struct vfio_group {
> >>  	struct iommu_group	*iommu_group;
> >> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)  
> > 
> > Where does CONFIG_MDEV_MODULE come from?
> > 
> > Plus, all the #ifdefs... <cringe>
> >   
> 
> Config option MDEV is tristate and when selected as module
> CONFIG_MDEV_MODULE is set in include/generated/autoconf.h.
> Symbols mdev_bus_type, mdev_get_device_by_group() and mdev_put_device()
> are only available when MDEV option is selected as built-in or modular.
> If MDEV option is not selected, vfio_iommu_type1 modules should still
> work for device direct assignment. If these #ifdefs are not there
> vfio_iommu_type1 module fails to load with undefined symbols when MDEV
> is not selected.

I guess I just hadn't seen the _MODULE define used before, but it does
appear to be fairly common.  Another option might be to provide stubs
or static inline abstractions in a header file so the #ifdefs can be
isolated.  It also seems like this is going to mean that type1 now
depends on and will autoload the mdev module even for physical
assignment.  That's not terribly desirable.
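
A sketch of the stub-header idea (hedged — the prototypes mirror the
symbols named in this thread, but the header layout itself is
hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of an mdev header that isolates the config dependency:
 * callers compile unconditionally and the #ifdefs live in one place. */
#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
struct mdev_device *mdev_get_device_by_group(struct iommu_group *grp);
void mdev_put_device(struct mdev_device *mdev);
#else
struct mdev_device;
struct iommu_group;

static inline struct mdev_device *
mdev_get_device_by_group(struct iommu_group *grp)
{
	(void)grp;
	return NULL;	/* no mdev support configured */
}

static inline void mdev_put_device(struct mdev_device *mdev)
{
	(void)mdev;	/* no-op when mdev support is compiled out */
}
#endif
```

With stubs like these, vfio_iommu_type1.c could drop its inline
`#ifdef` blocks, though the autoload concern raised above would remain.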

> >> +	struct mdev_device	*mdev;  
> > 
> > This gets set on attach_group where we use the iommu_group to lookup
> > the mdev, so why can't we do that on the other paths that make use of
> > this?  I think this is just holding a reference.
> >   
> 
> mdev is retrieved from attach_group for 2 reasons:
> 1. to increase the ref count of mdev, mdev_get_device_by_group(), when
> its iommu_group is attached. That should be decremented, by
> mdev_put_device(), from detach while detaching its iommu_group. This is
> make sure that mdev is not freed until it's iommu_group is detached from
> the container.
> 
> 2. save reference to iommu_data so that vendor driver would use to call
> vfio_pin_pages() and vfio_unpin_pages(). More details below.
> 
> 
> 
> >> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> >> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> >> +			 int prot, unsigned long *pfn)
> >>  {
> >>  	struct page *page[1];
> >>  	struct vm_area_struct *vma;
> >> +	struct mm_struct *local_mm = mm;
> >>  	int ret = -EFAULT;
> >>  
> >> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> >> +	if (!local_mm && !current->mm)
> >> +		return -ENODEV;
> >> +
> >> +	if (!local_mm)
> >> +		local_mm = current->mm;  
> > 
> > The above would be much more concise if we just initialized local_mm
> > as: mm ? mm : current->mm
> >   
> >> +
> >> +	down_read(&local_mm->mmap_sem);
> >> +	if (get_user_pages_remote(NULL, local_mm, vaddr, 1,
> >> +				!!(prot & IOMMU_WRITE), 0, page, NULL) == 1) {  
> > 
> > Um, the comment for get_user_pages_remote says:
> > 
> > "See also get_user_pages_fast, for performance critical applications."
> > 
> > So what penalty are we imposing on the existing behavior of type1
> > here?  Previously we only needed to acquire mmap_sem if
> > get_user_pages_fast() didn't work, so the existing use case seems to be
> > compromised.
> >  
> 
> Yes.
> get_user_pages_fast() pins pages from current->mm, but for mediated
> device mm could be different than current->mm.
> 
> This penalty for existing behavior could be avoided by:
> if (!mm && current->mm)
>     get_user_pages_fast(); //take fast path
> else
>     get_user_pages_remote(); // take slow path


How to avoid it is pretty obvious, the concern is that overhead of the
existing use case isn't being prioritized.

> >> +long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
> >> +		     int prot)
> >> +{
> >> +	struct vfio_iommu *iommu = iommu_data;
> >> +	struct vfio_domain *domain = NULL;
> >> +	long unlocked = 0;
> >> +	int i;
> >> +
> >> +	if (!iommu || !pfn)
> >> +		return -EINVAL;
> >> +
> >> +	if (!iommu->mediated_domain)
> >> +		return -EINVAL;
> >> +
> >> +	domain = iommu->mediated_domain;  
> > 
> > Again, domain is already validated here.
> >   
> >> +
> >> +	for (i = 0; i < npage; i++) {
> >> +		struct vfio_pfn *p;
> >> +
> >> +		/* verify if pfn exist in pfn_list */
> >> +		p = vfio_find_pfn(domain, *(pfn + i));  
> > 
> > Why are we using array indexing above and array math here?  Were these
> > functions written by different people?
> >   
> 
> No, input argument to vfio_unpin_pages() was always array of pfns to be
> unpinned.

My comment is in regard to how the code added for vfio_pin_pages() uses
array indexing (ie, pfn[i]) while here we use array math (ie, *(pfn +
i)).  Don't arbitrarily mix them, be consistent.
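
For the record, the two spellings are semantically identical in C —
`E1[E2]` is defined as `(*((E1)+(E2)))` — so the request is purely about
stylistic consistency:

```c
#include <assert.h>

/* Same access, two spellings, as mixed in the patch under review. */
static unsigned long pick_indexed(const unsigned long *pfn, int i)
{
	return pfn[i];		/* array indexing, as in vfio_pin_pages() */
}

static unsigned long pick_pointer(const unsigned long *pfn, int i)
{
	return *(pfn + i);	/* pointer math, as in vfio_unpin_pages() */
}
```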

> >> +		if (!p)
> >> +			continue;  
> > 
> > Hmm, this seems like more of a bad thing than a continue.
> >   
> 
> Caller of vfio_unpin_pages() are other modules. I feel its better to do
> sanity check than crash.

Who said anything about crashing?  We have a return value for a reason,
right?

> >>  
> >>  static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >>  {
> >> @@ -341,6 +580,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
> >>  
> >>  	if (!dma->size)
> >>  		return;
> >> +
> >> +	if (list_empty(&iommu->domain_list))
> >> +		return;  
> > 
> > Huh?  This would be a serious consistency error if this happened for
> > the existing use case.
> >  
> 
> This will not happen for existing use case, i.e. device direct
> assignment. This case is true when there is only mediated device
> assigned and there are no direct assigned devices.

Which is sort of my point, you're using an arbitrary property to
identify a mediated device vfio_iommu vs a directed assigned one.  This
fits in with my complaint of how many different special cases are being
thrown into the code which increases the overall code complexity.

> >> @@ -569,6 +819,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>  	uint64_t mask;
> >>  	struct vfio_dma *dma;
> >>  	unsigned long pfn;
> >> +	struct vfio_domain *domain = NULL;
> >>  
> >>  	/* Verify that none of our __u64 fields overflow */
> >>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> >> @@ -611,10 +862,21 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>  	/* Insert zero-sized and grow as we map chunks of it */
> >>  	vfio_link_dma(iommu, dma);
> >>  
> >> +	/*
> >> +	 * Skip pin and map if and domain list is empty
> >> +	 */
> >> +	if (list_empty(&iommu->domain_list)) {
> >> +		dma->size = size;
> >> +		goto map_done;
> >> +	}  
> > 
> > Again, this would be a serious consistency error for the existing use
> > case.  Let's use indicators that are explicit.
> >  
> 
> Why? for existing use case (i.e. direct device assignment) domain_list
> will not be empty, domain_list will only be empty when there is mediated
> device assigned and no direct device assigned.

I'm trying to be cautious whether it actually makes sense to
dual-purpose the existing type1 iommu backend.  What's the benefit of
putting an exit path in the middle of a function versus splitting it in
two separate functions with two separate callers, one of which only
calls the first function.  What's the benefit of a struct vfio_iommu
that hosts both mediated and directly assigned devices?  Is the
benefit to the functionality or to the code base?  Should the fact that
they use the same iommu API dictate that they're also managed in the
same data structures?  When we have too many special case exits or
branches, the code complexity increases, bugs are harder to flush out,
and possible exploits are more likely.  Convince me that this is the
right approach.
 
> >>  static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>  					 struct iommu_group *iommu_group)
> >>  {
> >>  	struct vfio_iommu *iommu = iommu_data;
> >> -	struct vfio_group *group, *g;
> >> +	struct vfio_group *group;
> >>  	struct vfio_domain *domain, *d;
> >>  	struct bus_type *bus = NULL;
> >>  	int ret;
> >> @@ -746,14 +1030,21 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>  	mutex_lock(&iommu->lock);
> >>  
> >>  	list_for_each_entry(d, &iommu->domain_list, next) {
> >> -		list_for_each_entry(g, &d->group_list, next) {
> >> -			if (g->iommu_group != iommu_group)
> >> -				continue;
> >> +		if (is_iommu_group_present(d, iommu_group)) {
> >> +			mutex_unlock(&iommu->lock);
> >> +			return -EINVAL;
> >> +		}
> >> +	}
> >>  
> >> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> >> +	if (iommu->mediated_domain) {
> >> +		if (is_iommu_group_present(iommu->mediated_domain,
> >> +					   iommu_group)) {
> >>  			mutex_unlock(&iommu->lock);
> >>  			return -EINVAL;
> >>  		}
> >>  	}
> >> +#endif
> >>  
> >>  	group = kzalloc(sizeof(*group), GFP_KERNEL);
> >>  	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> >> @@ -769,6 +1060,36 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>  	if (ret)
> >>  		goto out_free;
> >>  
> >> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> >> +	if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
> >> +		struct mdev_device *mdev = NULL;  
> > 
> > Unnecessary initialization.
> >   
> >> +
> >> +		mdev = mdev_get_device_by_group(iommu_group);
> >> +		if (!mdev)
> >> +			goto out_free;
> >> +
> >> +		mdev->iommu_data = iommu;  
> > 
> > This looks rather sketchy to me, we don't have a mediated driver in
> > this series, but presumably the driver blindly calls vfio_pin_pages
> > passing mdev->iommu_data and hoping that it's either NULL to generate
> > an error or relevant to this iommu backend.  How would we add a second
> > mediated driver iommu backend?  We're currently assuming the user
> > configured this backend.    
> 
> If I understand correctly, your question is about two different mediated
> devices being assigned to the same container. In such a case, the two
> mediated devices will have different iommu_groups and will be added to
> mediated_domain's group_list (iommu->mediated_domain->group_list).

No, my concern is that mdev->iommu_data is opaque data specific to the
type1 extensions here, but vfio_pin_pages() is effectively a completely
separate API.  Mediated devices end up with sort of a side channel
into this one iommu, which breaks the modularity of vfio iommus.  So
let's say we create a type2 interface that also supports mediated
devices, do the mediated drivers still call vfio_pin_pages()?  Do we
need to export new interfaces for every iommu backend to support this?
The interface should probably be through the vfio container, IOW
extensions to the vfio_iommu_driver_ops or maybe direct use of the
ioctl callback within that interface such that the pinning is actually
paired with the correct driver and extensible.
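[A minimal userspace sketch of the ops-routed interface described above;
none of the names below are the real vfio symbols, they only show the
dispatch pattern: each iommu backend supplies its own pin callback, and a
mediated vendor driver goes through the container instead of calling a
backend-specific export, so a type1 and a hypothetical type2 backend can
coexist behind one interface.]

```c
#include <assert.h>
#include <stddef.h>

struct iommu_driver_ops {
	long (*pin_pages)(void *iommu_data, unsigned long *pfns, long npage);
};

struct vfio_container {
	const struct iommu_driver_ops *ops;
	void *iommu_data;
};

/* The mediated vendor driver calls this, never a backend directly. */
long container_pin_pages(struct vfio_container *c,
			 unsigned long *pfns, long npage)
{
	if (!c->ops || !c->ops->pin_pages)
		return -1;	/* this backend does not support pinning */
	return c->ops->pin_pages(c->iommu_data, pfns, npage);
}

/* A toy "type1"-style backend that just counts what it pinned. */
static long type1_pin_pages(void *iommu_data, unsigned long *pfns, long npage)
{
	long *pinned = iommu_data;
	(void)pfns;
	*pinned += npage;
	return npage;
}

const struct iommu_driver_ops type1_ops = {
	.pin_pages = type1_pin_pages,
};
```

The pinning request always lands on whichever backend the user actually
configured for the container, which is the pairing being asked for above.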
 
> > Should vfio_pin_pages instead have a struct
> > device* parameter from which we would lookup the iommu_group and get to
> > the vfio_domain?  That's a bit heavy weight, but we need something
> > along those lines.
> >   
> 
> There could be multiple mdev devices from the same mediated vendor
> driver in one container. In that case, that vendor driver needs a
> reference to the container or container->iommu_data to pin and unpin
> pages.
> Similarly, there could be multiple mdev devices from different mediated
> vendor drivers in one container; in that case both vendor drivers need
> a reference to the container or container->iommu_data to pin and unpin
> pages in their drivers.

As above, I think something like this is necessary, the proposed
interface here is a bit of a side-channel hack.

> >> +		mdev->iommu_data = iommu;  
> With the above line, a reference to container->iommu_data is kept in
> mdev structure when the iommu_group is attached to a container so that
> vendor drivers can find reference to pin and unpin pages.
> 
> If struct device* is passed as an argument to vfio_pin_pages, then to
> find the container of that struct device *dev we have to find its
> vfio_device/vfio_group, which means traversing vfio.group_list for
> each pin and unpin call. This list would be long when there are many
> mdev devices in the system.
> 
> Is there any better way to find a reference to the container from a
> struct device*?

struct vfio_device *device = dev_get_drvdata(mdev->dev);

device->group->container

Note of course that vfio_device and vfio_group are private to vfio.c.
We do already have a vfio_device_get_from_dev() function though, so the
vendor could call into a vfio.c function with a reference to the
vfio_device.
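[The lookup chain suggested here can be modeled in a few stand-in structs.
The real vfio_device and vfio_group are private to vfio.c, as noted above;
these types only mirror the shape of the device -> group -> container walk
and are not the kernel definitions.]

```c
#include <assert.h>
#include <stddef.h>

struct vfio_container { int id; };
struct vfio_group { struct vfio_container *container; };
struct vfio_device { struct vfio_group *group; };
struct device { void *drvdata; };	/* stand-in for struct device */

static void *dev_get_drvdata(const struct device *dev)
{
	return dev->drvdata;
}

/* Analogous to a vfio.c helper built on vfio_device_get_from_dev():
 * the vendor driver hands in a struct device and gets the container,
 * without ever seeing the private vfio_device/vfio_group types. */
struct vfio_container *container_from_dev(const struct device *dev)
{
	struct vfio_device *vdev = dev_get_drvdata(dev);

	return vdev ? vdev->group->container : NULL;
}
```

A constant-time pointer walk like this is what makes the struct device*
parameter viable without traversing vfio.group_list on every pin call.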

> >>  
> >> @@ -930,8 +1289,28 @@ static void vfio_iommu_type1_release(void *iommu_data)
> >>  	struct vfio_domain *domain, *domain_tmp;
> >>  	struct vfio_group *group, *group_tmp;
> >>  
> >> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> >> +	if (iommu->mediated_domain) {
> >> +		domain = iommu->mediated_domain;
> >> +		list_for_each_entry_safe(group, group_tmp,
> >> +					 &domain->group_list, next) {
> >> +			if (group->mdev) {
> >> +				group->mdev->iommu_data = NULL;
> >> +				mdev_put_device(group->mdev);
> >> +			}
> >> +			list_del(&group->next);
> >> +			kfree(group);
> >> +		}
> >> +		vfio_iommu_unpin_api_domain(domain);
> >> +		kfree(domain);
> >> +		iommu->mediated_domain = NULL;
> >> +	}
> >> +#endif  
> > 
> > I'm not really seeing how this is all that much more maintainable than
> > what was proposed previously, has this aspect been worked on since last
> > I reviewed this patch?
> >   
> 
> There aren't many changes from v4 to v5 of this patch.
> Can you be more specific on your concerns about maintainability? I'll
> definitely address them.

In reply to comments on v3 of the series you said you'd prefer to work
on making the bimodal approach more maintainable rather than splitting
out common code and creating a separate module for type1-direct vs
type1-mediated.  To make the code more maintainable, I think we need to
see fewer special cases, clean data paths for each type, with some
attention paid to how we detect one type vs another.  The killer
feature that would really justify the complexity of this approach
would be if locked page accounting avoided duplicate counts between
interfaces.  As it is, vfio_pin_pages() calls
vfio_pin_pages_internal(), which adds to the user's locked_vm regardless
of whether those pages are already locked by a direct assigned device,
or even by a previous call to vfio_pin_pages() for the same range,
perhaps by a different mediated device.  So we end up with userspace
needing to set the locked memory limit once for any number of direct
assigned devices, but then bump it up for each mediated device, by a
value which may depend on the type of mediated device.  That's clearly
a rat's nest for userspace to guess what it needs to do, and avoiding
it would pretty strongly justify such tight integration.  Thanks,

Alex
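[The deduplicated accounting Alex asks for above can be sketched with a
refcounted pfn table: charge the user's locked_vm only on a pfn's first
pin and uncharge only when the last pinner drops it.  All names here are
invented, the fixed-size array stands in for the patch's rb_root pfn_list,
and bounds checking is omitted for brevity.]

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PFNS 64

struct pfn_entry { unsigned long pfn; int refcnt; };

struct accounting {
	struct pfn_entry entries[MAX_PFNS];
	int nentries;
	long locked_vm;
};

static struct pfn_entry *find_pfn(struct accounting *a, unsigned long pfn)
{
	for (int i = 0; i < a->nentries; i++)
		if (a->entries[i].pfn == pfn)
			return &a->entries[i];
	return NULL;
}

void pin_pfn(struct accounting *a, unsigned long pfn)
{
	struct pfn_entry *e = find_pfn(a, pfn);

	if (e) {
		e->refcnt++;		/* already charged: just take a ref */
		return;
	}
	a->entries[a->nentries++] = (struct pfn_entry){ pfn, 1 };
	a->locked_vm++;			/* first pin: charge the user once */
}

void unpin_pfn(struct accounting *a, unsigned long pfn)
{
	struct pfn_entry *e = find_pfn(a, pfn);

	if (e && --e->refcnt == 0) {
		*e = a->entries[--a->nentries];	/* swap-delete the entry */
		a->locked_vm--;			/* last unpin: uncharge */
	}
}
```

With this shape, a second mediated device pinning an already-pinned range
bumps refcounts but never locked_vm, so userspace would set one limit
regardless of how many devices share the container.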

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/3] VFIO driver for mediated PCI device
  2016-06-28 18:45           ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-29  2:54             ` Alex Williamson
  -1 siblings, 0 replies; 51+ messages in thread
From: Alex Williamson @ 2016-06-29  2:54 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

On Wed, 29 Jun 2016 00:15:23 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 6/25/2016 1:15 AM, Alex Williamson wrote:
> > On Sat, 25 Jun 2016 00:04:27 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> 
> >>>> +
> >>>> +static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
> >>>> +{
> >>>> +	/* Don't support MSIX for now */
> >>>> +	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
> >>>> +		return -1;
> >>>> +
> >>>> +	return 1;    
> >>>
> >>> Too much hard coding here, the mediated driver should define this.
> >>>     
> >>
> >> I'm testing INTX and MSI, I don't have a way to test MSIX for now. So we
> >> thought we can add support for MSIX later. Till then it is hard coded to 1.  
> > 
> > To me it screams that there needs to be an interface to the mediated
> > device here.  How do you even know that the mediated device intends to
> > support MSI?  What if it wants to emulate a VF and not support INTx?
> > This is basically just a big "TODO" flag that needs to be addressed
> > before a non-RFC.
> >   
> 
> VFIO user space app reads the emulated PCI config space of the mediated
> device. When the MSI capability (PCI_CAP_ID_MSI) is present in the PCI
> capability list, it calls the VFIO_DEVICE_SET_IRQS ioctl with
> irq_set->index set to VFIO_PCI_MSI_IRQ_INDEX.
> Similarly, MSIX is identified from the emulated config space of the
> mediated device by checking whether the MSIX capability is present, with
> the number of vectors extracted from the PCI_MSI_FLAGS_QSIZE field.
> vfio_mpci modules don't need to query it from the vendor driver of the
> mediated device. Depending on which interrupts to support, the mediated
> driver should emulate the PCI config space.

Are you suggesting that if the user can determine which interrupts are
supported and the various counts for each by querying the PCI config
space of the mediated device then this interface should do the same,
much like vfio_pci_get_irq_count(), such that it can provide results
consistent with config space?  That I'm ok with.  Having the user find
one IRQ count as they read PCI config space and another via the vfio
API, I'm not ok with.  Thanks,

Alex
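[Deriving an IRQ count from emulated config space, as discussed above, is
essentially a capability-list walk.  Here is a rough userspace sketch in
the spirit of vfio_pci_get_irq_count(): find PCI_CAP_ID_MSI and decode
the Multiple Message Capable field (bits 3:1 of Message Control) as
1 << value.  The config buffer is hand-built for illustration, not a real
device, and the walk assumes a well-formed capability list.]

```c
#include <assert.h>
#include <stdint.h>

#define PCI_CAPABILITY_LIST	0x34	/* offset of first capability ptr */
#define PCI_CAP_ID_MSI		0x05
#define PCI_MSI_FLAGS		2	/* Message Control offset in cap */
#define PCI_MSI_FLAGS_QMASK	0x000e	/* Multiple Message Capable field */

/* Walk the capability list and report how many MSI vectors the
 * (emulated) config space advertises; 0 if there is no MSI capability. */
int msi_vector_count(const uint8_t *cfg)
{
	uint8_t pos = cfg[PCI_CAPABILITY_LIST];

	while (pos) {
		if (cfg[pos] == PCI_CAP_ID_MSI) {
			uint16_t flags = cfg[pos + PCI_MSI_FLAGS] |
					 (cfg[pos + PCI_MSI_FLAGS + 1] << 8);
			return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
		}
		pos = cfg[pos + 1];	/* next capability pointer */
	}
	return 0;
}
```

Because the count comes straight out of the same config space the user
reads, the vfio API answer stays consistent with what the guest sees.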

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] Mediated device Core driver
  2016-06-20 16:31   ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-29 13:51     ` Xiao Guangrong
  -1 siblings, 0 replies; 51+ messages in thread
From: Xiao Guangrong @ 2016-06-29 13:51 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: shuai.ruan, jike.song, kvm, kevin.tian, qemu-devel, zhiyuan.lv, bjsdjshi



On 06/21/2016 12:31 AM, Kirti Wankhede wrote:
> Design for Mediated Device Driver:
> The main purpose of this driver is to provide a common interface for
> mediated device management that can be used by different drivers of
> different devices.
>
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
>
> Below is the high-level block diagram, with Nvidia, Intel and IBM devices
> as example, since these are the devices which are going to actively use
> this module as of now.
>
>   +---------------+
>   |               |
>   | +-----------+ |  mdev_register_driver() +--------------+
>   | |           | +<------------------------+ __init()     |
>   | |           | |                         |              |
>   | |  mdev     | +------------------------>+              |<-> VFIO user
>   | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
>   | |  driver   | |                         |              |
>   | |           | |                         +--------------+
>   | |           | |  mdev_register_driver() +--------------+
>   | |           | +<------------------------+ __init()     |
>   | |           | |                         |              |
>   | |           | +------------------------>+              |<-> VFIO user
>   | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
>   |               |                         |              |
>   |  MDEV CORE    |                         +--------------+
>   |   MODULE      |
>   |   mdev.ko     |
>   | +-----------+ |  mdev_register_device() +--------------+
>   | |           | +<------------------------+              |
>   | |           | |                         |  nvidia.ko   |<-> physical
>   | |           | +------------------------>+              |    device
>   | |           | |        callback         +--------------+
>   | | Physical  | |
>   | |  device   | |  mdev_register_device() +--------------+
>   | | interface | |<------------------------+              |
>   | |           | |                         |  i915.ko     |<-> physical
>   | |           | +------------------------>+              |    device
>   | |           | |        callback         +--------------+
>   | |           | |
>   | |           | |  mdev_register_device() +--------------+
>   | |           | +<------------------------+              |
>   | |           | |                         | ccw_device.ko|<-> physical
>   | |           | +------------------------>+              |    device
>   | |           | |        callback         +--------------+
>   | +-----------+ |
>   +---------------+
>
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
>
> /**
>    * struct mdev_driver - Mediated device's driver
>    * @name: driver name
>    * @probe: called when new device created
>    * @remove: called when device removed
>    * @match: called when new device or driver is added for this bus.
> 	    Return 1 if given device can be handled by given driver and
> 	    zero otherwise.
>    * @driver: device driver structure
>    *
>    **/
> struct mdev_driver {
>           const char *name;
>           int  (*probe)  (struct device *dev);
>           void (*remove) (struct device *dev);
>           int  (*match)  (struct device *dev);
>           struct device_driver    driver;
> };
>
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
>
> A mediated bus driver for mdev should use this interface to register
> with the core driver. With this, the mediated bus driver is responsible
> for adding the mediated device to the VFIO group.
>
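For illustration, a minimal registration through this interface could look
like the sketch below (hypothetical sample; all `sample_*` names are made up,
only mdev_register_driver()/mdev_unregister_driver() come from the patch):

```c
/* Hypothetical minimal mediated bus driver built on the interface above. */
static int sample_probe(struct device *dev)
{
	/* claim the mdev device; VFIO plumbing would be set up here */
	return 0;
}

static void sample_remove(struct device *dev)
{
}

static int sample_match(struct device *dev)
{
	return 1;	/* accept every device on the mdev bus */
}

static struct mdev_driver sample_mdev_driver = {
	.name   = "sample_mdev",
	.probe  = sample_probe,
	.remove = sample_remove,
	.match  = sample_match,
};

static int __init sample_init(void)
{
	return mdev_register_driver(&sample_mdev_driver, THIS_MODULE);
}

static void __exit sample_exit(void)
{
	mdev_unregister_driver(&sample_mdev_driver);
}
```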
> 2. Physical device driver interface
> This interface provides the vendor driver a set of APIs to manage
> physical-device-related work in its own driver. The APIs are:
> - supported_config: provide supported configuration list by the vendor
> 		    driver
> - create: to allocate basic resources in vendor driver for a mediated
> 	  device.
> - destroy: to free resources in vendor driver when mediated device is
> 	   destroyed.
> - start: to initiate mediated device initialization process from vendor
> 	 driver when VM boots and before QEMU starts.
> - shutdown: to tear down mediated device resources during VM teardown.
> - read: read emulation callback.
> - write: write emulation callback.
> - set_irqs: send interrupt configuration information that QEMU sets.
> - get_region_info: to provide region size and its flags for the mediated
> 		   device.
> - validate_map_request: to validate remap pfn request.
>
> This registration interface should be used by vendor drivers to register
> each physical device to mdev core driver.
>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I73a5084574270b14541c529461ea2f03c292d510
> ---
>   drivers/vfio/Kconfig             |   1 +
>   drivers/vfio/Makefile            |   1 +
>   drivers/vfio/mdev/Kconfig        |  11 +
>   drivers/vfio/mdev/Makefile       |   5 +
>   drivers/vfio/mdev/mdev_core.c    | 595 +++++++++++++++++++++++++++++++++++++++
>   drivers/vfio/mdev/mdev_driver.c  | 138 +++++++++
>   drivers/vfio/mdev/mdev_private.h |  33 +++
>   drivers/vfio/mdev/mdev_sysfs.c   | 300 ++++++++++++++++++++
>   include/linux/mdev.h             | 232 +++++++++++++++
>   9 files changed, 1316 insertions(+)
>   create mode 100644 drivers/vfio/mdev/Kconfig
>   create mode 100644 drivers/vfio/mdev/Makefile
>   create mode 100644 drivers/vfio/mdev/mdev_core.c
>   create mode 100644 drivers/vfio/mdev/mdev_driver.c
>   create mode 100644 drivers/vfio/mdev/mdev_private.h
>   create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>   create mode 100644 include/linux/mdev.h
>
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
>
>   source "drivers/vfio/pci/Kconfig"
>   source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>   source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..7c70753e54ab 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>   obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>   obj-$(CONFIG_VFIO_PCI) += pci/
>   obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..951e2bb06a3f
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,11 @@
> +
> +config MDEV
> +    tristate "Mediated device driver framework"
> +    depends on VFIO
> +    default n
> +    help
> +        MDEV provides a framework to virtualize devices without SR-IOV
> +        capability. See Documentation/mdev.txt for more details.
> +
> +        If you don't know what to do here, say N.
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..2c6d11f7bc24
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,5 @@
> +
> +mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
> +
> +obj-$(CONFIG_MDEV) += mdev.o
> +
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> new file mode 100644
> index 000000000000..3c45ed2ae1e9
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -0,0 +1,595 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/cdev.h>
> +#include <linux/sched.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +#define MDEV_CLASS_NAME		"mdev"
> +
> +static struct devices_list {
> +	struct list_head    dev_list;
> +	struct mutex        list_lock;
> +} parent_devices;
> +
> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}
> +

Would it be better to use device_add_groups()/device_remove_groups() instead?

> +static struct mdev_device *find_mdev_device(struct parent_device *parent,
> +					    uuid_le uuid, int instance)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +
> +	list_for_each_entry(p, &parent->mdev_list, next) {
> +		if ((uuid_le_cmp(p->uuid, uuid) == 0) &&
> +		    (p->instance == instance)) {
> +			mdev = p;
> +			break;
> +		}
> +	}
> +	return mdev;
> +}
> +
> +/* Should be called holding parent_devices.list_lock */
> +static struct parent_device *find_parent_device(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	WARN_ON(!mutex_is_locked(&parent_devices.list_lock));

This is a static function; do we really need this assert?

> +	list_for_each_entry(p, &parent_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			parent = p;
> +			break;
> +		}
> +	}
> +	return parent;
> +}
> +
> +static void mdev_release_parent(struct kref *kref)
> +{
> +	struct parent_device *parent = container_of(kref, struct parent_device,
> +						    ref);
> +	kfree(parent);
> +}
> +
> +static
> +inline struct parent_device *mdev_get_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_get(&parent->ref);
> +
> +	return parent;
> +}
> +
> +static inline void mdev_put_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_put(&parent->ref, mdev_release_parent);
> +}
> +
> +static struct parent_device *mdev_get_parent_by_dev(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(p, &parent_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			parent = mdev_get_parent(p);
> +			break;
> +		}
> +	}

This loop duplicates find_parent_device(); it could just call that directly.

> +	mutex_unlock(&parent_devices.list_lock);
> +	return parent;
> +}
> +
> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->create) {
> +		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
> +					mdev->instance, mdev_params);

I think it would be better to pass @mdev to this callback; then the parent
driver can do its device-specific setup and associate it with the instance,
e.g. via mdev->private.
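Something along these lines (a sketch; a field such as mdev->private is
hypothetical and not in the posted patch):

```c
/* Suggested ops change: pass the mdev itself, so the vendor driver can
 * associate per-instance state with it.
 */
int (*create)(struct mdev_device *mdev, char *mdev_params);

/* ... and the call site in mdev_device_create_ops() would become: */
ret = parent->ops->create(mdev, mdev_params);
```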

> +		if (ret)
> +			goto create_ops_err;
> +	}
> +
> +	ret = mdev_add_attribute_group(&mdev->dev,
> +					parent->ops->mdev_attr_groups);
> +create_ops_err:
> +	mutex_unlock(&parent->ops_lock);
> +	return ret;
> +}
> +
> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	/*
> +	 * If vendor driver doesn't return success that means vendor
> +	 * driver doesn't support hot-unplug
> +	 */
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->destroy) {
> +		ret = parent->ops->destroy(parent->dev, mdev->uuid,
> +					   mdev->instance);
> +		if (ret && !force) {
> +			ret = -EBUSY;
> +			goto destroy_ops_err;
> +		}
> +	}
> +	mdev_remove_attribute_group(&mdev->dev,
> +				    parent->ops->mdev_attr_groups);
> +destroy_ops_err:
> +	mutex_unlock(&parent->ops_lock);
> +
> +	return ret;
> +}
> +
> +static void mdev_release_device(struct kref *kref)
> +{
> +	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
> +	struct parent_device *parent = mdev->parent;
> +
> +	device_unregister(&mdev->dev);
> +	wake_up(&parent->release_done);
> +	mdev_put_parent(parent);
> +}
> +
> +struct mdev_device *mdev_get_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_get(&mdev->ref);
> +
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device);
> +
> +void mdev_put_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_put(&mdev->ref, mdev_release_device);
> +}
> +EXPORT_SYMBOL(mdev_put_device);
> +
> +/*
> + * Find first mediated device from given uuid and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +static struct mdev_device *mdev_get_first_device_by_uuid(uuid_le uuid)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(parent, &parent_devices.dev_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (uuid_le_cmp(p->uuid, uuid) == 0) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return mdev;
> +}
> +
> +/*
> + * Find mediated device from given iommu_group and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(parent, &parent_devices.dev_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (!p->group)
> +				continue;
> +
> +			if (iommu_group_id(p->group) == iommu_group_id(group)) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device_by_group);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +
> +	/* Check for duplicate */
> +	parent = find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +	list_add(&parent->next, &parent_devices.dev_list);
> +	mutex_unlock(&parent_devices.list_lock);

This is not safe, as Alex already pointed out.

> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +	mutex_init(&parent->ops_lock);
> +	mutex_init(&parent->mdev_list_lock);
> +	INIT_LIST_HEAD(&parent->mdev_list);
> +	init_waitqueue_head(&parent->release_done);

And there is no lock protecting these initializations, even though the parent
is already visible on the list at this point.
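One way to address this would be to fully initialize the parent before
publishing it on the list (a sketch against the quoted code, reusing only
fields that already exist in the patch):

```c
kref_init(&parent->ref);
parent->dev = dev;
parent->ops = ops;
mutex_init(&parent->ops_lock);
mutex_init(&parent->mdev_list_lock);
INIT_LIST_HEAD(&parent->mdev_list);
init_waitqueue_head(&parent->release_done);

/* publish only after the structure is fully set up */
list_add(&parent->next, &parent_devices.dev_list);
mutex_unlock(&parent_devices.list_lock);
```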

> +
> +	ret = mdev_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_lock(&parent_devices.list_lock);
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_devices.list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_devices.list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a parent device
> + * @dev: device structure representing parent device.
> + *
> + * Remove device from list of registered parent devices. Give a chance to free
> + * existing mediated devices for given device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +	struct mdev_device *mdev, *n;
> +	int ret;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	parent = find_parent_device(dev);
> +
> +	if (!parent) {
> +		mutex_unlock(&parent_devices.list_lock);
> +		return;
> +	}
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	/*
> +	 * Remove parent from the list and remove create and destroy sysfs
> +	 * files so that no new mediated device could be created for this parent
> +	 */
> +	list_del(&parent->next);
> +	mdev_remove_sysfs_files(dev);
> +	mutex_unlock(&parent_devices.list_lock);
> +

find_parent_device() does not increase the refcount of the parent device;
after releasing the lock, is it still safe to use it?
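A sketch of one possible fix: pin the parent while still holding the list
lock, and balance the extra reference with a second mdev_put_parent() at the
end of the function:

```c
mutex_lock(&parent_devices.list_lock);
parent = mdev_get_parent(find_parent_device(dev));	/* take a reference */
if (!parent) {
	mutex_unlock(&parent_devices.list_lock);
	return;
}
```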

> +	mutex_lock(&parent->ops_lock);
> +	mdev_remove_attribute_group(dev,
> +				    parent->ops->dev_attr_groups);

Why are mdev_remove_sysfs_files() and mdev_remove_attribute_group()
protected by different locks?

> +	mutex_unlock(&parent->ops_lock);
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
> +		mdev_device_destroy_ops(mdev, true);
> +		list_del(&mdev->next);
> +		mdev_put_device(mdev);
> +	}
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	do {
> +		ret = wait_event_interruptible_timeout(parent->release_done,
> +				list_empty(&parent->mdev_list), HZ * 10);
> +		if (ret == -ERESTARTSYS) {
> +			dev_warn(dev, "Mediated devices are in use, task"
> +				      " \"%s\" (%d) "
> +				      "blocked until all are released",
> +				      current->comm, task_pid_nr(current));
> +		}
> +	} while (ret <= 0);
> +
> +	mdev_put_parent(parent);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev-sysfs
> + */
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev);
> +}
> +
> +int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +		       char *mdev_params)
> +{
> +	int ret;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	/* Check for duplicate */
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (mdev) {
> +		ret = -EEXIST;
> +		goto create_err;
> +	}
> +
> +	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> +	if (!mdev) {
> +		ret = -ENOMEM;
> +		goto create_err;
> +	}
> +
> +	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> +	mdev->instance = instance;
> +	mdev->parent = parent;
> +	mutex_init(&mdev->ops_lock);
> +	kref_init(&mdev->ref);
> +
> +	mdev->dev.parent  = dev;
> +	mdev->dev.bus     = &mdev_bus_type;
> +	mdev->dev.release = mdev_device_release;
> +	dev_set_name(&mdev->dev, "%pUb-%d", uuid.b, instance);
> +
> +	ret = device_register(&mdev->dev);
> +	if (ret) {
> +		put_device(&mdev->dev);
> +		goto create_err;
> +	}
> +
> +	ret = mdev_device_create_ops(mdev, mdev_params);
> +	if (ret)
> +		goto create_failed;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_add(&mdev->next, &parent->mdev_list);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	dev_dbg(&mdev->dev, "MDEV: created\n");
> +
> +	return ret;
> +
> +create_failed:
> +	device_unregister(&mdev->dev);
> +
> +create_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance)
> +{
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	int ret;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}
> +
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (!mdev) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}
> +
> +	ret = mdev_device_destroy_ops(mdev, false);
> +	if (ret)
> +		goto destroy_err;

find_mdev_device() does not take a reference on the mdev; is this safe?
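A sketch of the kind of change that would make this safe, taking the
reference while the mdev is known to be on the list:

```c
mutex_lock(&parent->mdev_list_lock);
mdev = mdev_get_device(find_mdev_device(parent, uuid, instance));
mutex_unlock(&parent->mdev_list_lock);
if (!mdev) {
	ret = -EINVAL;
	goto destroy_err;
}
```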

> +
> +	mdev_put_parent(parent);
> +

The refcount of the parent device is dropped here, so it must not be
used afterwards.
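Dropping the parent reference only after the list manipulation would avoid
that, e.g. (sketch of the reordered tail of this function):

```c
mutex_lock(&parent->mdev_list_lock);
list_del(&mdev->next);
mutex_unlock(&parent->mdev_list_lock);

mdev_put_device(mdev);
mdev_put_parent(parent);	/* drop the parent last */
return ret;
```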

> +	mutex_lock(&parent->mdev_list_lock);
> +	list_del(&mdev->next);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +
> +destroy_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +void mdev_device_supported_config(struct device *dev, char *str)
> +{
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +
> +	if (parent) {
> +		mutex_lock(&parent->ops_lock);
> +		if (parent->ops->supported_config)
> +			parent->ops->supported_config(parent->dev, str);
> +		mutex_unlock(&parent->ops_lock);
> +		mdev_put_parent(parent);
> +	}
> +}
> +
> +int mdev_device_start(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->start)
> +		ret = parent->ops->start(mdev->uuid);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	if (ret)
> +		pr_err("mdev_start failed  %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
> +
> +	mdev_put_device(mdev);
> +
> +	return ret;
> +}
> +
> +int mdev_device_shutdown(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->shutdown)
> +		ret = parent->ops->shutdown(mdev->uuid);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	if (ret)
> +		pr_err("mdev_shutdown failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +}
> +
> +static struct class mdev_class = {
> +	.name		= MDEV_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= mdev_class_attrs,

These interfaces, start and shutdown, operate on a UUID; what if we want to
operate on a specific instance?

> +};
> +
> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	mutex_init(&parent_devices.list_lock);
> +	INIT_LIST_HEAD(&parent_devices.dev_list);
> +
> +	ret = class_register(&mdev_class);
> +	if (ret) {
> +		pr_err("Failed to register mdev class\n");
> +		return ret;
> +	}
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		class_unregister(&mdev_class);
> +		return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	mdev_bus_unregister();
> +	class_unregister(&mdev_class);
> +}

Hmm, how do we prevent the module from being unloaded while parent devices
still exist?

> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
> new file mode 100644
> index 000000000000..f1aed541111d
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_driver.c
> @@ -0,0 +1,138 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdev_attach_iommu(struct mdev_device *mdev)
> +{
> +	int ret;
> +	struct iommu_group *group;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	ret = iommu_group_add_device(group, &mdev->dev);
> +	if (ret) {
> +		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdev->group = group;
> +
> +	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return ret;
> +}
> +
> +static void mdev_detach_iommu(struct mdev_device *mdev)
> +{
> +	iommu_group_remove_device(&mdev->dev);
> +	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdev_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	ret = mdev_attach_iommu(mdev);
> +	if (ret) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return ret;
> +	}
> +
> +	if (drv && drv->probe)
> +		ret = drv->probe(dev);

If probe() fails, shouldn't the mdev be detached from the IOMMU group?
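Something like this sketch would undo the attach on failure:

```c
if (drv && drv->probe) {
	ret = drv->probe(dev);
	if (ret)
		mdev_detach_iommu(mdev);
}

return ret;
```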

> +
> +	return ret;
> +}
> +
> +static int mdev_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdev_detach_iommu(mdev);
> +
> +	return 0;
> +}
> +
> +static int mdev_match(struct device *dev, struct device_driver *drv)
> +{
> +	struct mdev_driver *mdrv = to_mdev_driver(drv);
> +
> +	if (mdrv && mdrv->match)
> +		return mdrv->match(dev);
> +
> +	return 0;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.match		= mdev_match,
> +	.probe		= mdev_probe,
> +	.remove		= mdev_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/*
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: module owner of driver to be registered
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/*
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..991d7f796169
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,33 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +			char *mdev_params);
> +int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
> +void mdev_device_supported_config(struct device *dev, char *str);
> +int  mdev_device_start(uuid_le uuid);
> +int  mdev_device_shutdown(uuid_le uuid);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> new file mode 100644
> index 000000000000..48b66e40009e
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -0,0 +1,300 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +/* Static functions */
> +
> +#define UUID_CHAR_LENGTH	36
> +#define UUID_BYTE_LENGTH	16
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
> +
> +static inline bool is_uuid_sep(char sep)
> +{
> +	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
> +		return true;
> +	return false;
> +}
> +
> +static int uuid_parse(const char *str, uuid_le *uuid)
> +{
> +	int i;
> +
> +	if (strlen(str) < UUID_CHAR_LENGTH)
> +		return -EINVAL;
> +
> +	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
> +		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
> +			pr_err("%s err", __func__);
> +			return -EINVAL;
> +		}
> +
> +		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
> +		str += 2;
> +		if (is_uuid_sep(*str))
> +			str++;
> +	}
> +
> +	return 0;
> +}
> +

Can we use uuid_le_to_bin()?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] Mediated device Core driver
@ 2016-06-29 13:51     ` Xiao Guangrong
  0 siblings, 0 replies; 51+ messages in thread
From: Xiao Guangrong @ 2016-06-29 13:51 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: shuai.ruan, jike.song, kvm, kevin.tian, qemu-devel, zhiyuan.lv, bjsdjshi



On 06/21/2016 12:31 AM, Kirti Wankhede wrote:
> Design for Mediated Device Driver:
> Main purpose of this driver is to provide a common interface for mediated
> device management that can be used by different drivers of different
> devices.
>
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
>
> Below is the high Level block diagram, with Nvidia, Intel and IBM devices
> as example, since these are the devices which are going to actively use
> this module as of now.
>
>   +---------------+
>   |               |
>   | +-----------+ |  mdev_register_driver() +--------------+
>   | |           | +<------------------------+ __init()     |
>   | |           | |                         |              |
>   | |  mdev     | +------------------------>+              |<-> VFIO user
>   | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
>   | |  driver   | |                         |              |
>   | |           | |                         +--------------+
>   | |           | |  mdev_register_driver() +--------------+
>   | |           | +<------------------------+ __init()     |
>   | |           | |                         |              |
>   | |           | +------------------------>+              |<-> VFIO user
>   | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
>   |               |                         |              |
>   |  MDEV CORE    |                         +--------------+
>   |   MODULE      |
>   |   mdev.ko     |
>   | +-----------+ |  mdev_register_device() +--------------+
>   | |           | +<------------------------+              |
>   | |           | |                         |  nvidia.ko   |<-> physical
>   | |           | +------------------------>+              |    device
>   | |           | |        callback         +--------------+
>   | | Physical  | |
>   | |  device   | |  mdev_register_device() +--------------+
>   | | interface | |<------------------------+              |
>   | |           | |                         |  i915.ko     |<-> physical
>   | |           | +------------------------>+              |    device
>   | |           | |        callback         +--------------+
>   | |           | |
>   | |           | |  mdev_register_device() +--------------+
>   | |           | +<------------------------+              |
>   | |           | |                         | ccw_device.ko|<-> physical
>   | |           | +------------------------>+              |    device
>   | |           | |        callback         +--------------+
>   | +-----------+ |
>   +---------------+
>
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
>
> /**
>  * struct mdev_driver - Mediated device's driver
>  * @name: driver name
>  * @probe: called when new device is created
>  * @remove: called when device is removed
>  * @match: called when new device or driver is added for this bus.
>  *	   Return 1 if given device can be handled by given driver and
>  *	   zero otherwise.
>  * @driver: device driver structure
>  **/
> struct mdev_driver {
> 	const char *name;
> 	int  (*probe)(struct device *dev);
> 	void (*remove)(struct device *dev);
> 	int  (*match)(struct device *dev);
> 	struct device_driver driver;
> };
>
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
>
> A mediated device driver should use this interface to register with the
> core driver. Such a driver is then responsible for adding its mediated
> devices to a VFIO group.
>
> 2. Physical device driver interface
> This interface provides the vendor driver a set of APIs to manage
> physical device related work in its own driver. The APIs are:
> - supported_config: provide supported configuration list by the vendor
> 		    driver
> - create: to allocate basic resources in vendor driver for a mediated
> 	  device.
> - destroy: to free resources in vendor driver when mediated device is
> 	   destroyed.
> - start: to initiate mediated device initialization process from vendor
> 	 driver when VM boots and before QEMU starts.
> - shutdown: to tear down mediated device resources during VM teardown.
> - read : read emulation callback.
> - write: write emulation callback.
> - set_irqs: send interrupt configuration information that QEMU sets.
> - get_region_info: to provide region size and its flags for the mediated
> 		   device.
> - validate_map_request: to validate remap pfn request.
>
> This registration interface should be used by vendor drivers to register
> each physical device to mdev core driver.
>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I73a5084574270b14541c529461ea2f03c292d510
> ---
>   drivers/vfio/Kconfig             |   1 +
>   drivers/vfio/Makefile            |   1 +
>   drivers/vfio/mdev/Kconfig        |  11 +
>   drivers/vfio/mdev/Makefile       |   5 +
>   drivers/vfio/mdev/mdev_core.c    | 595 +++++++++++++++++++++++++++++++++++++++
>   drivers/vfio/mdev/mdev_driver.c  | 138 +++++++++
>   drivers/vfio/mdev/mdev_private.h |  33 +++
>   drivers/vfio/mdev/mdev_sysfs.c   | 300 ++++++++++++++++++++
>   include/linux/mdev.h             | 232 +++++++++++++++
>   9 files changed, 1316 insertions(+)
>   create mode 100644 drivers/vfio/mdev/Kconfig
>   create mode 100644 drivers/vfio/mdev/Makefile
>   create mode 100644 drivers/vfio/mdev/mdev_core.c
>   create mode 100644 drivers/vfio/mdev/mdev_driver.c
>   create mode 100644 drivers/vfio/mdev/mdev_private.h
>   create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>   create mode 100644 include/linux/mdev.h
>
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
>
>   source "drivers/vfio/pci/Kconfig"
>   source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>   source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..7c70753e54ab 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>   obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>   obj-$(CONFIG_VFIO_PCI) += pci/
>   obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..951e2bb06a3f
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,11 @@
> +
> +config MDEV
> +    tristate "Mediated device driver framework"
> +    depends on VFIO
> +    default n
> +    help
> +        MDEV provides a framework to virtualize devices without SR-IOV
> +        capability. See Documentation/mdev.txt for more details.
> +
> +        If you don't know what to do here, say N.
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..2c6d11f7bc24
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,5 @@
> +
> +mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
> +
> +obj-$(CONFIG_MDEV) += mdev.o
> +
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> new file mode 100644
> index 000000000000..3c45ed2ae1e9
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -0,0 +1,595 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/cdev.h>
> +#include <linux/sched.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +#define MDEV_CLASS_NAME		"mdev"
> +
> +static struct devices_list {
> +	struct list_head    dev_list;
> +	struct mutex        list_lock;
> +} parent_devices;
> +
> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}
> +

Would it be better to use device_add_groups() / device_remove_groups() instead?

> +static struct mdev_device *find_mdev_device(struct parent_device *parent,
> +					    uuid_le uuid, int instance)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +
> +	list_for_each_entry(p, &parent->mdev_list, next) {
> +		if ((uuid_le_cmp(p->uuid, uuid) == 0) &&
> +		    (p->instance == instance)) {
> +			mdev = p;
> +			break;
> +		}
> +	}
> +	return mdev;
> +}
> +
> +/* Should be called holding parent_devices.list_lock */
> +static struct parent_device *find_parent_device(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	WARN_ON(!mutex_is_locked(&parent_devices.list_lock));

This is a static function, do we really need this assert?

> +	list_for_each_entry(p, &parent_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			parent = p;
> +			break;
> +		}
> +	}
> +	return parent;
> +}
> +
> +static void mdev_release_parent(struct kref *kref)
> +{
> +	struct parent_device *parent = container_of(kref, struct parent_device,
> +						    ref);
> +	kfree(parent);
> +}
> +
> +static
> +inline struct parent_device *mdev_get_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_get(&parent->ref);
> +
> +	return parent;
> +}
> +
> +static inline void mdev_put_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_put(&parent->ref, mdev_release_parent);
> +}
> +
> +static struct parent_device *mdev_get_parent_by_dev(struct device *dev)
> +{
> +	struct parent_device *parent = NULL, *p;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(p, &parent_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			parent = mdev_get_parent(p);
> +			break;
> +		}
> +	}

Can directly call find_parent_device().

> +	mutex_unlock(&parent_devices.list_lock);
> +	return parent;
> +}
> +
> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->create) {
> +		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
> +					mdev->instance, mdev_params);

I think it is better if we pass @mdev to this callback, then the parent driver
can do its specified operations and associate it with the instance,
e.g, via mdev->private.

> +		if (ret)
> +			goto create_ops_err;
> +	}
> +
> +	ret = mdev_add_attribute_group(&mdev->dev,
> +					parent->ops->mdev_attr_groups);
> +create_ops_err:
> +	mutex_unlock(&parent->ops_lock);
> +	return ret;
> +}
> +
> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret = 0;
> +
> +	/*
> +	 * If the vendor driver doesn't return success, it means the vendor
> +	 * driver doesn't support hot-unplug.
> +	 */
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->destroy) {
> +		ret = parent->ops->destroy(parent->dev, mdev->uuid,
> +					   mdev->instance);
> +		if (ret && !force) {
> +			ret = -EBUSY;
> +			goto destroy_ops_err;
> +		}
> +	}
> +	mdev_remove_attribute_group(&mdev->dev,
> +				    parent->ops->mdev_attr_groups);
> +destroy_ops_err:
> +	mutex_unlock(&parent->ops_lock);
> +
> +	return ret;
> +}
> +
> +static void mdev_release_device(struct kref *kref)
> +{
> +	struct mdev_device *mdev = container_of(kref, struct mdev_device, ref);
> +	struct parent_device *parent = mdev->parent;
> +
> +	device_unregister(&mdev->dev);
> +	wake_up(&parent->release_done);
> +	mdev_put_parent(parent);
> +}
> +
> +struct mdev_device *mdev_get_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_get(&mdev->ref);
> +
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device);
> +
> +void mdev_put_device(struct mdev_device *mdev)
> +{
> +	if (mdev)
> +		kref_put(&mdev->ref, mdev_release_device);
> +}
> +EXPORT_SYMBOL(mdev_put_device);
> +
> +/*
> + * Find first mediated device from given uuid and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +static struct mdev_device *mdev_get_first_device_by_uuid(uuid_le uuid)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(parent, &parent_devices.dev_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (uuid_le_cmp(p->uuid, uuid) == 0) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return mdev;
> +}
> +
> +/*
> + * Find mediated device from given iommu_group and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +	struct parent_device *parent;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	list_for_each_entry(parent, &parent_devices.dev_list, next) {
> +		mutex_lock(&parent->mdev_list_lock);
> +		list_for_each_entry(p, &parent->mdev_list, next) {
> +			if (!p->group)
> +				continue;
> +
> +			if (iommu_group_id(p->group) == iommu_group_id(group)) {
> +				mdev = mdev_get_device(p);
> +				break;
> +			}
> +		}
> +		mutex_unlock(&parent->mdev_list_lock);
> +
> +		if (mdev)
> +			break;
> +	}
> +	mutex_unlock(&parent_devices.list_lock);
> +	return mdev;
> +}
> +EXPORT_SYMBOL(mdev_get_device_by_group);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +
> +	/* Check for duplicate */
> +	parent = find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +	list_add(&parent->next, &parent_devices.dev_list);
> +	mutex_unlock(&parent_devices.list_lock);

This is not safe, as Alex already pointed out.

> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +	mutex_init(&parent->ops_lock);
> +	mutex_init(&parent->mdev_list_lock);
> +	INIT_LIST_HEAD(&parent->mdev_list);
> +	init_waitqueue_head(&parent->release_done);

And no lock to protect these operations.

> +
> +	ret = mdev_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_lock(&parent_devices.list_lock);
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_devices.list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_devices.list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a parent device
> + * @dev: device structure representing parent device.
> + *
> + * Remove device from list of registered parent devices. Give a chance to free
> + * existing mediated devices for given device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +	struct mdev_device *mdev, *n;
> +	int ret;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +	parent = find_parent_device(dev);
> +
> +	if (!parent) {
> +		mutex_unlock(&parent_devices.list_lock);
> +		return;
> +	}
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	/*
> +	 * Remove parent from the list and remove create and destroy sysfs
> +	 * files so that no new mediated device could be created for this parent
> +	 */
> +	list_del(&parent->next);
> +	mdev_remove_sysfs_files(dev);
> +	mutex_unlock(&parent_devices.list_lock);
> +

find_parent_device() does not increase the refcount of the parent device;
after releasing the lock, is it still safe to use the device?

> +	mutex_lock(&parent->ops_lock);
> +	mdev_remove_attribute_group(dev,
> +				    parent->ops->dev_attr_groups);

Why mdev_remove_sysfs_files() and mdev_remove_attribute_group()
are protected by different locks?

> +	mutex_unlock(&parent->ops_lock);
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_for_each_entry_safe(mdev, n, &parent->mdev_list, next) {
> +		mdev_device_destroy_ops(mdev, true);
> +		list_del(&mdev->next);
> +		mdev_put_device(mdev);
> +	}
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	do {
> +		ret = wait_event_interruptible_timeout(parent->release_done,
> +				list_empty(&parent->mdev_list), HZ * 10);
> +		if (ret == -ERESTARTSYS) {
> +			dev_warn(dev, "Mediated devices are in use, task"
> +				      " \"%s\" (%d) "
> +				      "blocked until all are released",
> +				      current->comm, task_pid_nr(current));
> +		}
> +	} while (ret <= 0);
> +
> +	mdev_put_parent(parent);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev-sysfs
> + */
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev);
> +}
> +
> +int mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +		       char *mdev_params)
> +{
> +	int ret;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	/* Check for duplicate */
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (mdev) {
> +		ret = -EEXIST;
> +		goto create_err;
> +	}
> +
> +	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> +	if (!mdev) {
> +		ret = -ENOMEM;
> +		goto create_err;
> +	}
> +
> +	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> +	mdev->instance = instance;
> +	mdev->parent = parent;
> +	mutex_init(&mdev->ops_lock);
> +	kref_init(&mdev->ref);
> +
> +	mdev->dev.parent  = dev;
> +	mdev->dev.bus     = &mdev_bus_type;
> +	mdev->dev.release = mdev_device_release;
> +	dev_set_name(&mdev->dev, "%pUb-%d", uuid.b, instance);
> +
> +	ret = device_register(&mdev->dev);
> +	if (ret) {
> +		put_device(&mdev->dev);
> +		goto create_err;
> +	}
> +
> +	ret = mdev_device_create_ops(mdev, mdev_params);
> +	if (ret)
> +		goto create_failed;
> +
> +	mutex_lock(&parent->mdev_list_lock);
> +	list_add(&mdev->next, &parent->mdev_list);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	dev_dbg(&mdev->dev, "MDEV: created\n");
> +
> +	return ret;
> +
> +create_failed:
> +	device_unregister(&mdev->dev);
> +
> +create_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance)
> +{
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	int ret;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +	if (!parent) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}
> +
> +	mdev = find_mdev_device(parent, uuid, instance);
> +	if (!mdev) {
> +		ret = -EINVAL;
> +		goto destroy_err;
> +	}
> +
> +	ret = mdev_device_destroy_ops(mdev, false);
> +	if (ret)
> +		goto destroy_err;

find_mdev_device() does not take a reference on the mdev; is this safe?

> +
> +	mdev_put_parent(parent);
> +

The refcount of the parent device has already been released here, so you
cannot continue to use it.

> +	mutex_lock(&parent->mdev_list_lock);
> +	list_del(&mdev->next);
> +	mutex_unlock(&parent->mdev_list_lock);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +
> +destroy_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +void mdev_device_supported_config(struct device *dev, char *str)
> +{
> +	struct parent_device *parent;
> +
> +	parent = mdev_get_parent_by_dev(dev);
> +
> +	if (parent) {
> +		mutex_lock(&parent->ops_lock);
> +		if (parent->ops->supported_config)
> +			parent->ops->supported_config(parent->dev, str);
> +		mutex_unlock(&parent->ops_lock);
> +		mdev_put_parent(parent);
> +	}
> +}
> +
> +int mdev_device_start(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->start)
> +		ret = parent->ops->start(mdev->uuid);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	if (ret)
> +		pr_err("mdev_start failed  %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_ONLINE);
> +
> +	mdev_put_device(mdev);
> +
> +	return ret;
> +}
> +
> +int mdev_device_shutdown(uuid_le uuid)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +
> +	mdev = mdev_get_first_device_by_uuid(uuid);
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	parent = mdev->parent;
> +
> +	mutex_lock(&parent->ops_lock);
> +	if (parent->ops->shutdown)
> +		ret = parent->ops->shutdown(mdev->uuid);
> +	mutex_unlock(&parent->ops_lock);
> +
> +	if (ret)
> +		pr_err("mdev_shutdown failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdev->dev.kobj, KOBJ_OFFLINE);
> +
> +	mdev_put_device(mdev);
> +	return ret;
> +}
> +
> +static struct class mdev_class = {
> +	.name		= MDEV_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= mdev_class_attrs,

These interfaces, start and shutdown, are based on UUID; what if we want
to operate on a specific instance?

> +};
> +
> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	mutex_init(&parent_devices.list_lock);
> +	INIT_LIST_HEAD(&parent_devices.dev_list);
> +
> +	ret = class_register(&mdev_class);
> +	if (ret) {
> +		pr_err("Failed to register mdev class\n");
> +		return ret;
> +	}
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		class_unregister(&mdev_class);
> +		return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	mdev_bus_unregister();
> +	class_unregister(&mdev_class);
> +}

Hmm, what prevents the module from being unloaded while parent devices
still exist?

> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
> new file mode 100644
> index 000000000000..f1aed541111d
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_driver.c
> @@ -0,0 +1,138 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdev_attach_iommu(struct mdev_device *mdev)
> +{
> +	int ret;
> +	struct iommu_group *group;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	ret = iommu_group_add_device(group, &mdev->dev);
> +	if (ret) {
> +		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdev->group = group;
> +
> +	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return ret;
> +}
> +
> +static void mdev_detach_iommu(struct mdev_device *mdev)
> +{
> +	iommu_group_remove_device(&mdev->dev);
> +	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdev_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	ret = mdev_attach_iommu(mdev);
> +	if (ret) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return ret;
> +	}
> +
> +	if (drv && drv->probe)
> +		ret = drv->probe(dev);

If probe fails, doesn't the mdev need to be detached from the IOMMU group?

> +
> +	return ret;
> +}
> +
> +static int mdev_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdev_detach_iommu(mdev);
> +
> +	return 0;
> +}
> +
> +static int mdev_match(struct device *dev, struct device_driver *drv)
> +{
> +	struct mdev_driver *mdrv = to_mdev_driver(drv);
> +
> +	if (mdrv && mdrv->match)
> +		return mdrv->match(dev);
> +
> +	return 0;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.match		= mdev_match,
> +	.probe		= mdev_probe,
> +	.remove		= mdev_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/*
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: module owner of driver to be registered
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/*
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..991d7f796169
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,33 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  mdev_device_create(struct device *dev, uuid_le uuid, uint32_t instance,
> +			char *mdev_params);
> +int  mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t instance);
> +void mdev_device_supported_config(struct device *dev, char *str);
> +int  mdev_device_start(uuid_le uuid);
> +int  mdev_device_shutdown(uuid_le uuid);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> new file mode 100644
> index 000000000000..48b66e40009e
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -0,0 +1,300 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +/* Static functions */
> +
> +#define UUID_CHAR_LENGTH	36
> +#define UUID_BYTE_LENGTH	16
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
> +
> +static inline bool is_uuid_sep(char sep)
> +{
> +	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
> +		return true;
> +	return false;
> +}
> +
> +static int uuid_parse(const char *str, uuid_le *uuid)
> +{
> +	int i;
> +
> +	if (strlen(str) < UUID_CHAR_LENGTH)
> +		return -EINVAL;
> +
> +	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
> +		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
> +			pr_err("%s err\n", __func__);
> +			return -EINVAL;
> +		}
> +
> +		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
> +		str += 2;
> +		if (is_uuid_sep(*str))
> +			str++;
> +	}
> +
> +	return 0;
> +}
> +

Can we use uuid_le_to_bin()?


* Re: [PATCH 2/3] VFIO driver for mediated PCI device
  2016-06-20 16:31   ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-30  6:34     ` Xiao Guangrong
  -1 siblings, 0 replies; 51+ messages in thread
From: Xiao Guangrong @ 2016-06-30  6:34 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: shuai.ruan, jike.song, kvm, kevin.tian, qemu-devel, zhiyuan.lv, bjsdjshi



On 06/21/2016 12:31 AM, Kirti Wankhede wrote:

> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	int ret;
> +	struct vfio_mdev *vmdev = vma->vm_private_data;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	u64 virtaddr = (u64)vmf->virtual_address;
> +	u64 offset, phyaddr;
> +	unsigned long req_size, pgoff;
> +	pgprot_t pg_prot;
> +
> +	if (!vmdev || !vmdev->mdev)
> +		return -EINVAL;
> +
> +	mdev = vmdev->mdev;
> +	parent  = mdev->parent;
> +
> +	offset   = virtaddr - vma->vm_start;
> +	phyaddr  = (vma->vm_pgoff << PAGE_SHIFT) + offset;
> +	pgoff    = phyaddr >> PAGE_SHIFT;
> +	req_size = vma->vm_end - virtaddr;
> +	pg_prot  = vma->vm_page_prot;
> +
> +	if (parent && parent->ops->validate_map_request) {
> +		mutex_lock(&mdev->ops_lock);
> +		ret = parent->ops->validate_map_request(mdev, virtaddr,
> +							 &pgoff, &req_size,
> +							 &pg_prot);

This callback does not only 'validate' but also adjusts the parameters.

> +		mutex_unlock(&mdev->ops_lock);
> +		if (ret)
> +			return ret;
> +
> +		if (!req_size)
> +			return -EINVAL;
> +	}
> +
> +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
> +

Do you know why "Issues with mmap region fault handler, EPT is not
correctly populated with the information provided by remap_pfn_range()
inside fault handler" happens, as you mentioned in patch 0?

> +	return ret | VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct mdev_dev_mmio_ops = {
> +	.fault = mdev_dev_mmio_fault,
> +};
> +
> +
> +static int vfio_mpci_mmap(void *device_data, struct vm_area_struct *vma)
> +{
> +	unsigned int index;
> +	struct vfio_mdev *vmdev = device_data;
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct pci_dev *pdev;
> +	unsigned long pgoff;
> +	loff_t offset;
> +
> +	if (!mdev->parent || !dev_is_pci(mdev->parent->dev))
> +		return -EINVAL;
> +
> +	pdev = to_pci_dev(mdev->parent->dev);
> +
> +	offset = vma->vm_pgoff << PAGE_SHIFT;
> +
> +	index = VFIO_PCI_OFFSET_TO_INDEX(offset);
> +
> +	if (index >= VFIO_PCI_ROM_REGION_INDEX)
> +		return -EINVAL;
> +
> +	pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +
> +	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> +
> +	vma->vm_private_data = vmdev;
> +	vma->vm_ops = &mdev_dev_mmio_ops;
> +
> +	return 0;
> +}
> +
> +static const struct vfio_device_ops vfio_mpci_dev_ops = {
> +	.name		= "vfio-mpci",
> +	.open		= vfio_mpci_open,
> +	.release	= vfio_mpci_close,
> +	.ioctl		= vfio_mpci_unlocked_ioctl,
> +	.read		= vfio_mpci_read,
> +	.write		= vfio_mpci_write,
> +	.mmap		= vfio_mpci_mmap,
> +};
> +
> +int vfio_mpci_probe(struct device *dev)
> +{
> +	struct vfio_mdev *vmdev;
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
> +	if (!vmdev)
> +		return -ENOMEM;
> +
> +	vmdev->mdev = mdev_get_device(mdev);
> +	vmdev->group = mdev->group;
> +	mutex_init(&vmdev->vfio_mdev_lock);
> +
> +	ret = vfio_add_group_dev(dev, &vfio_mpci_dev_ops, vmdev);
> +	if (ret)
> +		kfree(vmdev);
> +
> +	mdev_put_device(mdev);

If you can make sure that mdev is always valid during vmdev's lifecycle,
it is not necessary to get and put the refcount of mdev.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] VFIO driver for mediated PCI device
@ 2016-06-30  6:34     ` Xiao Guangrong
  0 siblings, 0 replies; 51+ messages in thread
From: Xiao Guangrong @ 2016-06-30  6:34 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: shuai.ruan, jike.song, kvm, kevin.tian, qemu-devel, zhiyuan.lv, bjsdjshi



On 06/21/2016 12:31 AM, Kirti Wankhede wrote:

> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	int ret;
> +	struct vfio_mdev *vmdev = vma->vm_private_data;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	u64 virtaddr = (u64)vmf->virtual_address;
> +	u64 offset, phyaddr;
> +	unsigned long req_size, pgoff;
> +	pgprot_t pg_prot;
> +
> +	if (!vmdev || !vmdev->mdev)
> +		return -EINVAL;
> +
> +	mdev = vmdev->mdev;
> +	parent  = mdev->parent;
> +
> +	offset   = virtaddr - vma->vm_start;
> +	phyaddr  = (vma->vm_pgoff << PAGE_SHIFT) + offset;
> +	pgoff    = phyaddr >> PAGE_SHIFT;
> +	req_size = vma->vm_end - virtaddr;
> +	pg_prot  = vma->vm_page_prot;
> +
> +	if (parent && parent->ops->validate_map_request) {
> +		mutex_lock(&mdev->ops_lock);
> +		ret = parent->ops->validate_map_request(mdev, virtaddr,
> +							 &pgoff, &req_size,
> +							 &pg_prot);

It does not only 'validate' but also adjusts the parameters.

> +		mutex_unlock(&mdev->ops_lock);
> +		if (ret)
> +			return ret;
> +
> +		if (!req_size)
> +			return -EINVAL;
> +	}
> +
> +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
> +

Do you know why "Issues with mmap region fault handler, EPT is not correctly
populated with the information provided by remap_pfn_range() inside fault
handler", as you mentioned in patch 0?

> +	return ret | VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct mdev_dev_mmio_ops = {
> +	.fault = mdev_dev_mmio_fault,
> +};
> +
> +
> +static int vfio_mpci_mmap(void *device_data, struct vm_area_struct *vma)
> +{
> +	unsigned int index;
> +	struct vfio_mdev *vmdev = device_data;
> +	struct mdev_device *mdev = vmdev->mdev;
> +	struct pci_dev *pdev;
> +	unsigned long pgoff;
> +	loff_t offset;
> +
> +	if (!mdev->parent || !dev_is_pci(mdev->parent->dev))
> +		return -EINVAL;
> +
> +	pdev = to_pci_dev(mdev->parent->dev);
> +
> +	offset = vma->vm_pgoff << PAGE_SHIFT;
> +
> +	index = VFIO_PCI_OFFSET_TO_INDEX(offset);
> +
> +	if (index >= VFIO_PCI_ROM_REGION_INDEX)
> +		return -EINVAL;
> +
> +	pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +
> +	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> +
> +	vma->vm_private_data = vmdev;
> +	vma->vm_ops = &mdev_dev_mmio_ops;
> +
> +	return 0;
> +}
> +
> +static const struct vfio_device_ops vfio_mpci_dev_ops = {
> +	.name		= "vfio-mpci",
> +	.open		= vfio_mpci_open,
> +	.release	= vfio_mpci_close,
> +	.ioctl		= vfio_mpci_unlocked_ioctl,
> +	.read		= vfio_mpci_read,
> +	.write		= vfio_mpci_write,
> +	.mmap		= vfio_mpci_mmap,
> +};
> +
> +int vfio_mpci_probe(struct device *dev)
> +{
> +	struct vfio_mdev *vmdev;
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	if (!mdev)
> +		return -EINVAL;
> +
> +	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
> +	if (!vmdev)
> +		return -ENOMEM;
> +
> +	vmdev->mdev = mdev_get_device(mdev);
> +	vmdev->group = mdev->group;
> +	mutex_init(&vmdev->vfio_mdev_lock);
> +
> +	ret = vfio_add_group_dev(dev, &vfio_mpci_dev_ops, vmdev);
> +	if (ret)
> +		kfree(vmdev);
> +
> +	mdev_put_device(mdev);

If you can make sure that mdev is always valid during vmdev's lifecycle,
it is not necessary to get and put the refcount of mdev.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] Mediated device Core driver
  2016-06-29 13:51     ` [Qemu-devel] " Xiao Guangrong
@ 2016-06-30  7:12       ` Jike Song
  -1 siblings, 0 replies; 51+ messages in thread
From: Jike Song @ 2016-06-30  7:12 UTC (permalink / raw)
  To: Xiao Guangrong
  Cc: shuai.ruan, kevin.tian, cjia, kvm, alex.williamson, qemu-devel,
	Kirti Wankhede, kraxel, pbonzini, bjsdjshi, zhiyuan.lv

On 06/29/2016 09:51 PM, Xiao Guangrong wrote:
> On 06/21/2016 12:31 AM, Kirti Wankhede wrote:
>> +	mutex_unlock(&parent_devices.list_lock);
>> +	return parent;
>> +}
>> +
>> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
>> +{
>> +	struct parent_device *parent = mdev->parent;
>> +	int ret;
>> +
>> +	mutex_lock(&parent->ops_lock);
>> +	if (parent->ops->create) {
>> +		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
>> +					mdev->instance, mdev_params);
> 
> I think it is better if we pass @mdev to this callback, then the parent driver
> can do its specific operations and associate it with the instance,
> e.g., via mdev->private.
> 

Just noticed that mdev->driver_data is missing in v5, I'd like to have it back :)

Yes, either mdev needs to be passed to the parent driver (preferred), or
find_mdev_device needs to be exported for the parent driver (less preferred,
but at least functional).

--
Thanks,
Jike



* Re: [PATCH 3/3] VFIO Type1 IOMMU: Add support for mediated devices
  2016-06-29  2:46         ` [Qemu-devel] " Alex Williamson
@ 2016-06-30  8:28           ` Tian, Kevin
  -1 siblings, 0 replies; 51+ messages in thread
From: Tian, Kevin @ 2016-06-30  8:28 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: Ruan, Shuai, Song, Jike, cjia, kvm, qemu-devel, kraxel, pbonzini,
	bjsdjshi, Lv, Zhiyuan

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, June 29, 2016 10:46 AM
> 
> On Tue, 28 Jun 2016 18:32:44 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 6/22/2016 9:16 AM, Alex Williamson wrote:
> > > On Mon, 20 Jun 2016 22:01:48 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >
> > >>
> > >>  struct vfio_iommu {
> > >>  	struct list_head	domain_list;
> > >> +	struct vfio_domain	*mediated_domain;
> > >
> > > I'm not really a fan of how this is so often used to special case the
> > > code...

Better to remove this field and treat mediated_domain the same as other
domains (within vfio_domain you already have additional fields to mark the
mediated fact).

> 
> 
> > >> +	struct mdev_device	*mdev;
> > >
> > > This gets set on attach_group where we use the iommu_group to lookup
> > > the mdev, so why can't we do that on the other paths that make use of
> > > this?  I think this is just holding a reference.
> > >
> >
> > mdev is retrieved from attach_group for 2 reasons:
> > 1. to increase the ref count of mdev, mdev_get_device_by_group(), when
> > its iommu_group is attached. That should be decremented, by
> > mdev_put_device(), from detach while detaching its iommu_group. This is
> > to make sure that mdev is not freed until its iommu_group is detached from
> > the container.
> >
> > 2. to save a reference to iommu_data that the vendor driver would use to
> > call vfio_pin_pages() and vfio_unpin_pages(). More details below.

Can't we retrieve mdev_device from iommu_group once the mdev is attached
to an iommu_group?

> > >> @@ -569,6 +819,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > >>  	uint64_t mask;
> > >>  	struct vfio_dma *dma;
> > >>  	unsigned long pfn;
> > >> +	struct vfio_domain *domain = NULL;
> > >>
> > >>  	/* Verify that none of our __u64 fields overflow */
> > >>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> > >> @@ -611,10 +862,21 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > >>  	/* Insert zero-sized and grow as we map chunks of it */
> > >>  	vfio_link_dma(iommu, dma);
> > >>
> > >> +	/*
> > >> +	 * Skip pin and map if domain list is empty
> > >> +	 */
> > >> +	if (list_empty(&iommu->domain_list)) {
> > >> +		dma->size = size;
> > >> +		goto map_done;
> > >> +	}
> > >
> > > Again, this would be a serious consistency error for the existing use
> > > case.  Let's use indicators that are explicit.
> > >
> >
> > Why? For the existing use case (i.e. direct device assignment) domain_list
> > will not be empty; domain_list will only be empty when there is a mediated
> > device assigned and no direct device assigned.
> 
> I'm trying to be cautious whether it actually makes sense to
> dual-purpose the existing type1 iommu backend.  What's the benefit of
> putting an exit path in the middle of a function versus splitting it in
> two separate functions with two separate callers, one of which only
> calls the first function.  What's the benefit of a struct vfio_iommu
> that hosts both mediated and directly assigned devices?  Is the
> benefit to the functionality or to the code base?  Should the fact that
> they use the same iommu API dictate that they're also managed in the
> same data structures?  When we have too many special case exits or
> branches, the code complexity increases, bugs are harder to flush out,
> and possible exploits are more likely.  Convince me that this is the
> right approach.

If we have mediated_domain as a normal vfio_domain added to domain_list
of vfio_iommu, no such special case would be there then.

> 
> > > Should vfio_pin_pages instead have a struct
> > > device* parameter from which we would lookup the iommu_group and get to
> > > the vfio_domain?  That's a bit heavy weight, but we need something
> > > along those lines.
> > >
> >
> > There could be multiple mdev devices from the same mediated vendor driver
> > in one container. In that case, that vendor driver needs a reference to
> > the container or container->iommu_data to pin and unpin pages.
> > Similarly, there could be multiple mdev devices from different mediated
> > vendor drivers in one container; in that case both vendor drivers need a
> > reference to the container or container->iommu_data to pin and unpin pages
> > in their drivers.
> 
> As above, I think something like this is necessary, the proposed
> interface here is a bit of a side-channel hack.

Page-pinning should be counted per container, regardless of whether the
pinning request is for assigned devices or mediated devices. Then double
counting should be avoided as Alex pointed out.

Thanks
Kevin



* Re: [PATCH 1/3] Mediated device Core driver
  2016-06-24 19:40         ` [Qemu-devel] " Alex Williamson
@ 2016-06-30 16:48           ` Kirti Wankhede
  -1 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-30 16:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi



On 6/25/2016 1:10 AM, Alex Williamson wrote:
> On Fri, 24 Jun 2016 23:24:58 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Alex,
>>
>> Thanks for taking closer look. I'll incorporate all the nits you suggested.
>>
>> On 6/22/2016 3:00 AM, Alex Williamson wrote:
>>> On Mon, 20 Jun 2016 22:01:46 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>  
...
>>>> +create_ops_err:
>>>> +	mutex_unlock(&parent->ops_lock);  
>>>
>>> It seems like ops_lock isn't used so much as a lock as a serialization
>>> mechanism.  Why?  Where is this serialization per parent device
>>> documented?
>>>  
>>
>> parent->ops_lock is to serialize parent device callbacks to the vendor
>> driver, i.e. supported_config(), create() and destroy().
>> mdev->ops_lock is to serialize mediated device related callbacks to the
>> vendor driver, i.e. start(), stop(), read(), write(), set_irqs(),
>> get_region_info(), validate_map_request().
>> It's not documented; I'll add comments to mdev.h about these locks.
> 
> Should it be the mediated driver core's responsibility to do this?  If
> a given mediated driver wants to serialize on their own, they can do
> that, but I don't see why we would impose that on every mediated driver.
> 

Ok. Removing these locks from here, so it would be the mediated driver's
responsibility to serialize if it needs to.

>>
>>>> +
>>>> +struct pci_region_info {
>>>> +	uint64_t start;
>>>> +	uint64_t size;
>>>> +	uint32_t flags;		/* VFIO region info flags */
>>>> +};
>>>> +
>>>> +enum mdev_emul_space {
>>>> +	EMUL_CONFIG_SPACE,	/* PCI configuration space */
>>>> +	EMUL_IO,		/* I/O register space */
>>>> +	EMUL_MMIO		/* Memory-mapped I/O space */
>>>> +};  
>>>
>>>
>>> I'm still confused why this is needed, perhaps a description here would
>>> be useful so I can stop asking.  Clearly config space is PCI only, so
>>> it's strange to have it in the common code.  Everyone not on x86 will
>>> say I/O space is also strange.  I can't keep it in my head why the
>>> read/write offsets aren't sufficient for the driver to figure out what
>>> type it is.
>>>
>>>  
>>
>> Now that the VFIO_PCI_OFFSET_* macros are moved to vfio.h, which the vendor
>> driver can also use, the above enum could be removed from read/write. But
>> again, these macros are only useful when the parent device is a PCI device.
>> How would a non-PCI parent device differentiate I/O ports and MMIO?
> 
> Moving VFIO_PCI_OFFSET_* to vfio.h already worries me, the vfio api
> does not impose fixed offsets, it's simply an implementation detail of
> vfio-pci.  We should be free to change that whenever we want and not
> break userspace.  By moving it to vfio.h and potentially having
> external mediated drivers depend on those offset macros, they now become
> part of the kABI.  So more and more, I'd prefer that reads/writes/mmaps
> get passed directly to the mediated driver, let them define which
> offset is which, the core is just a passthrough.  For non-PCI devices,
> like platform devices, the indexes are implementation specific, the
> user really needs to know how to work with the specific device and how
> it defines device mmio to region indexes.
>  

Ok. With this vfio_mpci looks simple.

Thanks,
Kirti.

>>>> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
>>>> +				 struct pci_region_info *region_info);  
>>>
>>> This can't be //pci_//region_info.  How do you intend to support things
>>> like sparse mmap capabilities in the user REGION_INFO ioctl when such
>>> things are not part of the mediated device API?  Seems like the driver
>>> should just return a buffer.
>>>  
>>
>> If not pci_region_info, can we use vfio_region_info here, even to fetch
>> sparse mmap capabilities from the vendor driver?
> 
> Sure, you can use vfio_region_info, then it's just a pointer to a
> buffer allocated by the callee and the mediated core is just a
> passthrough, which is probably how it should be.  Thanks,
> 
> Alex
> 



* Re: [PATCH 2/3] VFIO driver for mediated PCI device
  2016-06-29  2:54             ` [Qemu-devel] " Alex Williamson
@ 2016-06-30 16:54               ` Kirti Wankhede
  -1 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-30 16:54 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi



On 6/29/2016 8:24 AM, Alex Williamson wrote:
> On Wed, 29 Jun 2016 00:15:23 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 6/25/2016 1:15 AM, Alex Williamson wrote:
>>> On Sat, 25 Jun 2016 00:04:27 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>
>>>>>> +
>>>>>> +static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
>>>>>> +{
>>>>>> +	/* Don't support MSIX for now */
>>>>>> +	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
>>>>>> +		return -1;
>>>>>> +
>>>>>> +	return 1;    
>>>>>
>>>>> Too much hard coding here, the mediated driver should define this.
>>>>>     
>>>>
>>>> I'm testing INTX and MSI, I don't have a way to test MSIX for now. So we
>>>> thought we could add support for MSIX later. Till then, hard code it to 1.  
>>>
>>> To me it screams that there needs to be an interface to the mediated
>>> device here.  How do you even know that the mediated device intends to
>>> support MSI?  What if it wants to emulated a VF and not support INTx?
>>> This is basically just a big "TODO" flag that needs to be addressed
>>> before a non-RFC.
>>>   
>>
>> The VFIO user space app reads the emulated PCI config space of the mediated
>> device. When the MSI capability (PCI_CAP_ID_MSI) is present in the PCI
>> capability list, it calls the VFIO_DEVICE_SET_IRQS ioctl with
>> irq_set->index set to VFIO_PCI_MSI_IRQ_INDEX.
>> Similarly, MSIX is identified from the emulated config space of the
>> mediated device by checking if the capability is present, and the number of
>> vectors is extracted from the PCI_MSI_FLAGS_QSIZE field.
>> The vfio_mpci module doesn't need to query it from the vendor driver of the
>> mediated device. Depending on which interrupts to support, the mediated
>> driver should emulate the PCI config space accordingly.
> 
> Are you suggesting that if the user can determine which interrupts are
> supported and the various counts for each by querying the PCI config
> space of the mediated device then this interface should do the same,
> much like vfio_pci_get_irq_count(), such that it can provide results
> consistent with config space?  That I'm ok with.  Having the user find
> one IRQ count as they read PCI config space and another via the vfio
> API, I'm not ok with.  Thanks,
> 

Yes, it will be more like vfio_pci_get_irq_count(). I will have
mdev_get_irq_count() updated with this change in the next version of the patch.

Thanks,
Kirti.





* Re: [Qemu-devel] [PATCH 1/3] Mediated device Core driver
  2016-06-29 13:51     ` [Qemu-devel] " Xiao Guangrong
@ 2016-06-30 18:51       ` Kirti Wankhede
  -1 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-30 18:51 UTC (permalink / raw)
  To: Xiao Guangrong, alex.williamson, pbonzini, kraxel, cjia
  Cc: shuai.ruan, jike.song, kvm, kevin.tian, qemu-devel, zhiyuan.lv, bjsdjshi



On 6/29/2016 7:21 PM, Xiao Guangrong wrote:
> 
> 
> On 06/21/2016 12:31 AM, Kirti Wankhede wrote:
>> Design for Mediated Device Driver:
...
>> +static int mdev_add_attribute_group(struct device *dev,
>> +                    const struct attribute_group **groups)
>> +{
>> +    return sysfs_create_groups(&dev->kobj, groups);
>> +}
>> +
>> +static void mdev_remove_attribute_group(struct device *dev,
>> +                    const struct attribute_group **groups)
>> +{
>> +    sysfs_remove_groups(&dev->kobj, groups);
>> +}
>> +
> 
> better use device_add_groups() / device_remove_groups() instead?
> 

These are not exported from the base module. They can't be used here.


>> +}
>> +
>> +static int mdev_device_create_ops(struct mdev_device *mdev, char
>> *mdev_params)
>> +{
>> +    struct parent_device *parent = mdev->parent;
>> +    int ret;
>> +
>> +    mutex_lock(&parent->ops_lock);
>> +    if (parent->ops->create) {
>> +        ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
>> +                    mdev->instance, mdev_params);
> 
> I think it is better if we pass @mdev to this callback, then the parent
> driver
> can do its specified operations and associate it with the instance,
> e.g, via mdev->private.
> 

Yes, actually I was also thinking of changing it to

-       ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
-                                 mdev->instance, mdev_params);
+       ret = parent->ops->create(mdev, mdev_params);


>> +int mdev_register_device(struct device *dev, const struct parent_ops
>> *ops)
>> +{
>> +    int ret = 0;
>> +    struct parent_device *parent;
>> +
>> +    if (!dev || !ops)
>> +        return -EINVAL;
>> +
>> +    mutex_lock(&parent_devices.list_lock);
>> +
>> +    /* Check for duplicate */
>> +    parent = find_parent_device(dev);
>> +    if (parent) {
>> +        ret = -EEXIST;
>> +        goto add_dev_err;
>> +    }
>> +
>> +    parent = kzalloc(sizeof(*parent), GFP_KERNEL);
>> +    if (!parent) {
>> +        ret = -ENOMEM;
>> +        goto add_dev_err;
>> +    }
>> +
>> +    kref_init(&parent->ref);
>> +    list_add(&parent->next, &parent_devices.dev_list);
>> +    mutex_unlock(&parent_devices.list_lock);
> 
> It is not safe, as Alex already pointed out.
> 
>> +
>> +    parent->dev = dev;
>> +    parent->ops = ops;
>> +    mutex_init(&parent->ops_lock);
>> +    mutex_init(&parent->mdev_list_lock);
>> +    INIT_LIST_HEAD(&parent->mdev_list);
>> +    init_waitqueue_head(&parent->release_done);
> 
> And no lock to protect these operations.
> 

As I replied to Alex also, yes I'm fixing it.

>> +void mdev_unregister_device(struct device *dev)
>> +{
>> +    struct parent_device *parent;
>> +    struct mdev_device *mdev, *n;
>> +    int ret;
>> +
>> +    mutex_lock(&parent_devices.list_lock);
>> +    parent = find_parent_device(dev);
>> +
>> +    if (!parent) {
>> +        mutex_unlock(&parent_devices.list_lock);
>> +        return;
>> +    }
>> +    dev_info(dev, "MDEV: Unregistering\n");
>> +
>> +    /*
>> +     * Remove parent from the list and remove create and destroy sysfs
>> +     * files so that no new mediated device could be created for this
>> parent
>> +     */
>> +    list_del(&parent->next);
>> +    mdev_remove_sysfs_files(dev);
>> +    mutex_unlock(&parent_devices.list_lock);
>> +
> 
> find_parent_device() does not increase the refcount of the parent-device,
> after releasing the lock, is it still safe to use the device?
> 

Yes. In mdev_register_device(), kref_init() initialises the refcount to 1.
The refcount is then incremented when a child mdev is created and
decremented when a child mdev is destroyed. So even after all child mdevs
are destroyed, the refcount stays at 1 until mdev_unregister_device() is
called. In other words, mdev_register_device() takes the parent's refcount
and mdev_unregister_device() releases it.


>> +    mutex_lock(&parent->ops_lock);
>> +    mdev_remove_attribute_group(dev,
>> +                    parent->ops->dev_attr_groups);
> 
> Why mdev_remove_sysfs_files() and mdev_remove_attribute_group()
> are protected by different locks?
>

As mentioned in reply to Alex on another thread, removing these locks.

>> +
>> +int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t
>> instance)
>> +{
>> +    struct mdev_device *mdev;
>> +    struct parent_device *parent;
>> +    int ret;
>> +
>> +    parent = mdev_get_parent_by_dev(dev);
>> +    if (!parent) {
>> +        ret = -EINVAL;
>> +        goto destroy_err;
>> +    }
>> +
>> +    mdev = find_mdev_device(parent, uuid, instance);
>> +    if (!mdev) {
>> +        ret = -EINVAL;
>> +        goto destroy_err;
>> +    }
>> +
>> +    ret = mdev_device_destroy_ops(mdev, false);
>> +    if (ret)
>> +        goto destroy_err;
> 
> find_mdev_device() does not hold the refcount of mdev, is it safe?
> 

Yes, this function is just to check for a duplicate entry or for the
existence of an mdev device.

>> +
>> +    mdev_put_parent(parent);
>> +
> 
> The refcount of parent-device is released, you can not continue to
> use it.
> 

Removing these locks.

>> +    mutex_lock(&parent->mdev_list_lock);
>> +    list_del(&mdev->next);
>> +    mutex_unlock(&parent->mdev_list_lock);
>> +
>> +    mdev_put_device(mdev);
>> +    return ret;
>> +
>> +destroy_err:
>> +    mdev_put_parent(parent);
>> +    return ret;
>> +}
>> +
>> +
>> +static struct class mdev_class = {
>> +    .name        = MDEV_CLASS_NAME,
>> +    .owner        = THIS_MODULE,
>> +    .class_attrs    = mdev_class_attrs,
> 
> These interfaces, start and shutdown, are based on UUID, how
> about if we want to operate on the specified instance?
> 

Do you mean hot-plug a device?

>> +};
>> +
>> +static int __init mdev_init(void)
>> +{
>> +    int ret;
>> +
>> +    mutex_init(&parent_devices.list_lock);
>> +    INIT_LIST_HEAD(&parent_devices.dev_list);
>> +
>> +    ret = class_register(&mdev_class);
>> +    if (ret) {
>> +        pr_err("Failed to register mdev class\n");
>> +        return ret;
>> +    }
>> +
>> +    ret = mdev_bus_register();
>> +    if (ret) {
>> +        pr_err("Failed to register mdev bus\n");
>> +        class_unregister(&mdev_class);
>> +        return ret;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static void __exit mdev_exit(void)
>> +{
>> +    mdev_bus_unregister();
>> +    class_unregister(&mdev_class);
>> +}
> 
> Hmm, how to prevent if there are parent-devices existing
> when the module is being unloaded?
>

If a parent device exists, that means some other module is using
mdev_register_device()/mdev_unregister_device(). 'rmmod mdev' would fail
until that module is unloaded.

# rmmod mdev
rmmod: ERROR: Module mdev is in use by: nvidia



>> +
>> +static int uuid_parse(const char *str, uuid_le *uuid)
>> +{
>> +    int i;
>> +
>> +    if (strlen(str) < UUID_CHAR_LENGTH)
>> +        return -EINVAL;
>> +
>> +    for (i = 0; i < UUID_BYTE_LENGTH; i++) {
>> +        if (!isxdigit(str[0]) || !isxdigit(str[1])) {
>> +            pr_err("%s err", __func__);
>> +            return -EINVAL;
>> +        }
>> +
>> +        uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
>> +        str += 2;
>> +        if (is_uuid_sep(*str))
>> +            str++;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
> 
> Can we use uuid_le_to_bin()?

I couldn't find this in the kernel.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] Mediated device Core driver
  2016-06-30  7:12       ` [Qemu-devel] " Jike Song
@ 2016-06-30 18:58         ` Kirti Wankhede
  -1 siblings, 0 replies; 51+ messages in thread
From: Kirti Wankhede @ 2016-06-30 18:58 UTC (permalink / raw)
  To: Jike Song, Xiao Guangrong
  Cc: alex.williamson, pbonzini, kraxel, cjia, shuai.ruan, kvm,
	kevin.tian, qemu-devel, zhiyuan.lv, bjsdjshi



On 6/30/2016 12:42 PM, Jike Song wrote:
> On 06/29/2016 09:51 PM, Xiao Guangrong wrote:
>> On 06/21/2016 12:31 AM, Kirti Wankhede wrote:
>>> +	mutex_unlock(&parent_devices.list_lock);
>>> +	return parent;
>>> +}
>>> +
>>> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
>>> +{
>>> +	struct parent_device *parent = mdev->parent;
>>> +	int ret;
>>> +
>>> +	mutex_lock(&parent->ops_lock);
>>> +	if (parent->ops->create) {
>>> +		ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
>>> +					mdev->instance, mdev_params);
>>
>> I think it is better if we pass @mdev to this callback, then the parent driver
>> can do its specified operations and associate it with the instance,
>> e.g, via mdev->private.
>>
> 
> Just noticed that mdev->driver_data is missing in v5, I'd like to have it back :)
>

Actually, I added mdev_get_drvdata() and mdev_set_drvdata(), but I missed
earlier that mdev->dev->driver_data is used by the vfio module to keep a
reference to the vfio_device. So I am adding driver_data back to struct
mdev_device and updating mdev_get_drvdata() and mdev_set_drvdata() as below.

 static inline void *mdev_get_drvdata(struct mdev_device *mdev)
 {
-       return dev_get_drvdata(&mdev->dev);
+       return mdev->driver_data;
 }

 static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
 {
-       dev_set_drvdata(&mdev->dev, data);
+       mdev->driver_data = data;
 }
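With those accessors and the create callback taking @mdev, a parent driver could then associate its per-instance state like this (a sketch only; vgpu_state and its use are illustrative, not from the patch):

```c
static int my_vgpu_create(struct mdev_device *mdev, char *mdev_params)
{
	struct vgpu_state *vgpu;

	vgpu = kzalloc(sizeof(*vgpu), GFP_KERNEL);
	if (!vgpu)
		return -ENOMEM;

	/* Stash per-instance state; retrieved later via mdev_get_drvdata(),
	 * without touching mdev->dev->driver_data, which vfio owns. */
	mdev_set_drvdata(mdev, vgpu);
	return 0;
}
```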


> Yes, either mdev needs to be passed to the parent driver (preferred), or find_mdev_device
> needs to be exported for the parent driver (less preferred, but at least functional).
> 

Updating the create callback's argument to take mdev.

Thanks,
Kirti.



^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 1/3] Mediated device Core driver
  2016-06-20 16:31   ` [Qemu-devel] " Kirti Wankhede
@ 2016-07-04  2:08     ` Jike Song
  -1 siblings, 0 replies; 51+ messages in thread
From: Jike Song @ 2016-07-04  2:08 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, shuai.ruan, zhiyuan.lv, bjsdjshi

On 06/21/2016 12:31 AM, Kirti Wankhede wrote:
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_devices.list_lock);
> +
> +	/* Check for duplicate */
> +	parent = find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +	list_add(&parent->next, &parent_devices.dev_list);
> +	mutex_unlock(&parent_devices.list_lock);
> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +	mutex_init(&parent->ops_lock);
> +	mutex_init(&parent->mdev_list_lock);
> +	INIT_LIST_HEAD(&parent->mdev_list);
> +	init_waitqueue_head(&parent->release_done);
> +
> +	ret = mdev_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_lock(&parent_devices.list_lock);
> +	list_del(&parent->next);
> +	mutex_unlock(&parent_devices.list_lock);
> +	mdev_put_parent(parent);
> +	return ret;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_devices.list_lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
...
> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	mutex_init(&parent_devices.list_lock);
> +	INIT_LIST_HEAD(&parent_devices.dev_list);
> +
> +	ret = class_register(&mdev_class);
> +	if (ret) {
> +		pr_err("Failed to register mdev class\n");
> +		return ret;
> +	}
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		class_unregister(&mdev_class);
> +		return ret;
> +	}
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	mdev_bus_unregister();
> +	class_unregister(&mdev_class);
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)

Hi Kirti,

I have a question about the order of initialization:

	phy_driver calls mdev_register_device() in its __init function;
	mdev_register_device() accesses parent_devices.list_lock;
	parent_devices.list_lock is initialized in the __init of mdev;

The __init functions of both the phy driver and mdev are registered with
module_init. If both are selected as 'Y' in .config, it's possible that the
mutex is still uninitialized when mdev_register_device() is called.

The problem here, I think, is that both mdev and the phy driver are actually
*drivers*, so once they are built in, the initialization order is hard to
guarantee.

Do you have any idea to avoid this? Thanks!
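One conventional way out (a sketch, assuming the mdev core keeps its current init function) is to register the core at an earlier initcall level, e.g. subsys_initcall(), so that when both are built in, the core's locks and lists are initialized before any module_init()-level driver runs; when built as a module, subsys_initcall() degenerates to module_init(), so modular builds are unaffected:

```c
/* In mdev_core.c: runs before module_init()-level driver initcalls
 * when built in; behaves like module_init() when built as a module. */
subsys_initcall(mdev_init);
module_exit(mdev_exit);
```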

 
--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] Mediated device Core driver
  2016-06-30 18:51       ` Kirti Wankhede
  (?)
@ 2016-07-04  7:27       ` Xiao Guangrong
  -1 siblings, 0 replies; 51+ messages in thread
From: Xiao Guangrong @ 2016-07-04  7:27 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: shuai.ruan, jike.song, kvm, kevin.tian, qemu-devel, zhiyuan.lv, bjsdjshi



On 07/01/2016 02:51 AM, Kirti Wankhede wrote:
>
>
> On 6/29/2016 7:21 PM, Xiao Guangrong wrote:
>>
>>
>> On 06/21/2016 12:31 AM, Kirti Wankhede wrote:
>>> Design for Mediated Device Driver:
> ...
>>> +static int mdev_add_attribute_group(struct device *dev,
>>> +                    const struct attribute_group **groups)
>>> +{
>>> +    return sysfs_create_groups(&dev->kobj, groups);
>>> +}
>>> +
>>> +static void mdev_remove_attribute_group(struct device *dev,
>>> +                    const struct attribute_group **groups)
>>> +{
>>> +    sysfs_remove_groups(&dev->kobj, groups);
>>> +}
>>> +
>>
>> better use device_add_groups() / device_remove_groups() instead?
>>
>
> These are not exported from base module. They can't be used here.

Er, I did not realize that, sorry.

>
>
>>> +}
>>> +
>>> +static int mdev_device_create_ops(struct mdev_device *mdev, char
>>> *mdev_params)
>>> +{
>>> +    struct parent_device *parent = mdev->parent;
>>> +    int ret;
>>> +
>>> +    mutex_lock(&parent->ops_lock);
>>> +    if (parent->ops->create) {
>>> +        ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
>>> +                    mdev->instance, mdev_params);
>>
>> I think it is better if we pass @mdev to this callback, then the parent
>> driver
>> can do its specified operations and associate it with the instance,
>> e.g, via mdev->private.
>>
>
> Yes, actually I was also thinking of changing it to
>
> -       ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
> -                                 mdev->instance, mdev_params);
> +       ret = parent->ops->create(mdev, mdev_params);
>

Good. :)

>
>>> +int mdev_register_device(struct device *dev, const struct parent_ops
>>> *ops)
>>> +{
>>> +    int ret = 0;
>>> +    struct parent_device *parent;
>>> +
>>> +    if (!dev || !ops)
>>> +        return -EINVAL;
>>> +
>>> +    mutex_lock(&parent_devices.list_lock);
>>> +
>>> +    /* Check for duplicate */
>>> +    parent = find_parent_device(dev);
>>> +    if (parent) {
>>> +        ret = -EEXIST;
>>> +        goto add_dev_err;
>>> +    }
>>> +
>>> +    parent = kzalloc(sizeof(*parent), GFP_KERNEL);
>>> +    if (!parent) {
>>> +        ret = -ENOMEM;
>>> +        goto add_dev_err;
>>> +    }
>>> +
>>> +    kref_init(&parent->ref);
>>> +    list_add(&parent->next, &parent_devices.dev_list);
>>> +    mutex_unlock(&parent_devices.list_lock);
>>
>> It is not safe, as Alex already pointed out.
>>
>>> +
>>> +    parent->dev = dev;
>>> +    parent->ops = ops;
>>> +    mutex_init(&parent->ops_lock);
>>> +    mutex_init(&parent->mdev_list_lock);
>>> +    INIT_LIST_HEAD(&parent->mdev_list);
>>> +    init_waitqueue_head(&parent->release_done);
>>
>> And no lock to protect these operations.
>>
>
> As I replied to Alex also, yes I'm fixing it.
>
>>> +void mdev_unregister_device(struct device *dev)
>>> +{
>>> +    struct parent_device *parent;
>>> +    struct mdev_device *mdev, *n;
>>> +    int ret;
>>> +
>>> +    mutex_lock(&parent_devices.list_lock);
>>> +    parent = find_parent_device(dev);
>>> +
>>> +    if (!parent) {
>>> +        mutex_unlock(&parent_devices.list_lock);
>>> +        return;
>>> +    }
>>> +    dev_info(dev, "MDEV: Unregistering\n");
>>> +
>>> +    /*
>>> +     * Remove parent from the list and remove create and destroy sysfs
>>> +     * files so that no new mediated device could be created for this
>>> parent
>>> +     */
>>> +    list_del(&parent->next);
>>> +    mdev_remove_sysfs_files(dev);
>>> +    mutex_unlock(&parent_devices.list_lock);
>>> +
>>
>> find_parent_device() does not increase the refcount of the parent-device,
>> after releasing the lock, is it still safe to use the device?
>>
>
> Yes. In mdev_register_device(), kref_init() initialises the refcount to 1.
> The refcount is then incremented when a child mdev is created and
> decremented when a child mdev is destroyed. So even after all child mdevs
> are destroyed, the refcount stays at 1 until mdev_unregister_device() is
> called. In other words, mdev_register_device() takes the parent's refcount
> and mdev_unregister_device() releases it.
>
>
>>> +    mutex_lock(&parent->ops_lock);
>>> +    mdev_remove_attribute_group(dev,
>>> +                    parent->ops->dev_attr_groups);
>>
>> Why mdev_remove_sysfs_files() and mdev_remove_attribute_group()
>> are protected by different locks?
>>
>
> As mentioned in reply to Alex on another thread, removing these locks.
>
>>> +
>>> +int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t
>>> instance)
>>> +{
>>> +    struct mdev_device *mdev;
>>> +    struct parent_device *parent;
>>> +    int ret;
>>> +
>>> +    parent = mdev_get_parent_by_dev(dev);
>>> +    if (!parent) {
>>> +        ret = -EINVAL;
>>> +        goto destroy_err;
>>> +    }
>>> +
>>> +    mdev = find_mdev_device(parent, uuid, instance);
>>> +    if (!mdev) {
>>> +        ret = -EINVAL;
>>> +        goto destroy_err;
>>> +    }
>>> +
>>> +    ret = mdev_device_destroy_ops(mdev, false);
>>> +    if (ret)
>>> +        goto destroy_err;
>>
>> find_mdev_device() does not hold the refcount of mdev, is it safe?
>>
>
> Yes, this function is just to check for a duplicate entry or for the
> existence of an mdev device.
>

Now I am getting more confused. The caller of mdev_device_destroy(), e.g.
mdev_destroy_store(), does not hold any lock; however, you are walking
parent->mdev_list (calling find_mdev_device()). Is it really safe?

And how to prevent this scenario?

CPU 0                                          CPU 1
in mdev_device_destroy():                in mdev_unregister_device():

mdev = find_mdev_device(parent, uuid, instance); // no refcount hold

                                          mdev_device_destroy_ops(mdev, true);
                                          list_del(&mdev->next);
                                          mdev_put_device(mdev); // last refcount is gone.


mdev = find_mdev_device(parent, uuid, instance); !!!!!! PANIC !!!!!!
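A common pattern to close this kind of use-after-free race is to take the reference while still holding the list lock, so a concurrent unregister cannot drop the last reference between lookup and use. A sketch only, not from the patch (mdev_get_device() stands for whatever kref_get wrapper the series ends up with):

```c
static struct mdev_device *find_mdev_device(struct parent_device *parent,
					    uuid_le uuid, int instance)
{
	struct mdev_device *mdev = NULL, *p;

	mutex_lock(&parent->mdev_list_lock);
	list_for_each_entry(p, &parent->mdev_list, next) {
		if (uuid_le_cmp(p->uuid, uuid) == 0 && p->instance == instance) {
			mdev = mdev_get_device(p); /* ref taken under the lock */
			break;
		}
	}
	mutex_unlock(&parent->mdev_list_lock);
	return mdev;
}
```

The caller would then be responsible for a matching mdev_put_device() once it is done with the returned pointer.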


>>> +
>>> +    mdev_put_parent(parent);
>>> +
>>
>> The refcount of parent-device is released, you can not continue to
>> use it.
>>
>
> Removing these locks.
>
>>> +    mutex_lock(&parent->mdev_list_lock);
>>> +    list_del(&mdev->next);
>>> +    mutex_unlock(&parent->mdev_list_lock);
>>> +
>>> +    mdev_put_device(mdev);
>>> +    return ret;
>>> +
>>> +destroy_err:
>>> +    mdev_put_parent(parent);
>>> +    return ret;
>>> +}
>>> +
>>> +
>>> +static struct class mdev_class = {
>>> +    .name        = MDEV_CLASS_NAME,
>>> +    .owner        = THIS_MODULE,
>>> +    .class_attrs    = mdev_class_attrs,
>>
>> These interfaces, start and shutdown, are based on UUID; what if we
>> want to operate on a specific instance?
>>
>
> Do you mean hot-plug a device?

Hot-plug and hot-unplug, yes.

>
>>> +};
>>> +
>>> +static int __init mdev_init(void)
>>> +{
>>> +    int ret;
>>> +
>>> +    mutex_init(&parent_devices.list_lock);
>>> +    INIT_LIST_HEAD(&parent_devices.dev_list);
>>> +
>>> +    ret = class_register(&mdev_class);
>>> +    if (ret) {
>>> +        pr_err("Failed to register mdev class\n");
>>> +        return ret;
>>> +    }
>>> +
>>> +    ret = mdev_bus_register();
>>> +    if (ret) {
>>> +        pr_err("Failed to register mdev bus\n");
>>> +        class_unregister(&mdev_class);
>>> +        return ret;
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static void __exit mdev_exit(void)
>>> +{
>>> +    mdev_bus_unregister();
>>> +    class_unregister(&mdev_class);
>>> +}
>>
>> Hmm, how do we prevent the module from being unloaded while
>> parent devices still exist?
>>
>
> If a parent device exists, that means another module is using
> mdev_register_device()/mdev_unregister_device(). 'rmmod mdev' would
> fail until that module is unloaded.
>
> # rmmod mdev
> rmmod: ERROR: Module mdev is in use by: nvidia
>

So other modules need to explicitly increase the refcount of mdev.ko?

>
>
>>> +
>>> +static int uuid_parse(const char *str, uuid_le *uuid)
>>> +{
>>> +    int i;
>>> +
>>> +    if (strlen(str) < UUID_CHAR_LENGTH)
>>> +        return -EINVAL;
>>> +
>>> +    for (i = 0; i < UUID_BYTE_LENGTH; i++) {
>>> +        if (!isxdigit(str[0]) || !isxdigit(str[1])) {
>>> +            pr_err("%s err", __func__);
>>> +            return -EINVAL;
>>> +        }
>>> +
>>> +        uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
>>> +        str += 2;
>>> +        if (is_uuid_sep(*str))
>>> +            str++;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>
>> Can we use uuid_le_to_bin()?
>
> I couldn't find this in the kernel.

It is in lib/uuid.c; I am using the kvm tree.


Thread overview: 51+ messages
2016-06-20 16:31 [RFC PATCH v5 0/3] Add Mediated device support Kirti Wankhede
2016-06-20 16:31 ` [PATCH 1/3] Mediated device Core driver Kirti Wankhede
2016-06-21  7:38   ` Jike Song
2016-06-21 21:30   ` Alex Williamson
2016-06-24 17:54     ` Kirti Wankhede
2016-06-24 19:40       ` Alex Williamson
2016-06-30 16:48         ` Kirti Wankhede
2016-06-29 13:51   ` Xiao Guangrong
2016-06-30  7:12     ` Jike Song
2016-06-30 18:58       ` Kirti Wankhede
2016-06-30 18:51     ` Kirti Wankhede
2016-07-04  7:27       ` Xiao Guangrong
2016-07-04  2:08   ` Jike Song
2016-06-20 16:31 ` [PATCH 2/3] VFIO driver for mediated PCI device Kirti Wankhede
2016-06-21 22:48   ` Alex Williamson
2016-06-24 18:34     ` Kirti Wankhede
2016-06-24 19:45       ` Alex Williamson
2016-06-28 18:45         ` Kirti Wankhede
2016-06-29  2:54           ` Alex Williamson
2016-06-30 16:54             ` Kirti Wankhede
2016-06-30  6:34   ` Xiao Guangrong
2016-06-20 16:31 ` [PATCH 3/3] VFIO Type1 IOMMU: Add support for mediated devices Kirti Wankhede
2016-06-22  3:46   ` Alex Williamson
2016-06-28 13:02     ` Kirti Wankhede
2016-06-29  2:46       ` Alex Williamson
2016-06-30  8:28         ` Tian, Kevin
