linux-kernel.vger.kernel.org archive mirror
* [PATCH v9 00/12] Add Mediated device support
@ 2016-10-17 21:22 Kirti Wankhede
  2016-10-17 21:22 ` [PATCH v9 01/12] vfio: Mediated device Core driver Kirti Wankhede
                   ` (13 more replies)
  0 siblings, 14 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-17 21:22 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, linux-kernel,
	Kirti Wankhede

This series adds Mediated device support to the Linux host kernel. The
purpose of this series is to provide a common interface for mediated
device management that can be used by different devices. This series
introduces the mdev core module that creates and manages mediated devices,
a VFIO based driver for mediated devices that are created by the mdev core
module, and an update to the VFIO type1 IOMMU module to support pinning
and unpinning of pages for mediated devices.

What changed in v9?
mdev-core:
- Added a class named 'mdev_bus' that contains links to devices that are
  registered with the mdev core driver.
- The [<type-id>] name is created by adding the device driver string as a
  prefix to the string provided by the vendor driver.
- The 'device_api' attribute should be provided by the vendor driver and
  should show which device API is being created, for example, "vfio-pci"
  for a PCI device (a minimal sketch follows this list).
- Renamed the link to its type in the mdev device directory to 'mdev_type'.
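
A hypothetical sketch of such a vendor-provided 'device_api' attribute,
using the MDEV_TYPE_ATTR_RO() helper added by this series; the function
body and the "vfio-pci" literal are illustrative assumptions, not code
from this series:

    static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
                                   char *buf)
    {
            /* report which device API this mdev type creates */
            return sprintf(buf, "vfio-pci\n");
    }
    MDEV_TYPE_ATTR_RO(device_api);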

vfio:
- Split the changes into multiple individual commits
- Added a function to get the device_api string based on
  vfio_device_info.flags.

vfio_iommu_type1:
- Handled the case where all devices attached to a normal IOMMU API domain
  go away while mdev devices still exist in the domain. Updated page
  accounting for the local domain.
- Similarly, if a device is attached to a normal IOMMU API domain, mappings
  are established and page accounting is updated accordingly.
- Tested hot plug and hot unplug of vGPU and GPU pass-through devices with
  a Linux VM.

Documentation:
- Updated Documentation and sample driver, mtty.c, accordingly.

Kirti Wankhede (12):
  vfio: Mediated device Core driver
  vfio: VFIO based driver for Mediated devices
  vfio: Rearrange functions to get vfio_group from dev
  vfio iommu: Add support for mediated devices
  vfio: Introduce common function to add capabilities
  vfio_pci: Update vfio_pci to use vfio_info_add_capability()
  vfio: Introduce vfio_set_irqs_validate_and_prepare()
  vfio_pci: Updated to use vfio_set_irqs_validate_and_prepare()
  vfio_platform: Updated to use vfio_set_irqs_validate_and_prepare()
  vfio: Add function to get device_api string from
    vfio_device_info.flags
  docs: Add Documentation for Mediated devices
  docs: Sample driver to demonstrate how to use Mediated device
    framework.

 Documentation/vfio-mdev/Makefile                 |   13 +
 Documentation/vfio-mdev/mtty.c                   | 1429 ++++++++++++++++++++++
 Documentation/vfio-mdev/vfio-mediated-device.txt |  389 ++++++
 drivers/vfio/Kconfig                             |    1 +
 drivers/vfio/Makefile                            |    1 +
 drivers/vfio/mdev/Kconfig                        |   18 +
 drivers/vfio/mdev/Makefile                       |    5 +
 drivers/vfio/mdev/mdev_core.c                    |  372 ++++++
 drivers/vfio/mdev/mdev_driver.c                  |  128 ++
 drivers/vfio/mdev/mdev_private.h                 |   41 +
 drivers/vfio/mdev/mdev_sysfs.c                   |  296 +++++
 drivers/vfio/mdev/vfio_mdev.c                    |  148 +++
 drivers/vfio/pci/vfio_pci.c                      |  101 +-
 drivers/vfio/platform/vfio_platform_common.c     |   31 +-
 drivers/vfio/vfio.c                              |  287 ++++-
 drivers/vfio/vfio_iommu_type1.c                  |  692 +++++++++--
 include/linux/mdev.h                             |  177 +++
 include/linux/vfio.h                             |   23 +-
 18 files changed, 3948 insertions(+), 204 deletions(-)
 create mode 100644 Documentation/vfio-mdev/Makefile
 create mode 100644 Documentation/vfio-mdev/mtty.c
 create mode 100644 Documentation/vfio-mdev/vfio-mediated-device.txt
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 drivers/vfio/mdev/vfio_mdev.c
 create mode 100644 include/linux/mdev.h

-- 
2.7.0


* [PATCH v9 01/12] vfio: Mediated device Core driver
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
@ 2016-10-17 21:22 ` Kirti Wankhede
  2016-10-18 23:16   ` Alex Williamson
                     ` (2 more replies)
  2016-10-17 21:22 ` [PATCH v9 02/12] vfio: VFIO based driver for Mediated devices Kirti Wankhede
                   ` (12 subsequent siblings)
  13 siblings, 3 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-17 21:22 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, linux-kernel,
	Kirti Wankhede

Design for Mediated Device Driver:
The main purpose of this driver is to provide a common interface for
mediated device management that can be used by different drivers of
different devices.

This module provides a generic interface to create a device, add it to the
mediated bus, add the device to an IOMMU group and then add it to a VFIO
group.

Below is the high-level block diagram, with NVIDIA, Intel and IBM devices
as examples, since these are the devices which are going to actively use
this module as of now.

 +---------------+
 |               |
 | +-----------+ |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |  mdev     | |                         |              |
 | |  bus      | +------------------------>+              |<-> VFIO user
 | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
 | |           | |                         |              |
 | +-----------+ |                         +--------------+
 |               |
 |  MDEV CORE    |
 |   MODULE      |
 |   mdev.ko     |
 | +-----------+ |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         |  nvidia.ko   |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--------------+
 | | interface | |<------------------------+              |
 | |           | |                         |  i915.ko     |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | |           | |
 | |           | |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | +-----------+ |
 +---------------+

The core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when new device created
  * @remove: called when device removed
  * @driver: device driver structure
  *
  **/
struct mdev_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         struct device_driver    driver;
};

A mediated bus driver for mdev devices should use these interfaces to
register with and unregister from the core driver, respectively:

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

The mediated bus driver is responsible for adding mediated devices to and
deleting them from the VFIO group when devices are bound to and unbound
from the driver.
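
A minimal sketch of such a bus driver, assuming hypothetical 'sample_*'
names (the vfio_mdev driver added later in this series is the real user
of this interface):

#include <linux/device.h>
#include <linux/mdev.h>
#include <linux/module.h>

static int sample_mdev_probe(struct device *dev)
{
	/* e.g. add the mediated device to a VFIO group */
	return 0;
}

static void sample_mdev_remove(struct device *dev)
{
	/* undo whatever probe() set up */
}

static struct mdev_driver sample_mdev_driver = {
	.name   = "sample_mdev",
	.probe  = sample_mdev_probe,
	.remove = sample_mdev_remove,
};

static int __init sample_mdev_init(void)
{
	return mdev_register_driver(&sample_mdev_driver, THIS_MODULE);
}

static void __exit sample_mdev_exit(void)
{
	mdev_unregister_driver(&sample_mdev_driver);
}

module_init(sample_mdev_init)
module_exit(sample_mdev_exit)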

2. Physical device driver interface
This interface provides the vendor driver a set of APIs to manage physical
device related work in its driver. The APIs are:

* dev_attr_groups: attributes of the parent device.
* mdev_attr_groups: attributes of the mediated device.
* supported_type_groups: attributes to define supported types. This is a
			 mandatory field.
* create: to allocate basic resources in driver for a mediated device.
* remove: to free resources in driver when mediated device is destroyed.
* open: open callback of mediated device
* release: release callback of mediated device
* read: read emulation callback.
* write: write emulation callback.
* mmap: mmap emulation callback.
* ioctl: ioctl callback.

Drivers should use these interfaces to register a device with and
unregister it from the mdev core driver, respectively:

extern int  mdev_register_device(struct device *dev,
                                 const struct parent_ops *ops);
extern void mdev_unregister_device(struct device *dev);

There are no locks to serialize the above callbacks in the mdev driver and
the vfio_mdev driver. If required, the vendor driver can use locks to
serialize the above APIs in its driver.
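
For illustration, a vendor-side sketch of this registration; all
'sample_*' names and the empty type-group list are assumptions, a real
driver would fill in its supported types and remaining callbacks:

#include <linux/device.h>
#include <linux/mdev.h>
#include <linux/module.h>

/* per-type attribute groups that populate mdev_supported_types in sysfs;
 * the contents are vendor specific and omitted here */
static struct attribute_group *sample_type_groups[] = {
	NULL,
};

static int sample_create(struct kobject *kobj, struct mdev_device *mdev)
{
	/* allocate vendor-private state for this mediated device */
	return 0;
}

static int sample_remove(struct mdev_device *mdev)
{
	/* free vendor-private state; may fail if the mdev is still in use */
	return 0;
}

static const struct parent_ops sample_parent_ops = {
	.owner			= THIS_MODULE,
	.supported_type_groups	= sample_type_groups,	/* mandatory */
	.create			= sample_create,	/* mandatory */
	.remove			= sample_remove,	/* mandatory */
};

/* called from the physical device driver's probe path */
static int sample_register_parent(struct device *dev)
{
	return mdev_register_device(dev, &sample_parent_ops);
}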

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I73a5084574270b14541c529461ea2f03c292d510
---
 drivers/vfio/Kconfig             |   1 +
 drivers/vfio/Makefile            |   1 +
 drivers/vfio/mdev/Kconfig        |  11 ++
 drivers/vfio/mdev/Makefile       |   4 +
 drivers/vfio/mdev/mdev_core.c    | 372 +++++++++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_driver.c  | 128 ++++++++++++++
 drivers/vfio/mdev/mdev_private.h |  41 +++++
 drivers/vfio/mdev/mdev_sysfs.c   | 296 +++++++++++++++++++++++++++++++
 include/linux/mdev.h             | 177 +++++++++++++++++++
 9 files changed, 1031 insertions(+)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 include/linux/mdev.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index da6e2ce77495..23eced02aaf6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
 
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
+source "drivers/vfio/mdev/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f63fea..4a23c13b6be4 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
+obj-$(CONFIG_VFIO_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
new file mode 100644
index 000000000000..93addace9a67
--- /dev/null
+++ b/drivers/vfio/mdev/Kconfig
@@ -0,0 +1,11 @@
+
+config VFIO_MDEV
+    tristate "Mediated device driver framework"
+    depends on VFIO
+    default n
+    help
+        Provides a framework to virtualize devices which don't have SR-IOV
+	capability built-in.
+	See Documentation/vfio-mdev/vfio-mediated-device.txt for more details.
+
+        If you don't know what to do here, say N.
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
new file mode 100644
index 000000000000..31bc04801d94
--- /dev/null
+++ b/drivers/vfio/mdev/Makefile
@@ -0,0 +1,4 @@
+
+mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
+
+obj-$(CONFIG_VFIO_MDEV) += mdev.o
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
new file mode 100644
index 000000000000..7db5ec164aeb
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -0,0 +1,372 @@
+/*
+ * Mediated device Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/sysfs.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION		"0.1"
+#define DRIVER_AUTHOR		"NVIDIA Corporation"
+#define DRIVER_DESC		"Mediated device Core Driver"
+
+static LIST_HEAD(parent_list);
+static DEFINE_MUTEX(parent_list_lock);
+static struct class_compat *mdev_bus_compat_class;
+
+static int _find_mdev_device(struct device *dev, void *data)
+{
+	struct mdev_device *mdev;
+
+	if (!dev_is_mdev(dev))
+		return 0;
+
+	mdev = to_mdev_device(dev);
+
+	if (uuid_le_cmp(mdev->uuid, *(uuid_le *)data) == 0)
+		return 1;
+
+	return 0;
+}
+
+static struct mdev_device *__find_mdev_device(struct parent_device *parent,
+					      uuid_le uuid)
+{
+	struct device *dev;
+
+	dev = device_find_child(parent->dev, &uuid, _find_mdev_device);
+	if (!dev)
+		return NULL;
+
+	put_device(dev);
+
+	return to_mdev_device(dev);
+}
+
+/* Should be called holding parent_list_lock */
+static struct parent_device *__find_parent_device(struct device *dev)
+{
+	struct parent_device *parent;
+
+	list_for_each_entry(parent, &parent_list, next) {
+		if (parent->dev == dev)
+			return parent;
+	}
+	return NULL;
+}
+
+static void mdev_release_parent(struct kref *kref)
+{
+	struct parent_device *parent = container_of(kref, struct parent_device,
+						    ref);
+	struct device *dev = parent->dev;
+
+	kfree(parent);
+	put_device(dev);
+}
+
+static
+inline struct parent_device *mdev_get_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_get(&parent->ref);
+
+	return parent;
+}
+
+static inline void mdev_put_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_put(&parent->ref, mdev_release_parent);
+}
+
+static int mdev_device_create_ops(struct kobject *kobj,
+				  struct mdev_device *mdev)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret;
+
+	ret = parent->ops->create(kobj, mdev);
+	if (ret)
+		return ret;
+
+	ret = sysfs_create_groups(&mdev->dev.kobj,
+				  parent->ops->mdev_attr_groups);
+	if (ret)
+		parent->ops->remove(mdev);
+
+	return ret;
+}
+
+static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret;
+
+	/*
+	 * The vendor driver can return an error if a VMM or userspace
+	 * application is using this mdev device.
+	 */
+	ret = parent->ops->remove(mdev);
+	if (ret && !force_remove)
+		return -EBUSY;
+
+	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
+	return 0;
+}
+
+static int mdev_device_remove_cb(struct device *dev, void *data)
+{
+	return mdev_device_remove(dev, data ? *(bool *)data : true);
+}
+
+/*
+ * mdev_register_device : Register a device
+ * @dev: device structure representing parent device.
+ * @ops: Parent device operation structure to be registered.
+ *
+ * Add device to list of registered parent devices.
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_device(struct device *dev, const struct parent_ops *ops)
+{
+	int ret = 0;
+	struct parent_device *parent;
+
+	/* check for mandatory ops */
+	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
+		return -EINVAL;
+
+	dev = get_device(dev);
+	if (!dev)
+		return -EINVAL;
+
+	mutex_lock(&parent_list_lock);
+
+	/* Check for duplicate */
+	parent = __find_parent_device(dev);
+	if (parent) {
+		ret = -EEXIST;
+		goto add_dev_err;
+	}
+
+	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
+	if (!parent) {
+		ret = -ENOMEM;
+		goto add_dev_err;
+	}
+
+	kref_init(&parent->ref);
+
+	parent->dev = dev;
+	parent->ops = ops;
+
+	ret = parent_create_sysfs_files(parent);
+	if (ret) {
+		mutex_unlock(&parent_list_lock);
+		mdev_put_parent(parent);
+		return ret;
+	}
+
+	ret = class_compat_create_link(mdev_bus_compat_class, dev, NULL);
+	if (ret)
+		dev_warn(dev, "Failed to create compatibility class link\n");
+
+	list_add(&parent->next, &parent_list);
+	mutex_unlock(&parent_list_lock);
+
+	dev_info(dev, "MDEV: Registered\n");
+	return 0;
+
+add_dev_err:
+	mutex_unlock(&parent_list_lock);
+	put_device(dev);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_register_device);
+
+/*
+ * mdev_unregister_device : Unregister a parent device
+ * @dev: device structure representing parent device.
+ *
+ * Remove device from list of registered parent devices. Give a chance to free
+ * existing mediated devices for given device.
+ */
+
+void mdev_unregister_device(struct device *dev)
+{
+	struct parent_device *parent;
+	bool force_remove = true;
+
+	mutex_lock(&parent_list_lock);
+	parent = __find_parent_device(dev);
+
+	if (!parent) {
+		mutex_unlock(&parent_list_lock);
+		return;
+	}
+	dev_info(dev, "MDEV: Unregistering\n");
+
+	/*
+	 * Remove parent from the list and remove "mdev_supported_types"
+	 * sysfs files so that no new mediated device could be
+	 * created for this parent
+	 */
+	list_del(&parent->next);
+	parent_remove_sysfs_files(parent);
+
+	mutex_unlock(&parent_list_lock);
+
+	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
+
+	device_for_each_child(dev, (void *)&force_remove,
+			      mdev_device_remove_cb);
+	mdev_put_parent(parent);
+}
+EXPORT_SYMBOL(mdev_unregister_device);
+
+static void mdev_device_release(struct device *dev)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	dev_dbg(&mdev->dev, "MDEV: destroying\n");
+	kfree(mdev);
+}
+
+int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
+{
+	int ret;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	struct mdev_type *type = to_mdev_type(kobj);
+
+	parent = mdev_get_parent(type->parent);
+	if (!parent)
+		return -EINVAL;
+
+	/* Check for duplicate */
+	mdev = __find_mdev_device(parent, uuid);
+	if (mdev) {
+		ret = -EEXIST;
+		goto create_err;
+	}
+
+	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
+	if (!mdev) {
+		ret = -ENOMEM;
+		goto create_err;
+	}
+
+	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
+	mdev->parent = parent;
+	kref_init(&mdev->ref);
+
+	mdev->dev.parent  = dev;
+	mdev->dev.bus     = &mdev_bus_type;
+	mdev->dev.release = mdev_device_release;
+	dev_set_name(&mdev->dev, "%pUl", uuid.b);
+
+	ret = device_register(&mdev->dev);
+	if (ret) {
+		put_device(&mdev->dev);
+		goto create_err;
+	}
+
+	ret = mdev_device_create_ops(kobj, mdev);
+	if (ret)
+		goto create_failed;
+
+	ret = mdev_create_sysfs_files(&mdev->dev, type);
+	if (ret) {
+		mdev_device_remove_ops(mdev, true);
+		goto create_failed;
+	}
+
+	mdev->type_kobj = kobj;
+	dev_dbg(&mdev->dev, "MDEV: created\n");
+
+	return ret;
+
+create_failed:
+	device_unregister(&mdev->dev);
+
+create_err:
+	mdev_put_parent(parent);
+	return ret;
+}
+
+int mdev_device_remove(struct device *dev, bool force_remove)
+{
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	struct mdev_type *type;
+	int ret = 0;
+
+	if (!dev_is_mdev(dev))
+		return 0;
+
+	mdev = to_mdev_device(dev);
+	parent = mdev->parent;
+	type = to_mdev_type(mdev->type_kobj);
+
+	ret = mdev_device_remove_ops(mdev, force_remove);
+	if (ret)
+		return ret;
+
+	mdev_remove_sysfs_files(dev, type);
+	device_unregister(dev);
+	mdev_put_parent(parent);
+	return ret;
+}
+
+static int __init mdev_init(void)
+{
+	int ret;
+
+	ret = mdev_bus_register();
+	if (ret) {
+		pr_err("Failed to register mdev bus\n");
+		return ret;
+	}
+
+	mdev_bus_compat_class = class_compat_register("mdev_bus");
+	if (!mdev_bus_compat_class) {
+		mdev_bus_unregister();
+		return -ENOMEM;
+	}
+
+	/*
+	 * Attempt to load the vfio_mdev driver.  This gives us a working
+	 * environment without the user needing to explicitly load it.
+	 */
+	request_module_nowait("vfio_mdev");
+
+	return ret;
+}
+
+static void __exit mdev_exit(void)
+{
+	class_compat_unregister(mdev_bus_compat_class);
+	mdev_bus_unregister();
+}
+
+module_init(mdev_init)
+module_exit(mdev_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
new file mode 100644
index 000000000000..7768ef87f528
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_driver.c
@@ -0,0 +1,128 @@
+/*
+ * MDEV driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+static int mdev_attach_iommu(struct mdev_device *mdev)
+{
+	int ret;
+	struct iommu_group *group;
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
+		return PTR_ERR(group);
+	}
+
+	ret = iommu_group_add_device(group, &mdev->dev);
+	if (ret) {
+		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
+		goto attach_fail;
+	}
+
+	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
+				 iommu_group_id(group));
+attach_fail:
+	iommu_group_put(group);
+	return ret;
+}
+
+static void mdev_detach_iommu(struct mdev_device *mdev)
+{
+	iommu_group_remove_device(&mdev->dev);
+	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
+}
+
+static int mdev_probe(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	ret = mdev_attach_iommu(mdev);
+	if (ret) {
+		dev_err(dev, "Failed to attach IOMMU\n");
+		return ret;
+	}
+
+	if (drv && drv->probe)
+		ret = drv->probe(dev);
+
+	if (ret)
+		mdev_detach_iommu(mdev);
+
+	return ret;
+}
+
+static int mdev_remove(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	if (drv && drv->remove)
+		drv->remove(dev);
+
+	mdev_detach_iommu(mdev);
+
+	return 0;
+}
+
+struct bus_type mdev_bus_type = {
+	.name		= "mdev",
+	.probe		= mdev_probe,
+	.remove		= mdev_remove,
+};
+EXPORT_SYMBOL_GPL(mdev_bus_type);
+
+/*
+ * mdev_register_driver - register a new MDEV driver
+ * @drv: the driver to register
+ * @owner: module owner of driver to be registered
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &mdev_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_register_driver);
+
+/*
+ * mdev_unregister_driver - unregister MDEV driver
+ * @drv: the driver to unregister
+ *
+ */
+void mdev_unregister_driver(struct mdev_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_unregister_driver);
+
+int mdev_bus_register(void)
+{
+	return bus_register(&mdev_bus_type);
+}
+
+void mdev_bus_unregister(void)
+{
+	bus_unregister(&mdev_bus_type);
+}
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
new file mode 100644
index 000000000000..000c93fcfdbd
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -0,0 +1,41 @@
+/*
+ * Mediated device internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_PRIVATE_H
+#define MDEV_PRIVATE_H
+
+int  mdev_bus_register(void);
+void mdev_bus_unregister(void);
+
+struct mdev_type {
+	struct kobject kobj;
+	struct kobject *devices_kobj;
+	struct parent_device *parent;
+	struct list_head next;
+	struct attribute_group *group;
+};
+
+#define to_mdev_type_attr(_attr)	\
+	container_of(_attr, struct mdev_type_attribute, attr)
+#define to_mdev_type(_kobj)		\
+	container_of(_kobj, struct mdev_type, kobj)
+
+int  parent_create_sysfs_files(struct parent_device *parent);
+void parent_remove_sysfs_files(struct parent_device *parent);
+
+int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type);
+void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
+
+int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid);
+int  mdev_device_remove(struct device *dev, bool force_remove);
+
+#endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
new file mode 100644
index 000000000000..426e35cf79d0
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -0,0 +1,296 @@
+/*
+ * File attributes for Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+/* Static functions */
+
+static ssize_t mdev_type_attr_show(struct kobject *kobj,
+				     struct attribute *__attr, char *buf)
+{
+	struct mdev_type_attribute *attr = to_mdev_type_attr(__attr);
+	struct mdev_type *type = to_mdev_type(kobj);
+	ssize_t ret = -EIO;
+
+	if (attr->show)
+		ret = attr->show(kobj, type->parent->dev, buf);
+	return ret;
+}
+
+static ssize_t mdev_type_attr_store(struct kobject *kobj,
+				      struct attribute *__attr,
+				      const char *buf, size_t count)
+{
+	struct mdev_type_attribute *attr = to_mdev_type_attr(__attr);
+	struct mdev_type *type = to_mdev_type(kobj);
+	ssize_t ret = -EIO;
+
+	if (attr->store)
+		ret = attr->store(&type->kobj, type->parent->dev, buf, count);
+	return ret;
+}
+
+static const struct sysfs_ops mdev_type_sysfs_ops = {
+	.show = mdev_type_attr_show,
+	.store = mdev_type_attr_store,
+};
+
+static ssize_t create_store(struct kobject *kobj, struct device *dev,
+			    const char *buf, size_t count)
+{
+	char *str;
+	uuid_le uuid;
+	int ret;
+
+	if (count < UUID_STRING_LEN)
+		return -EINVAL;
+
+	str = kstrndup(buf, count, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	ret = uuid_le_to_bin(str, &uuid);
+	if (!ret) {
+
+		ret = mdev_device_create(kobj, dev, uuid);
+		if (ret)
+			pr_err("mdev_create: Failed to create mdev device\n");
+		else
+			ret = count;
+	}
+
+	kfree(str);
+	return ret;
+}
+
+MDEV_TYPE_ATTR_WO(create);
+
+static void mdev_type_release(struct kobject *kobj)
+{
+	struct mdev_type *type = to_mdev_type(kobj);
+
+	pr_debug("Releasing group %s\n", kobj->name);
+	kfree(type);
+}
+
+static struct kobj_type mdev_type_ktype = {
+	.sysfs_ops = &mdev_type_sysfs_ops,
+	.release = mdev_type_release,
+};
+
+struct mdev_type *add_mdev_supported_type(struct parent_device *parent,
+					  struct attribute_group *group)
+{
+	struct mdev_type *type;
+	int ret;
+
+	if (!group->name) {
+		pr_err("%s: Type name empty!\n", __func__);
+		return ERR_PTR(-EINVAL);
+	}
+
+	type = kzalloc(sizeof(*type), GFP_KERNEL);
+	if (!type)
+		return ERR_PTR(-ENOMEM);
+
+	type->kobj.kset = parent->mdev_types_kset;
+
+	ret = kobject_init_and_add(&type->kobj, &mdev_type_ktype, NULL,
+				   "%s-%s", dev_driver_string(parent->dev),
+				   group->name);
+	if (ret) {
+		kfree(type);
+		return ERR_PTR(ret);
+	}
+
+	ret = sysfs_create_file(&type->kobj, &mdev_type_attr_create.attr);
+	if (ret)
+		goto attr_create_failed;
+
+	type->devices_kobj = kobject_create_and_add("devices", &type->kobj);
+	if (!type->devices_kobj) {
+		ret = -ENOMEM;
+		goto attr_devices_failed;
+	}
+
+	ret = sysfs_create_files(&type->kobj,
+				 (const struct attribute **)group->attrs);
+	if (ret) {
+		ret = -ENOMEM;
+		goto attrs_failed;
+	}
+
+	type->group = group;
+	type->parent = parent;
+	return type;
+
+attrs_failed:
+	kobject_put(type->devices_kobj);
+attr_devices_failed:
+	sysfs_remove_file(&type->kobj, &mdev_type_attr_create.attr);
+attr_create_failed:
+	kobject_del(&type->kobj);
+	kobject_put(&type->kobj);
+	return ERR_PTR(ret);
+}
+
+static void remove_mdev_supported_type(struct mdev_type *type)
+{
+	sysfs_remove_files(&type->kobj,
+			   (const struct attribute **)type->group->attrs);
+	kobject_put(type->devices_kobj);
+	sysfs_remove_file(&type->kobj, &mdev_type_attr_create.attr);
+	kobject_del(&type->kobj);
+	kobject_put(&type->kobj);
+}
+
+static int add_mdev_supported_type_groups(struct parent_device *parent)
+{
+	int i;
+
+	for (i = 0; parent->ops->supported_type_groups[i]; i++) {
+		struct mdev_type *type;
+
+		type = add_mdev_supported_type(parent,
+					parent->ops->supported_type_groups[i]);
+		if (IS_ERR(type)) {
+			struct mdev_type *ltype, *tmp;
+
+			list_for_each_entry_safe(ltype, tmp, &parent->type_list,
+						  next) {
+				list_del(&ltype->next);
+				remove_mdev_supported_type(ltype);
+			}
+			return PTR_ERR(type);
+		}
+		list_add(&type->next, &parent->type_list);
+	}
+	return 0;
+}
+
+/* mdev sysfs Functions */
+
+void parent_remove_sysfs_files(struct parent_device *parent)
+{
+	struct mdev_type *type, *tmp;
+
+	list_for_each_entry_safe(type, tmp, &parent->type_list, next) {
+		list_del(&type->next);
+		remove_mdev_supported_type(type);
+	}
+
+	sysfs_remove_groups(&parent->dev->kobj, parent->ops->dev_attr_groups);
+	kset_unregister(parent->mdev_types_kset);
+}
+
+int parent_create_sysfs_files(struct parent_device *parent)
+{
+	int ret;
+
+	parent->mdev_types_kset = kset_create_and_add("mdev_supported_types",
+					       NULL, &parent->dev->kobj);
+
+	if (!parent->mdev_types_kset)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&parent->type_list);
+
+	ret = sysfs_create_groups(&parent->dev->kobj,
+				  parent->ops->dev_attr_groups);
+	if (ret)
+		goto create_err;
+
+	ret = add_mdev_supported_type_groups(parent);
+	if (ret)
+		sysfs_remove_groups(&parent->dev->kobj,
+				    parent->ops->dev_attr_groups);
+	else
+		return ret;
+
+create_err:
+	kset_unregister(parent->mdev_types_kset);
+	return ret;
+}
+
+static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
+			    const char *buf, size_t count)
+{
+	unsigned long val;
+
+	if (kstrtoul(buf, 0, &val) < 0)
+		return -EINVAL;
+
+	if (val && device_remove_file_self(dev, attr)) {
+		int ret;
+
+		ret = mdev_device_remove(dev, false);
+		if (ret) {
+			device_create_file(dev, attr);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+static DEVICE_ATTR_WO(remove);
+
+static const struct attribute *mdev_device_attrs[] = {
+	&dev_attr_remove.attr,
+	NULL,
+};
+
+int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type)
+{
+	int ret;
+
+	ret = sysfs_create_files(&dev->kobj, mdev_device_attrs);
+	if (ret) {
+		pr_err("Failed to create remove sysfs entry\n");
+		return ret;
+	}
+
+	ret = sysfs_create_link(type->devices_kobj, &dev->kobj, dev_name(dev));
+	if (ret) {
+		pr_err("Failed to create symlink in types\n");
+		goto device_link_failed;
+	}
+
+	ret = sysfs_create_link(&dev->kobj, &type->kobj, "mdev_type");
+	if (ret) {
+		pr_err("Failed to create symlink in device directory\n");
+		goto type_link_failed;
+	}
+
+	return ret;
+
+type_link_failed:
+	sysfs_remove_link(type->devices_kobj, dev_name(dev));
+device_link_failed:
+	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
+	return ret;
+}
+
+void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type)
+{
+	sysfs_remove_link(&dev->kobj, "mdev_type");
+	sysfs_remove_link(type->devices_kobj, dev_name(dev));
+	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
+
+}
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
new file mode 100644
index 000000000000..727209b2a67f
--- /dev/null
+++ b/include/linux/mdev.h
@@ -0,0 +1,177 @@
+/*
+ * Mediated device definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_H
+#define MDEV_H
+
+#include <uapi/linux/vfio.h>
+
+struct parent_device;
+
+/* Mediated device */
+struct mdev_device {
+	struct device		dev;
+	struct parent_device	*parent;
+	uuid_le			uuid;
+	void			*driver_data;
+
+	/* internal only */
+	struct kref		ref;
+	struct list_head	next;
+	struct kobject		*type_kobj;
+};
+
+
+/**
+ * struct parent_ops - Structure to be registered for each parent device to
+ * register the device to mdev module.
+ *
+ * @owner:		The module owner.
+ * @dev_attr_groups:	Attributes of the parent device.
+ * @mdev_attr_groups:	Attributes of the mediated device.
+ * @supported_type_groups: Attributes to define supported types. It is mandatory
+ *			to provide supported types.
+ * @create:		Called to allocate basic resources in parent device's
+ *			driver for a particular mediated device. It is
+ *			mandatory to provide create ops.
+ *			@kobj: kobject of type for which 'create' is called.
+ *			@mdev: mdev_device structure of the mediated device
+ *			      that is being created
+ *			Returns integer: success (0) or error (< 0)
+ * @remove:		Called to free resources in parent device's driver for
+ *			a mediated device. It is mandatory to provide 'remove'
+ *			ops.
+ *			@mdev: mdev_device device structure which is being
+ *			       destroyed
+ *			Returns integer: success (0) or error (< 0)
+ * @open:		Open mediated device.
+ *			@mdev: mediated device.
+ *			Returns integer: success (0) or error (< 0)
+ * @release:		release mediated device
+ *			@mdev: mediated device.
+ * @read:		Read emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: read buffer
+ *			@count: number of bytes to read
+ *			@ppos: address.
+ *			Returns number of bytes read on success or error.
+ * @write:		Write emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: write buffer
+ *			@count: number of bytes to be written
+ *			@ppos: address.
+ *			Returns number of bytes written on success or error.
+ * @ioctl:		IOCTL callback
+ *			@mdev: mediated device structure
+ *			@cmd: ioctl command
+ *			@arg: arguments to ioctl
+ * @mmap:		mmap callback
+ * A parent device that supports mediated devices should be registered with
+ * the mdev module with a parent_ops structure.
+ */
+
+struct parent_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **mdev_attr_groups;
+	struct attribute_group **supported_type_groups;
+
+	int     (*create)(struct kobject *kobj, struct mdev_device *mdev);
+	int     (*remove)(struct mdev_device *mdev);
+	int     (*open)(struct mdev_device *mdev);
+	void    (*release)(struct mdev_device *mdev);
+	ssize_t (*read)(struct mdev_device *mdev, char __user *buf,
+			size_t count, loff_t *ppos);
+	ssize_t (*write)(struct mdev_device *mdev, const char __user *buf,
+			 size_t count, loff_t *ppos);
+	ssize_t (*ioctl)(struct mdev_device *mdev, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
+};
+
+/* Parent Device */
+struct parent_device {
+	struct device		*dev;
+	const struct parent_ops	*ops;
+
+	/* internal */
+	struct kref		ref;
+	struct list_head	next;
+	struct kset *mdev_types_kset;
+	struct list_head	type_list;
+};
+
+/* interface for exporting mdev supported type attributes */
+struct mdev_type_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct kobject *kobj, struct device *dev, char *buf);
+	ssize_t (*store)(struct kobject *kobj, struct device *dev,
+			 const char *buf, size_t count);
+};
+
+#define MDEV_TYPE_ATTR(_name, _mode, _show, _store)		\
+struct mdev_type_attribute mdev_type_attr_##_name =		\
+	__ATTR(_name, _mode, _show, _store)
+#define MDEV_TYPE_ATTR_RW(_name) \
+	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_RW(_name)
+#define MDEV_TYPE_ATTR_RO(_name) \
+	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_RO(_name)
+#define MDEV_TYPE_ATTR_WO(_name) \
+	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_WO(_name)
+
+/**
+ * struct mdev_driver - Mediated device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @driver: device driver structure
+ *
+ **/
+struct mdev_driver {
+	const char *name;
+	int  (*probe)(struct device *dev);
+	void (*remove)(struct device *dev);
+	struct device_driver driver;
+};
+
+static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
+}
+
+static inline struct mdev_device *to_mdev_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
+}
+
+static inline void *mdev_get_drvdata(struct mdev_device *mdev)
+{
+	return mdev->driver_data;
+}
+
+static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
+{
+	mdev->driver_data = data;
+}
+
+extern struct bus_type mdev_bus_type;
+
+#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
+
+extern int  mdev_register_device(struct device *dev,
+				 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+#endif /* MDEV_H */
-- 
2.7.0


* [PATCH v9 02/12] vfio: VFIO based driver for Mediated devices
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
  2016-10-17 21:22 ` [PATCH v9 01/12] vfio: Mediated device Core driver Kirti Wankhede
@ 2016-10-17 21:22 ` Kirti Wankhede
  2016-10-26  6:57   ` Tian, Kevin
  2016-10-17 21:22 ` [PATCH v9 03/12] vfio: Rearrange functions to get vfio_group from dev Kirti Wankhede
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-17 21:22 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, linux-kernel,
	Kirti Wankhede

The vfio_mdev driver registers with the mdev core driver.
The mdev core driver creates a mediated device and calls the probe routine
of the vfio_mdev driver for each device.
The probe routine of the vfio_mdev driver adds the mediated device to the
VFIO core module.

This driver forms a shim layer that passes through VFIO device operations
to the vendor driver for mediated devices.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
---
 drivers/vfio/mdev/Kconfig     |   7 ++
 drivers/vfio/mdev/Makefile    |   1 +
 drivers/vfio/mdev/vfio_mdev.c | 148 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 156 insertions(+)
 create mode 100644 drivers/vfio/mdev/vfio_mdev.c

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index 93addace9a67..6cef0c4d2ceb 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,3 +9,10 @@ config VFIO_MDEV
 	See Documentation/vfio-mdev/vfio-mediated-device.txt for more details.
 
         If you don't know what to do here, say N.
+
+config VFIO_MDEV_DEVICE
+    tristate "VFIO support for Mediated devices"
+    depends on VFIO && VFIO_MDEV
+    default n
+    help
+        VFIO based driver for mediated devices.
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 31bc04801d94..fa2d5ea466ee 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,3 +2,4 @@
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
 obj-$(CONFIG_VFIO_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MDEV_DEVICE) += vfio_mdev.o
diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
new file mode 100644
index 000000000000..b7b47604ce7a
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mdev.c
@@ -0,0 +1,148 @@
+/*
+ * VFIO based driver for Mediated device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/vfio.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC     "VFIO based driver for Mediated device"
+
+static int vfio_mdev_open(void *device_data)
+{
+	struct mdev_device *mdev = device_data;
+	struct parent_device *parent = mdev->parent;
+	int ret;
+
+	if (unlikely(!parent->ops->open))
+		return -EINVAL;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	ret = parent->ops->open(mdev);
+	if (ret)
+		module_put(THIS_MODULE);
+
+	return ret;
+}
+
+static void vfio_mdev_release(void *device_data)
+{
+	struct mdev_device *mdev = device_data;
+	struct parent_device *parent = mdev->parent;
+
+	if (parent->ops->release)
+		parent->ops->release(mdev);
+
+	module_put(THIS_MODULE);
+}
+
+static long vfio_mdev_unlocked_ioctl(void *device_data,
+				     unsigned int cmd, unsigned long arg)
+{
+	struct mdev_device *mdev = device_data;
+	struct parent_device *parent = mdev->parent;
+
+	if (unlikely(!parent->ops->ioctl))
+		return -EINVAL;
+
+	return parent->ops->ioctl(mdev, cmd, arg);
+}
+
+static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	struct mdev_device *mdev = device_data;
+	struct parent_device *parent = mdev->parent;
+
+	if (unlikely(!parent->ops->read))
+		return -EINVAL;
+
+	return parent->ops->read(mdev, buf, count, ppos);
+}
+
+static ssize_t vfio_mdev_write(void *device_data, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	struct mdev_device *mdev = device_data;
+	struct parent_device *parent = mdev->parent;
+
+	if (unlikely(!parent->ops->write))
+		return -EINVAL;
+
+	return parent->ops->write(mdev, buf, count, ppos);
+}
+
+static int vfio_mdev_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	struct mdev_device *mdev = device_data;
+	struct parent_device *parent = mdev->parent;
+
+	if (unlikely(!parent->ops->mmap))
+		return -EINVAL;
+
+	return parent->ops->mmap(mdev, vma);
+}
+
+static const struct vfio_device_ops vfio_mdev_dev_ops = {
+	.name		= "vfio-mdev",
+	.open		= vfio_mdev_open,
+	.release	= vfio_mdev_release,
+	.ioctl		= vfio_mdev_unlocked_ioctl,
+	.read		= vfio_mdev_read,
+	.write		= vfio_mdev_write,
+	.mmap		= vfio_mdev_mmap,
+};
+
+int vfio_mdev_probe(struct device *dev)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	return vfio_add_group_dev(dev, &vfio_mdev_dev_ops, mdev);
+}
+
+void vfio_mdev_remove(struct device *dev)
+{
+	vfio_del_group_dev(dev);
+}
+
+struct mdev_driver vfio_mdev_driver = {
+	.name	= "vfio_mdev",
+	.probe	= vfio_mdev_probe,
+	.remove	= vfio_mdev_remove,
+};
+
+static int __init vfio_mdev_init(void)
+{
+	return mdev_register_driver(&vfio_mdev_driver, THIS_MODULE);
+}
+
+static void __exit vfio_mdev_exit(void)
+{
+	mdev_unregister_driver(&vfio_mdev_driver);
+}
+
+module_init(vfio_mdev_init)
+module_exit(vfio_mdev_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
-- 
2.7.0


* [PATCH v9 03/12] vfio: Rearrange functions to get vfio_group from dev
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
  2016-10-17 21:22 ` [PATCH v9 01/12] vfio: Mediated device Core driver Kirti Wankhede
  2016-10-17 21:22 ` [PATCH v9 02/12] vfio: VFIO based driver for Mediated devices Kirti Wankhede
@ 2016-10-17 21:22 ` Kirti Wankhede
  2016-10-19 17:26   ` Alex Williamson
  2016-10-17 21:22 ` [PATCH v9 04/12] vfio iommu: Add support for mediated devices Kirti Wankhede
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-17 21:22 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, linux-kernel,
	Kirti Wankhede

Rearrange functions to have a common function to increment container_users.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I1f93262bdbab75094bc24b087b29da35ba70c4c6
---
 drivers/vfio/vfio.c | 57 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index d1d70e0b011b..2e83bdf007fe 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -480,6 +480,21 @@ static struct vfio_group *vfio_group_get_from_minor(int minor)
 	return group;
 }
 
+static struct vfio_group *vfio_group_get_from_dev(struct device *dev)
+{
+	struct iommu_group *iommu_group;
+	struct vfio_group *group;
+
+	iommu_group = iommu_group_get(dev);
+	if (!iommu_group)
+		return NULL;
+
+	group = vfio_group_get_from_iommu(iommu_group);
+	iommu_group_put(iommu_group);
+
+	return group;
+}
+
 /**
  * Device objects - create, release, get, put, search
  */
@@ -811,16 +826,10 @@ EXPORT_SYMBOL_GPL(vfio_add_group_dev);
  */
 struct vfio_device *vfio_device_get_from_dev(struct device *dev)
 {
-	struct iommu_group *iommu_group;
 	struct vfio_group *group;
 	struct vfio_device *device;
 
-	iommu_group = iommu_group_get(dev);
-	if (!iommu_group)
-		return NULL;
-
-	group = vfio_group_get_from_iommu(iommu_group);
-	iommu_group_put(iommu_group);
+	group = vfio_group_get_from_dev(dev);
 	if (!group)
 		return NULL;
 
@@ -1376,6 +1385,23 @@ static bool vfio_group_viable(struct vfio_group *group)
 					 group, vfio_dev_viable) == 0);
 }
 
+static int vfio_group_add_container_user(struct vfio_group *group)
+{
+	if (!atomic_inc_not_zero(&group->container_users))
+		return -EINVAL;
+
+	if (group->noiommu) {
+		atomic_dec(&group->container_users);
+		return -EPERM;
+	}
+	if (!group->container->iommu_driver || !vfio_group_viable(group)) {
+		atomic_dec(&group->container_users);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static const struct file_operations vfio_device_fops;
 
 static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
@@ -1685,23 +1711,14 @@ static const struct file_operations vfio_device_fops = {
 struct vfio_group *vfio_group_get_external_user(struct file *filep)
 {
 	struct vfio_group *group = filep->private_data;
+	int ret;
 
 	if (filep->f_op != &vfio_group_fops)
 		return ERR_PTR(-EINVAL);
 
-	if (!atomic_inc_not_zero(&group->container_users))
-		return ERR_PTR(-EINVAL);
-
-	if (group->noiommu) {
-		atomic_dec(&group->container_users);
-		return ERR_PTR(-EPERM);
-	}
-
-	if (!group->container->iommu_driver ||
-			!vfio_group_viable(group)) {
-		atomic_dec(&group->container_users);
-		return ERR_PTR(-EINVAL);
-	}
+	ret = vfio_group_add_container_user(group);
+	if (ret)
+		return ERR_PTR(ret);
 
 	vfio_group_get(group);
 
-- 
2.7.0


* [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
                   ` (2 preceding siblings ...)
  2016-10-17 21:22 ` [PATCH v9 03/12] vfio: Rearrange functions to get vfio_group from dev Kirti Wankhede
@ 2016-10-17 21:22 ` Kirti Wankhede
  2016-10-19 21:02   ` Alex Williamson
                     ` (2 more replies)
  2016-10-17 21:22 ` [PATCH v9 05/12] vfio: Introduce common function to add capabilities Kirti Wankhede
                   ` (9 subsequent siblings)
  13 siblings, 3 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-17 21:22 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, linux-kernel,
	Kirti Wankhede

VFIO IOMMU drivers are designed for devices which are IOMMU capable.
A mediated device only uses IOMMU APIs; the underlying hardware can be
managed by an IOMMU domain.

The aim of this change is:
- To reuse most of the TYPE1 IOMMU driver code for mediated devices
- To support direct assigned devices and mediated devices in a single module

Added two new callback functions to struct vfio_iommu_driver_ops. A
backend IOMMU module that supports pinning and unpinning pages for mdev
devices should provide these functions.
Added APIs for pinning and unpinning pages to the VFIO module. These call
back into the backend IOMMU module to actually pin and unpin pages.
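
As an illustration of the intended use, a hypothetical vendor-driver
helper that pins one guest PFN and later unpins it; the function, its
error handling and the IOMMU_READ/IOMMU_WRITE usage are assumptions for
this sketch, not part of this patch:

#include <linux/iommu.h>
#include <linux/mdev.h>
#include <linux/vfio.h>

static int sample_dma_map_one(struct mdev_device *mdev,
			      unsigned long user_pfn)
{
	unsigned long host_pfn;
	long ret;

	/* translate and pin one guest PFN; returns the number of pages pinned */
	ret = vfio_pin_pages(&mdev->dev, &user_pfn, 1,
			     IOMMU_READ | IOMMU_WRITE, &host_pfn);
	if (ret != 1)
		return ret < 0 ? ret : -EFAULT;

	/* ... program host_pfn << PAGE_SHIFT into the device ... */

	/* on teardown, drop the pin so page accounting is updated */
	vfio_unpin_pages(&mdev->dev, &host_pfn, 1);
	return 0;
}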

This change adds pin and unpin support for mediated devices to the TYPE1
IOMMU backend module. More details:
- When the iommu_group of a mediated device is attached, the task structure
  is cached and used later for pinning pages and page accounting.
- It keeps track of pinned pages for the mediated domain. This data is used
  to verify unpinning requests and to unpin any remaining pages while
  detaching.
- Used the existing mechanism for page accounting. If an IOMMU capable
  domain exists in the container then all pages are already pinned and
  accounted. Accounting for an mdev device is only done if there is no
  IOMMU capable domain in the container.
- Page accounting is updated on hot plug and unplug of mdev devices and
  pass-through devices.

Tested by assigning the below combinations of devices to a single VM:
- GPU pass-through only
- vGPU device only
- One GPU pass-through and one vGPU device
- Linux VM hot plug and unplug of a vGPU device while a GPU pass-through
  device exists
- Linux VM hot plug and unplug of a GPU pass-through device while a vGPU
  device exists

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
---
 drivers/vfio/vfio.c             |  98 ++++++
 drivers/vfio/vfio_iommu_type1.c | 692 ++++++++++++++++++++++++++++++++++------
 include/linux/vfio.h            |  13 +-
 3 files changed, 707 insertions(+), 96 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 2e83bdf007fe..a5a210005b65 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1799,6 +1799,104 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
 }
 EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
 
+
+/*
+ * Pin a set of guest PFNs and return their associated host PFNs for local
+ * domain only.
+ * @dev [in] : device
+ * @user_pfn [in]: array of user/guest PFNs
+ * @npage [in]: count of array elements
+ * @prot [in] : protection flags
+ * @phys_pfn[out] : array of host PFNs
+ */
+long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+		    long npage, int prot, unsigned long *phys_pfn)
+{
+	struct vfio_container *container;
+	struct vfio_group *group;
+	struct vfio_iommu_driver *driver;
+	ssize_t ret = -EINVAL;
+
+	if (!dev || !user_pfn || !phys_pfn)
+		return -EINVAL;
+
+	group = vfio_group_get_from_dev(dev);
+	if (!group)
+		return -ENODEV;
+
+	ret = vfio_group_add_container_user(group);
+	if (ret)
+		goto err_pin_pages;
+
+	container = group->container;
+	if (IS_ERR(container)) {
+		ret = PTR_ERR(container);
+		goto err_pin_pages;
+	}
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->pin_pages))
+		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
+					     npage, prot, phys_pfn);
+
+	up_read(&container->group_lock);
+	vfio_group_try_dissolve_container(group);
+
+err_pin_pages:
+	vfio_group_put(group);
+	return ret;
+
+}
+EXPORT_SYMBOL(vfio_pin_pages);
+
+/*
+ * Unpin set of host PFNs for local domain only.
+ * @dev [in] : device
+ * @pfn [in] : array of host PFNs to be unpinned.
+ * @npage [in] :count of elements in array, that is number of pages.
+ */
+long vfio_unpin_pages(struct device *dev, unsigned long *pfn, long npage)
+{
+	struct vfio_container *container;
+	struct vfio_group *group;
+	struct vfio_iommu_driver *driver;
+	ssize_t ret = -EINVAL;
+
+	if (!dev || !pfn)
+		return -EINVAL;
+
+	group = vfio_group_get_from_dev(dev);
+	if (!group)
+		return -ENODEV;
+
+	ret = vfio_group_add_container_user(group);
+	if (ret)
+		goto err_unpin_pages;
+
+	container = group->container;
+	if (IS_ERR(container)) {
+		ret = PTR_ERR(container);
+		goto err_unpin_pages;
+	}
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->unpin_pages))
+		ret = driver->ops->unpin_pages(container->iommu_data, pfn,
+					       npage);
+
+	up_read(&container->group_lock);
+	vfio_group_try_dissolve_container(group);
+
+err_unpin_pages:
+	vfio_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_unpin_pages);
+
 /**
  * Module/class support
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ba19424e4a1..5d67058a611d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -55,16 +55,24 @@ MODULE_PARM_DESC(disable_hugepages,
 
 struct vfio_iommu {
 	struct list_head	domain_list;
+	struct vfio_domain	*local_domain;
 	struct mutex		lock;
 	struct rb_root		dma_list;
 	bool			v2;
 	bool			nesting;
 };
 
+struct local_addr_space {
+	struct task_struct	*task;
+	struct rb_root		pfn_list;	/* pinned Host pfn list */
+	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
+};
+
 struct vfio_domain {
 	struct iommu_domain	*domain;
 	struct list_head	next;
 	struct list_head	group_list;
+	struct local_addr_space	*local_addr_space;
 	int			prot;		/* IOMMU_CACHE */
 	bool			fgsp;		/* Fine-grained super pages */
 };
@@ -75,6 +83,7 @@ struct vfio_dma {
 	unsigned long		vaddr;		/* Process virtual addr */
 	size_t			size;		/* Map size (bytes) */
 	int			prot;		/* IOMMU_READ/WRITE */
+	bool			iommu_mapped;
 };
 
 struct vfio_group {
@@ -83,6 +92,21 @@ struct vfio_group {
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_pfn {
+	struct rb_node		node;
+	unsigned long		vaddr;		/* virtual addr */
+	dma_addr_t		iova;		/* IOVA */
+	unsigned long		pfn;		/* Host pfn */
+	int			prot;
+	atomic_t		ref_count;
+};
+
+#define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
+					(!list_empty(&iommu->domain_list))
+
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +154,101 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
 	rb_erase(&old->node, &iommu->dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
+				      unsigned long pfn)
+{
+	struct rb_node *node;
+	struct vfio_pfn *vpfn;
+
+	node = domain->local_addr_space->pfn_list.rb_node;
+
+	while (node) {
+		vpfn = rb_entry(node, struct vfio_pfn, node);
+
+		if (pfn < vpfn->pfn)
+			node = node->rb_left;
+		else if (pfn > vpfn->pfn)
+			node = node->rb_right;
+		else
+			return vpfn;
+	}
+
+	return NULL;
+}
+
+static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
+{
+	struct rb_node **link, *parent = NULL;
+	struct vfio_pfn *vpfn;
+
+	link = &domain->local_addr_space->pfn_list.rb_node;
+	while (*link) {
+		parent = *link;
+		vpfn = rb_entry(parent, struct vfio_pfn, node);
+
+		if (new->pfn < vpfn->pfn)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &domain->local_addr_space->pfn_list);
+}
+
+static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
+{
+	rb_erase(&old->node, &domain->local_addr_space->pfn_list);
+}
+
+static int vfio_add_to_pfn_list(struct vfio_domain *domain, unsigned long vaddr,
+				dma_addr_t iova, unsigned long pfn, int prot)
+{
+	struct vfio_pfn *vpfn;
+
+	vpfn = kzalloc(sizeof(*vpfn), GFP_KERNEL);
+	if (!vpfn)
+		return -ENOMEM;
+
+	vpfn->vaddr = vaddr;
+	vpfn->iova = iova;
+	vpfn->pfn = pfn;
+	vpfn->prot = prot;
+	atomic_set(&vpfn->ref_count, 1);
+	vfio_link_pfn(domain, vpfn);
+	return 0;
+}
+
+static void vfio_remove_from_pfn_list(struct vfio_domain *domain,
+				      struct vfio_pfn *vpfn)
+{
+	vfio_unlink_pfn(domain, vpfn);
+	kfree(vpfn);
+}
+
+static int vfio_pfn_account(struct vfio_iommu *iommu, unsigned long pfn)
+{
+	struct vfio_pfn *p;
+	struct vfio_domain *domain = iommu->local_domain;
+	int ret = 1;
+
+	if (!domain)
+		return 1;
+
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+	p = vfio_find_pfn(domain, pfn);
+	if (p)
+		ret = 0;
+
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+	return ret;
+}
+
 struct vwork {
 	struct mm_struct	*mm;
 	long			npage;
@@ -150,17 +269,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
 	kfree(vwork);
 }
 
-static void vfio_lock_acct(long npage)
+static void vfio_lock_acct(struct task_struct *task, long npage)
 {
 	struct vwork *vwork;
 	struct mm_struct *mm;
 
-	if (!current->mm || !npage)
+	if (!task->mm || !npage)
 		return; /* process exited or nothing to do */
 
-	if (down_write_trylock(&current->mm->mmap_sem)) {
-		current->mm->locked_vm += npage;
-		up_write(&current->mm->mmap_sem);
+	if (down_write_trylock(&task->mm->mmap_sem)) {
+		task->mm->locked_vm += npage;
+		up_write(&task->mm->mmap_sem);
 		return;
 	}
 
@@ -172,7 +291,7 @@ static void vfio_lock_acct(long npage)
 	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
 	if (!vwork)
 		return;
-	mm = get_task_mm(current);
+	mm = get_task_mm(task);
 	if (!mm) {
 		kfree(vwork);
 		return;
@@ -228,20 +347,31 @@ static int put_pfn(unsigned long pfn, int prot)
 	return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+			 int prot, unsigned long *pfn)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct mm_struct *local_mm = (mm ? mm : current->mm);
 	int ret = -EFAULT;
 
-	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+	if (mm) {
+		down_read(&local_mm->mmap_sem);
+		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
+					!!(prot & IOMMU_WRITE), 0, page, NULL);
+		up_read(&local_mm->mmap_sem);
+	} else
+		ret = get_user_pages_fast(vaddr, 1,
+					  !!(prot & IOMMU_WRITE), page);
+
+	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
 		return 0;
 	}
 
-	down_read(&current->mm->mmap_sem);
+	down_read(&local_mm->mmap_sem);
 
-	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
 		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
@@ -249,7 +379,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
 			ret = 0;
 	}
 
-	up_read(&current->mm->mmap_sem);
+	up_read(&local_mm->mmap_sem);
 
 	return ret;
 }
@@ -259,33 +389,37 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages(unsigned long vaddr, long npage,
-			   int prot, unsigned long *pfn_base)
+static long __vfio_pin_pages_remote(struct vfio_iommu *iommu,
+				    unsigned long vaddr, long npage,
+				    int prot, unsigned long *pfn_base)
 {
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	bool lock_cap = capable(CAP_IPC_LOCK);
-	long ret, i;
+	long ret, i, lock_acct = 0;
 	bool rsvd;
 
 	if (!current->mm)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
+	ret = vaddr_get_pfn(NULL, vaddr, prot, pfn_base);
 	if (ret)
 		return ret;
 
+	lock_acct = vfio_pfn_account(iommu, *pfn_base);
+
 	rsvd = is_invalid_reserved_pfn(*pfn_base);
 
-	if (!rsvd && !lock_cap && current->mm->locked_vm + 1 > limit) {
+	if (!rsvd && !lock_cap && current->mm->locked_vm + lock_acct > limit) {
 		put_pfn(*pfn_base, prot);
 		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
 			limit << PAGE_SHIFT);
 		return -ENOMEM;
 	}
 
+
 	if (unlikely(disable_hugepages)) {
 		if (!rsvd)
-			vfio_lock_acct(1);
+			vfio_lock_acct(current, lock_acct);
 		return 1;
 	}
 
@@ -293,7 +427,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
 		unsigned long pfn = 0;
 
-		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		ret = vaddr_get_pfn(NULL, vaddr, prot, &pfn);
 		if (ret)
 			break;
 
@@ -303,8 +437,10 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 			break;
 		}
 
+		lock_acct += vfio_pfn_account(iommu, pfn);
+
 		if (!rsvd && !lock_cap &&
-		    current->mm->locked_vm + i + 1 > limit) {
+		    current->mm->locked_vm + lock_acct > limit) {
 			put_pfn(pfn, prot);
 			pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
 				__func__, limit << PAGE_SHIFT);
@@ -313,23 +449,216 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	}
 
 	if (!rsvd)
-		vfio_lock_acct(i);
+		vfio_lock_acct(current, lock_acct);
 
 	return i;
 }
 
-static long vfio_unpin_pages(unsigned long pfn, long npage,
-			     int prot, bool do_accounting)
+static long __vfio_unpin_pages_remote(struct vfio_iommu *iommu,
+				      unsigned long pfn, long npage, int prot,
+				      bool do_accounting)
 {
-	unsigned long unlocked = 0;
+	unsigned long unlocked = 0, unlock_acct = 0;
 	long i;
 
-	for (i = 0; i < npage; i++)
+	for (i = 0; i < npage; i++) {
+		if (do_accounting)
+			unlock_acct += vfio_pfn_account(iommu, pfn);
+
 		unlocked += put_pfn(pfn++, prot);
+	}
 
 	if (do_accounting)
-		vfio_lock_acct(-unlocked);
+		vfio_lock_acct(current, -unlock_acct);
+
+	return unlocked;
+}
+
+static long __vfio_pin_page_local(struct vfio_domain *domain,
+				  unsigned long vaddr, int prot,
+				  unsigned long *pfn_base,
+				  bool do_accounting)
+{
+	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	bool lock_cap = capable(CAP_IPC_LOCK);
+	long ret;
+	bool rsvd;
+	struct task_struct *task = domain->local_addr_space->task;
+
+	if (!task->mm)
+		return -ENODEV;
+
+	ret = vaddr_get_pfn(task->mm, vaddr, prot, pfn_base);
+	if (ret)
+		return ret;
+
+	rsvd = is_invalid_reserved_pfn(*pfn_base);
+
+	if (!rsvd && !lock_cap && task->mm->locked_vm + 1 > limit) {
+		put_pfn(*pfn_base, prot);
+		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
+			limit << PAGE_SHIFT);
+		return -ENOMEM;
+	}
+
+	if (!rsvd && do_accounting)
+		vfio_lock_acct(task, 1);
+
+	return 1;
+}
+
+static void __vfio_unpin_page_local(struct vfio_domain *domain,
+				    unsigned long pfn, int prot,
+				    bool do_accounting)
+{
+	put_pfn(pfn, prot);
+
+	if (do_accounting)
+		vfio_lock_acct(domain->local_addr_space->task, -1);
+}
+
+static int vfio_unpin_pfn(struct vfio_domain *domain,
+			  struct vfio_pfn *vpfn, bool do_accounting)
+{
+	__vfio_unpin_page_local(domain, vpfn->pfn, vpfn->prot,
+				do_accounting);
+
+	if (atomic_dec_and_test(&vpfn->ref_count))
+		vfio_remove_from_pfn_list(domain, vpfn);
+
+	return 1;
+}
+
+static long vfio_iommu_type1_pin_pages(void *iommu_data,
+				       unsigned long *user_pfn,
+				       long npage, int prot,
+				       unsigned long *phys_pfn)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain;
+	int i, j, ret;
+	long retpage;
+	unsigned long remote_vaddr;
+	unsigned long *pfn = phys_pfn;
+	struct vfio_dma *dma;
+	bool do_accounting;
+
+	if (!iommu || !user_pfn || !phys_pfn)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (!iommu->local_domain) {
+		ret = -EINVAL;
+		goto pin_done;
+	}
+
+	domain = iommu->local_domain;
+
+	/*
+	 * If an iommu capable domain exists in the container, then all pages
+	 * are already pinned and accounted. Accounting should be done only if
+	 * there is no iommu capable domain in the container.
+	 */
+	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+		dma_addr_t iova;
+
+		iova = user_pfn[i] << PAGE_SHIFT;
+
+		dma = vfio_find_dma(iommu, iova, 0);
+		if (!dma) {
+			ret = -EINVAL;
+			goto pin_unwind;
+		}
+
+		remote_vaddr = dma->vaddr + iova - dma->iova;
+
+		retpage = __vfio_pin_page_local(domain, remote_vaddr, prot,
+						&pfn[i], do_accounting);
+		if (retpage <= 0) {
+			WARN_ON(!retpage);
+			ret = (int)retpage;
+			goto pin_unwind;
+		}
+
+		mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+		/* search if pfn already exists */
+		p = vfio_find_pfn(domain, pfn[i]);
+		if (p) {
+			atomic_inc(&p->ref_count);
+			mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+			continue;
+		}
+
+		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
+					   pfn[i], prot);
+		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+
+		if (ret) {
+			__vfio_unpin_page_local(domain, pfn[i], prot,
+						do_accounting);
+			goto pin_unwind;
+		}
+	}
+
+	ret = i;
+	goto pin_done;
+
+pin_unwind:
+	pfn[i] = 0;
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+	for (j = 0; j < i; j++) {
+		struct vfio_pfn *p;
+
+		p = vfio_find_pfn(domain, pfn[j]);
+		if (p)
+			vfio_unpin_pfn(domain, p, do_accounting);
+
+		pfn[j] = 0;
+	}
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+
+pin_done:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
+					 long npage)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	bool do_accounting;
+	long unlocked = 0;
+	int i;
+
+	if (!iommu || !pfn)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	domain = iommu->local_domain;
+
+	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
+
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
 
+		/* verify if pfn exists in pfn_list */
+		p = vfio_find_pfn(domain, pfn[i]);
+		if (p)
+			unlocked += vfio_unpin_pfn(domain, p, do_accounting);
+
+	}
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+
+	mutex_unlock(&iommu->lock);
 	return unlocked;
 }
 
@@ -341,6 +670,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 
 	if (!dma->size)
 		return;
+
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+		return;
+
 	/*
 	 * We use the IOMMU to track the physical addresses, otherwise we'd
 	 * need a much more complicated tracking system.  Unfortunately that
@@ -382,15 +715,16 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		if (WARN_ON(!unmapped))
 			break;
 
-		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
-					     unmapped >> PAGE_SHIFT,
-					     dma->prot, false);
+		unlocked += __vfio_unpin_pages_remote(iommu, phys >> PAGE_SHIFT,
+						      unmapped >> PAGE_SHIFT,
+						      dma->prot, false);
 		iova += unmapped;
 
 		cond_resched();
 	}
 
-	vfio_lock_acct(-unlocked);
+	dma->iommu_mapped = false;
+	vfio_lock_acct(current, -unlocked);
 }
 
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
@@ -558,17 +892,57 @@ unwind:
 	return ret;
 }
 
+static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
+			    size_t map_size)
+{
+	dma_addr_t iova = dma->iova;
+	unsigned long vaddr = dma->vaddr;
+	size_t size = map_size;
+	long npage;
+	unsigned long pfn;
+	int ret = 0;
+
+	while (size) {
+		/* Pin a contiguous chunk of memory */
+		npage = __vfio_pin_pages_remote(iommu, vaddr + dma->size,
+						size >> PAGE_SHIFT, dma->prot,
+						&pfn);
+		if (npage <= 0) {
+			WARN_ON(!npage);
+			ret = (int)npage;
+			break;
+		}
+
+		/* Map it! */
+		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
+				     dma->prot);
+		if (ret) {
+			__vfio_unpin_pages_remote(iommu, pfn, npage, dma->prot,
+						  true);
+			break;
+		}
+
+		size -= npage << PAGE_SHIFT;
+		dma->size += npage << PAGE_SHIFT;
+	}
+
+	dma->iommu_mapped = true;
+
+	if (ret)
+		vfio_remove_dma(iommu, dma);
+
+	return ret;
+}
+
 static int vfio_dma_do_map(struct vfio_iommu *iommu,
 			   struct vfio_iommu_type1_dma_map *map)
 {
 	dma_addr_t iova = map->iova;
 	unsigned long vaddr = map->vaddr;
 	size_t size = map->size;
-	long npage;
 	int ret = 0, prot = 0;
 	uint64_t mask;
 	struct vfio_dma *dma;
-	unsigned long pfn;
 
 	/* Verify that none of our __u64 fields overflow */
 	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
@@ -611,29 +985,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	/* Insert zero-sized and grow as we map chunks of it */
 	vfio_link_dma(iommu, dma);
 
-	while (size) {
-		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages(vaddr + dma->size,
-				       size >> PAGE_SHIFT, prot, &pfn);
-		if (npage <= 0) {
-			WARN_ON(!npage);
-			ret = (int)npage;
-			break;
-		}
-
-		/* Map it! */
-		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
-		if (ret) {
-			vfio_unpin_pages(pfn, npage, prot, true);
-			break;
-		}
-
-		size -= npage << PAGE_SHIFT;
-		dma->size += npage << PAGE_SHIFT;
-	}
-
-	if (ret)
-		vfio_remove_dma(iommu, dma);
+	/* Don't pin and map if container doesn't contain IOMMU capable domain */
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+		dma->size = size;
+	else
+		ret = vfio_pin_map_dma(iommu, dma, size);
 
 	mutex_unlock(&iommu->lock);
 	return ret;
@@ -662,10 +1018,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
 	n = rb_first(&iommu->dma_list);
 
-	/* If there's not a domain, there better not be any mappings */
-	if (WARN_ON(n && !d))
-		return -EINVAL;
-
 	for (; n; n = rb_next(n)) {
 		struct vfio_dma *dma;
 		dma_addr_t iova;
@@ -674,20 +1026,43 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 		iova = dma->iova;
 
 		while (iova < dma->iova + dma->size) {
-			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
+			phys_addr_t phys;
 			size_t size;
 
-			if (WARN_ON(!phys)) {
-				iova += PAGE_SIZE;
-				continue;
-			}
+			if (dma->iommu_mapped) {
+				phys = iommu_iova_to_phys(d->domain, iova);
+
+				if (WARN_ON(!phys)) {
+					iova += PAGE_SIZE;
+					continue;
+				}
 
-			size = PAGE_SIZE;
+				size = PAGE_SIZE;
 
-			while (iova + size < dma->iova + dma->size &&
-			       phys + size == iommu_iova_to_phys(d->domain,
+				while (iova + size < dma->iova + dma->size &&
+				    phys + size == iommu_iova_to_phys(d->domain,
 								 iova + size))
-				size += PAGE_SIZE;
+					size += PAGE_SIZE;
+			} else {
+				unsigned long pfn;
+				unsigned long vaddr = dma->vaddr +
+						     (iova - dma->iova);
+				size_t n = dma->iova + dma->size - iova;
+				long npage;
+
+				npage = __vfio_pin_pages_remote(iommu, vaddr,
+								n >> PAGE_SHIFT,
+								dma->prot,
+								&pfn);
+				if (npage <= 0) {
+					WARN_ON(!npage);
+					ret = (int)npage;
+					return ret;
+				}
+
+				phys = pfn << PAGE_SHIFT;
+				size = npage << PAGE_SHIFT;
+			}
 
 			ret = iommu_map(domain->domain, iova, phys,
 					size, dma->prot | domain->prot);
@@ -696,6 +1071,8 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 
 			iova += size;
 		}
+
+		dma->iommu_mapped = true;
 	}
 
 	return 0;
@@ -734,11 +1111,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	__free_pages(pages, order);
 }
 
+static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
+				   struct iommu_group *iommu_group)
+{
+	struct vfio_group *g;
+
+	list_for_each_entry(g, &domain->group_list, next) {
+		if (g->iommu_group == iommu_group)
+			return g;
+	}
+
+	return NULL;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
-	struct vfio_group *group, *g;
+	struct vfio_group *group;
 	struct vfio_domain *domain, *d;
 	struct bus_type *bus = NULL;
 	int ret;
@@ -746,10 +1136,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	mutex_lock(&iommu->lock);
 
 	list_for_each_entry(d, &iommu->domain_list, next) {
-		list_for_each_entry(g, &d->group_list, next) {
-			if (g->iommu_group != iommu_group)
-				continue;
+		if (find_iommu_group(d, iommu_group)) {
+			mutex_unlock(&iommu->lock);
+			return -EINVAL;
+		}
+	}
 
+	if (iommu->local_domain) {
+		if (find_iommu_group(iommu->local_domain, iommu_group)) {
 			mutex_unlock(&iommu->lock);
 			return -EINVAL;
 		}
@@ -769,6 +1163,30 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_free;
 
+	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
+	    (bus == &mdev_bus_type)) {
+		if (!iommu->local_domain) {
+			domain->local_addr_space =
+				kzalloc(sizeof(*domain->local_addr_space),
+						GFP_KERNEL);
+			if (!domain->local_addr_space) {
+				ret = -ENOMEM;
+				goto out_free;
+			}
+
+			domain->local_addr_space->task = current;
+			INIT_LIST_HEAD(&domain->group_list);
+			domain->local_addr_space->pfn_list = RB_ROOT;
+			mutex_init(&domain->local_addr_space->pfn_list_lock);
+			iommu->local_domain = domain;
+		} else
+			kfree(domain);
+
+		list_add(&group->next, &domain->group_list);
+		mutex_unlock(&iommu->lock);
+		return 0;
+	}
+
 	domain->domain = iommu_domain_alloc(bus);
 	if (!domain->domain) {
 		ret = -EIO;
@@ -859,6 +1277,41 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
 		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
 
+static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *domain = iommu->local_domain;
+	struct vfio_dma *dma, *tdma;
+	struct rb_node *n;
+	long locked = 0;
+
+	rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
+					     node) {
+		vfio_unmap_unpin(iommu, dma);
+	}
+
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+	n = rb_first(&domain->local_addr_space->pfn_list);
+
+	for (; n; n = rb_next(n))
+		locked++;
+
+	vfio_lock_acct(domain->local_addr_space->task, locked);
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+}
+
+static void vfio_local_unpin_all(struct vfio_domain *domain)
+{
+	struct rb_node *node;
+
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+	while ((node = rb_first(&domain->local_addr_space->pfn_list)))
+		vfio_unpin_pfn(domain,
+				rb_entry(node, struct vfio_pfn, node), false);
+
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -868,31 +1321,57 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(domain, &iommu->domain_list, next) {
-		list_for_each_entry(group, &domain->group_list, next) {
-			if (group->iommu_group != iommu_group)
-				continue;
-
-			iommu_detach_group(domain->domain, iommu_group);
+	if (iommu->local_domain) {
+		domain = iommu->local_domain;
+		group = find_iommu_group(domain, iommu_group);
+		if (group) {
 			list_del(&group->next);
 			kfree(group);
-			/*
-			 * Group ownership provides privilege, if the group
-			 * list is empty, the domain goes away.  If it's the
-			 * last domain, then all the mappings go away too.
-			 */
+
 			if (list_empty(&domain->group_list)) {
-				if (list_is_singular(&iommu->domain_list))
+				vfio_local_unpin_all(domain);
+				if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
 					vfio_iommu_unmap_unpin_all(iommu);
-				iommu_domain_free(domain->domain);
-				list_del(&domain->next);
 				kfree(domain);
+				iommu->local_domain = NULL;
+			}
+			goto detach_group_done;
+		}
+	}
+
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+		goto detach_group_done;
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		group = find_iommu_group(domain, iommu_group);
+		if (!group)
+			continue;
+
+		iommu_detach_group(domain->domain, iommu_group);
+		list_del(&group->next);
+		kfree(group);
+		/*
+		 * Group ownership provides privilege; if the group list is
+		 * empty, the domain goes away. If it's the last iommu capable
+		 * domain and a local domain doesn't exist, then all the
+		 * mappings go away too. If it's the last iommu capable domain
+		 * and a local domain exists, update accounting.
+		 */
+		if (list_empty(&domain->group_list)) {
+			if (list_is_singular(&iommu->domain_list)) {
+				if (!iommu->local_domain)
+					vfio_iommu_unmap_unpin_all(iommu);
+				else
+					vfio_iommu_unmap_unpin_reaccount(iommu);
 			}
-			goto done;
+			iommu_domain_free(domain->domain);
+			list_del(&domain->next);
+			kfree(domain);
 		}
+		break;
 	}
 
-done:
+detach_group_done:
 	mutex_unlock(&iommu->lock);
 }
 
@@ -924,27 +1403,48 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 	return iommu;
 }
 
+static void vfio_release_domain(struct vfio_domain *domain)
+{
+	struct vfio_group *group, *group_tmp;
+
+	list_for_each_entry_safe(group, group_tmp,
+				 &domain->group_list, next) {
+		if (!domain->local_addr_space)
+			iommu_detach_group(domain->domain, group->iommu_group);
+		list_del(&group->next);
+		kfree(group);
+	}
+
+	if (domain->local_addr_space)
+		vfio_local_unpin_all(domain);
+	else
+		iommu_domain_free(domain->domain);
+}
+
 static void vfio_iommu_type1_release(void *iommu_data)
 {
 	struct vfio_iommu *iommu = iommu_data;
 	struct vfio_domain *domain, *domain_tmp;
-	struct vfio_group *group, *group_tmp;
+
+	if (iommu->local_domain) {
+		vfio_release_domain(iommu->local_domain);
+		kfree(iommu->local_domain);
+		iommu->local_domain = NULL;
+	}
 
 	vfio_iommu_unmap_unpin_all(iommu);
 
+	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
+		goto release_exit;
+
 	list_for_each_entry_safe(domain, domain_tmp,
 				 &iommu->domain_list, next) {
-		list_for_each_entry_safe(group, group_tmp,
-					 &domain->group_list, next) {
-			iommu_detach_group(domain->domain, group->iommu_group);
-			list_del(&group->next);
-			kfree(group);
-		}
-		iommu_domain_free(domain->domain);
+		vfio_release_domain(domain);
 		list_del(&domain->next);
 		kfree(domain);
 	}
 
+release_exit:
 	kfree(iommu);
 }
 
@@ -1048,6 +1548,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.ioctl		= vfio_iommu_type1_ioctl,
 	.attach_group	= vfio_iommu_type1_attach_group,
 	.detach_group	= vfio_iommu_type1_detach_group,
+	.pin_pages	= vfio_iommu_type1_pin_pages,
+	.unpin_pages	= vfio_iommu_type1_unpin_pages,
 };
 
 static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b1cd34..0bd25ba6223d 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -17,6 +17,7 @@
 #include <linux/workqueue.h>
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
+#include <linux/mdev.h>
 
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
@@ -75,7 +76,11 @@ struct vfio_iommu_driver_ops {
 					struct iommu_group *group);
 	void		(*detach_group)(void *iommu_data,
 					struct iommu_group *group);
-
+	long		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
+				     long npage, int prot,
+				     unsigned long *phys_pfn);
+	long		(*unpin_pages)(void *iommu_data, unsigned long *pfn,
+				       long npage);
 };
 
 extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
@@ -127,6 +132,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
 }
 #endif /* CONFIG_EEH */
 
+extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+			   long npage, int prot, unsigned long *phys_pfn);
+
+extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
+			     long npage);
+
 /*
  * IRQfd - generic
  */
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 05/12] vfio: Introduce common function to add capabilities
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
                   ` (3 preceding siblings ...)
  2016-10-17 21:22 ` [PATCH v9 04/12] vfio iommu: Add support for mediated devices Kirti Wankhede
@ 2016-10-17 21:22 ` Kirti Wankhede
  2016-10-20 19:24   ` Alex Williamson
  2016-10-17 21:22 ` [PATCH v9 06/12] vfio_pci: Update vfio_pci to use vfio_info_add_capability() Kirti Wankhede
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-17 21:22 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, linux-kernel,
	Kirti Wankhede

Vendor drivers using the mediated device framework should use
vfio_info_add_capability() to add capabilities. Introduce this function to
reduce code duplication across vendor drivers.
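
A minimal sketch of the intended call pattern, mirroring the vfio_pci
conversion later in this series ('info' is the caller's struct
vfio_region_info; the region type values are placeholders):

	struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
	struct vfio_region_info_cap_type cap_type;
	int ret;

	cap_type.type = region_type;		/* placeholder */
	cap_type.subtype = region_subtype;	/* placeholder */

	ret = vfio_info_add_capability(&info, &caps,
				       VFIO_REGION_INFO_CAP_TYPE, &cap_type);
	if (ret)
		return ret;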

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I6fca329fa2291f37a2c859d0bc97574d9e2ce1a6
---
 drivers/vfio/vfio.c  | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h |  4 +++
 2 files changed, 82 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index a5a210005b65..e96cb3f7a23c 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1799,6 +1799,84 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
 }
 EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
 
+static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
+{
+	struct vfio_info_cap_header *header;
+	struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
+	size_t size;
+
+	size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
+	header = vfio_info_cap_add(caps, size,
+				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
+	if (IS_ERR(header))
+		return PTR_ERR(header);
+
+	sparse_cap = container_of(header,
+			struct vfio_region_info_cap_sparse_mmap, header);
+	sparse_cap->nr_areas = sparse->nr_areas;
+	memcpy(sparse_cap->areas, sparse->areas,
+	       sparse->nr_areas * sizeof(*sparse->areas));
+	return 0;
+}
+
+static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
+{
+	struct vfio_info_cap_header *header;
+	struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
+
+	header = vfio_info_cap_add(caps, sizeof(*cap),
+				   VFIO_REGION_INFO_CAP_TYPE, 1);
+	if (IS_ERR(header))
+		return PTR_ERR(header);
+
+	type_cap = container_of(header, struct vfio_region_info_cap_type,
+				header);
+	type_cap->type = cap->type;
+	type_cap->subtype = cap->subtype;
+	return 0;
+}
+
+int vfio_info_add_capability(struct vfio_region_info *info,
+			     struct vfio_info_cap *caps,
+			     int cap_type_id,
+			     void *cap_type)
+{
+	int ret;
+
+	if (!cap_type)
+		return 0;
+
+	switch (cap_type_id) {
+	case VFIO_REGION_INFO_CAP_SPARSE_MMAP:
+		ret = sparse_mmap_cap(caps, cap_type);
+		if (ret)
+			return ret;
+		break;
+
+	case VFIO_REGION_INFO_CAP_TYPE:
+		ret = region_type_cap(caps, cap_type);
+		if (ret)
+			return ret;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	info->flags |= VFIO_REGION_INFO_FLAG_CAPS;
+
+	if (caps->size) {
+		if (info->argsz < sizeof(*info) + caps->size) {
+			info->argsz = sizeof(*info) + caps->size;
+			info->cap_offset = 0;
+		} else {
+			vfio_info_cap_shift(caps, sizeof(*info));
+			info->cap_offset = sizeof(*info);
+		}
+	}
+	return 0;
+}
+EXPORT_SYMBOL(vfio_info_add_capability);
+
 
 /*
  * Pin a set of guest PFNs and return their associated host PFNs for local
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0bd25ba6223d..854a4b40be02 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -108,6 +108,10 @@ extern struct vfio_info_cap_header *vfio_info_cap_add(
 		struct vfio_info_cap *caps, size_t size, u16 id, u16 version);
 extern void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset);
 
+extern int vfio_info_add_capability(struct vfio_region_info *info,
+				    struct vfio_info_cap *caps,
+				    int cap_type_id, void *cap_type);
+
 struct pci_dev;
 #ifdef CONFIG_EEH
 extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 06/12] vfio_pci: Update vfio_pci to use vfio_info_add_capability()
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
                   ` (4 preceding siblings ...)
  2016-10-17 21:22 ` [PATCH v9 05/12] vfio: Introduce common function to add capabilities Kirti Wankhede
@ 2016-10-17 21:22 ` Kirti Wankhede
  2016-10-20 19:24   ` Alex Williamson
  2016-10-17 21:22 ` [PATCH v9 07/12] vfio: Introduce vfio_set_irqs_validate_and_prepare() Kirti Wankhede
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-17 21:22 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, linux-kernel,
	Kirti Wankhede

Update msix_sparse_mmap_cap() to use vfio_info_add_capability().
Update the region type capability to use vfio_info_add_capability().
This commit can't be split between the MSI-X and region_type capabilities
because common code needs to be updated for both cases.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I52bb28c7875a6da5a79ddad1843e6088aff58a45
---
 drivers/vfio/pci/vfio_pci.c | 72 +++++++++++++++++----------------------------
 1 file changed, 27 insertions(+), 45 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index d624a527777f..1ec0565b48ea 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -556,12 +556,12 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
 }
 
 static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
+				struct vfio_region_info *info,
 				struct vfio_info_cap *caps)
 {
-	struct vfio_info_cap_header *header;
 	struct vfio_region_info_cap_sparse_mmap *sparse;
 	size_t end, size;
-	int nr_areas = 2, i = 0;
+	int nr_areas = 2, i = 0, ret;
 
 	end = pci_resource_len(vdev->pdev, vdev->msix_bar);
 
@@ -572,13 +572,10 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
 
 	size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
 
-	header = vfio_info_cap_add(caps, size,
-				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
-	if (IS_ERR(header))
-		return PTR_ERR(header);
+	sparse = kzalloc(size, GFP_KERNEL);
+	if (!sparse)
+		return -ENOMEM;
 
-	sparse = container_of(header,
-			      struct vfio_region_info_cap_sparse_mmap, header);
 	sparse->nr_areas = nr_areas;
 
 	if (vdev->msix_offset & PAGE_MASK) {
@@ -594,26 +591,11 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
 		i++;
 	}
 
-	return 0;
-}
-
-static int region_type_cap(struct vfio_pci_device *vdev,
-			   struct vfio_info_cap *caps,
-			   unsigned int type, unsigned int subtype)
-{
-	struct vfio_info_cap_header *header;
-	struct vfio_region_info_cap_type *cap;
-
-	header = vfio_info_cap_add(caps, sizeof(*cap),
-				   VFIO_REGION_INFO_CAP_TYPE, 1);
-	if (IS_ERR(header))
-		return PTR_ERR(header);
+	ret = vfio_info_add_capability(info, caps,
+				      VFIO_REGION_INFO_CAP_SPARSE_MMAP, sparse);
+	kfree(sparse);
 
-	cap = container_of(header, struct vfio_region_info_cap_type, header);
-	cap->type = type;
-	cap->subtype = subtype;
-
-	return 0;
+	return ret;
 }
 
 int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
@@ -704,7 +686,8 @@ static long vfio_pci_ioctl(void *device_data,
 			if (vdev->bar_mmap_supported[info.index]) {
 				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
 				if (info.index == vdev->msix_bar) {
-					ret = msix_sparse_mmap_cap(vdev, &caps);
+					ret = msix_sparse_mmap_cap(vdev, &info,
+								   &caps);
 					if (ret)
 						return ret;
 				}
@@ -752,6 +735,9 @@ static long vfio_pci_ioctl(void *device_data,
 
 			break;
 		default:
+		{
+			struct vfio_region_info_cap_type cap_type;
+
 			if (info.index >=
 			    VFIO_PCI_NUM_REGIONS + vdev->num_regions)
 				return -EINVAL;
@@ -762,27 +748,23 @@ static long vfio_pci_ioctl(void *device_data,
 			info.size = vdev->region[i].size;
 			info.flags = vdev->region[i].flags;
 
-			ret = region_type_cap(vdev, &caps,
-					      vdev->region[i].type,
-					      vdev->region[i].subtype);
+			cap_type.type = vdev->region[i].type;
+			cap_type.subtype = vdev->region[i].subtype;
+
+			ret = vfio_info_add_capability(&info, &caps,
+						      VFIO_REGION_INFO_CAP_TYPE,
+						      &cap_type);
 			if (ret)
 				return ret;
+
+		}
 		}
 
-		if (caps.size) {
-			info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
-			if (info.argsz < sizeof(info) + caps.size) {
-				info.argsz = sizeof(info) + caps.size;
-				info.cap_offset = 0;
-			} else {
-				vfio_info_cap_shift(&caps, sizeof(info));
-				if (copy_to_user((void __user *)arg +
-						  sizeof(info), caps.buf,
-						  caps.size)) {
-					kfree(caps.buf);
-					return -EFAULT;
-				}
-				info.cap_offset = sizeof(info);
+		if (info.cap_offset) {
+			if (copy_to_user((void __user *)arg + info.cap_offset,
+					 caps.buf, caps.size)) {
+				kfree(caps.buf);
+				return -EFAULT;
 			}
 
 			kfree(caps.buf);
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 07/12] vfio: Introduce vfio_set_irqs_validate_and_prepare()
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
                   ` (5 preceding siblings ...)
  2016-10-17 21:22 ` [PATCH v9 06/12] vfio_pci: Update vfio_pci to use vfio_info_add_capability() Kirti Wankhede
@ 2016-10-17 21:22 ` Kirti Wankhede
  2016-10-17 21:22 ` [PATCH v9 08/12] vfio_pci: Updated to use vfio_set_irqs_validate_and_prepare() Kirti Wankhede
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-17 21:22 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, linux-kernel,
	Kirti Wankhede

Vendor drivers using the mediated device framework would use the same
mechanism to validate and prepare IRQs. Introduce this function to reduce
code duplication across drivers.
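
A minimal sketch of the intended call pattern, mirroring the vfio_pci
conversion later in this series ('num_irqs', 'max_irq_type', 'arg' and
'minsz' stand in for the caller's context):

	struct vfio_irq_set hdr;
	size_t data_size = 0;
	u8 *data = NULL;
	int ret;

	if (copy_from_user(&hdr, (void __user *)arg, minsz))
		return -EFAULT;

	ret = vfio_set_irqs_validate_and_prepare(&hdr, num_irqs, max_irq_type,
						 &data_size);
	if (ret)
		return ret;

	if (data_size) {
		data = memdup_user((void __user *)(arg + minsz), data_size);
		if (IS_ERR(data))
			return PTR_ERR(data);
	}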

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: Ie201f269dda0713ca18a07dc4852500bd8b48309
---
 drivers/vfio/vfio.c  | 39 +++++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h |  4 ++++
 2 files changed, 43 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index e96cb3f7a23c..10ef1c5fa762 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1877,6 +1877,45 @@ int vfio_info_add_capability(struct vfio_region_info *info,
 }
 EXPORT_SYMBOL(vfio_info_add_capability);
 
+int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int num_irqs,
+				       int max_irq_type, size_t *data_size)
+{
+	unsigned long minsz;
+
+	minsz = offsetofend(struct vfio_irq_set, count);
+
+	if ((hdr->argsz < minsz) || (hdr->index >= max_irq_type) ||
+	    (hdr->flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+				VFIO_IRQ_SET_ACTION_TYPE_MASK)))
+		return -EINVAL;
+
+	if (data_size)
+		*data_size = 0;
+
+	if (!(hdr->flags & VFIO_IRQ_SET_DATA_NONE)) {
+		size_t size;
+
+		if (hdr->flags & VFIO_IRQ_SET_DATA_BOOL)
+			size = sizeof(uint8_t);
+		else if (hdr->flags & VFIO_IRQ_SET_DATA_EVENTFD)
+			size = sizeof(int32_t);
+		else
+			return -EINVAL;
+
+		if ((hdr->argsz - minsz < hdr->count * size) ||
+		    (hdr->start >= num_irqs) ||
+		    (hdr->start + hdr->count > num_irqs))
+			return -EINVAL;
+
+		if (!data_size)
+			return -EINVAL;
+
+		*data_size = hdr->count * size;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
 
 /*
  * Pin a set of guest PFNs and return their associated host PFNs for local
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 854a4b40be02..31d059f1649b 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -112,6 +112,10 @@ extern int vfio_info_add_capability(struct vfio_region_info *info,
 				    struct vfio_info_cap *caps,
 				    int cap_type_id, void *cap_type);
 
+extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
+					      int num_irqs, int max_irq_type,
+					      size_t *data_size);
+
 struct pci_dev;
 #ifdef CONFIG_EEH
 extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 08/12] vfio_pci: Updated to use vfio_set_irqs_validate_and_prepare()
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
                   ` (6 preceding siblings ...)
  2016-10-17 21:22 ` [PATCH v9 07/12] vfio: Introduce vfio_set_irqs_validate_and_prepare() Kirti Wankhede
@ 2016-10-17 21:22 ` Kirti Wankhede
  2016-10-17 21:22 ` [PATCH v9 09/12] vfio_platform: " Kirti Wankhede
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-17 21:22 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, linux-kernel,
	Kirti Wankhede

Update vfio_pci.c to use vfio_set_irqs_validate_and_prepare().

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I9f3daba89d8dba5cb5b01a8cff420412f30686c7
---
 drivers/vfio/pci/vfio_pci.c | 29 +++++++++--------------------
 1 file changed, 9 insertions(+), 20 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 1ec0565b48ea..23e7f32a4a07 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -812,35 +812,24 @@ static long vfio_pci_ioctl(void *device_data,
 	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
 		struct vfio_irq_set hdr;
 		u8 *data = NULL;
-		int ret = 0;
+		int max, ret = 0;
+		size_t data_size = 0;
 
 		minsz = offsetofend(struct vfio_irq_set, count);
 
 		if (copy_from_user(&hdr, (void __user *)arg, minsz))
 			return -EFAULT;
 
-		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
-		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
-				  VFIO_IRQ_SET_ACTION_TYPE_MASK))
-			return -EINVAL;
+		max = vfio_pci_get_irq_count(vdev, hdr.index);
 
-		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
-			size_t size;
-			int max = vfio_pci_get_irq_count(vdev, hdr.index);
-
-			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
-				size = sizeof(uint8_t);
-			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
-				size = sizeof(int32_t);
-			else
-				return -EINVAL;
-
-			if (hdr.argsz - minsz < hdr.count * size ||
-			    hdr.start >= max || hdr.start + hdr.count > max)
-				return -EINVAL;
+		ret = vfio_set_irqs_validate_and_prepare(&hdr, max,
+						 VFIO_PCI_NUM_IRQS, &data_size);
+		if (ret)
+			return ret;
 
+		if (data_size) {
 			data = memdup_user((void __user *)(arg + minsz),
-					   hdr.count * size);
+					    data_size);
 			if (IS_ERR(data))
 				return PTR_ERR(data);
 		}
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 09/12] vfio_platform: Updated to use vfio_set_irqs_validate_and_prepare()
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
                   ` (7 preceding siblings ...)
  2016-10-17 21:22 ` [PATCH v9 08/12] vfio_pci: Updated to use vfio_set_irqs_validate_and_prepare() Kirti Wankhede
@ 2016-10-17 21:22 ` Kirti Wankhede
  2016-10-17 21:22 ` [PATCH v9 10/12] vfio: Add function to get device_api string from vfio_device_info.flags Kirti Wankhede
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-17 21:22 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, linux-kernel,
	Kirti Wankhede

Update vfio_platform_common.c to use
vfio_set_irqs_validate_and_prepare().

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: Id87cd6b78ae901610b39bf957974baa6f40cd7b0
---
 drivers/vfio/platform/vfio_platform_common.c | 31 +++++++---------------------
 1 file changed, 8 insertions(+), 23 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_common.c b/drivers/vfio/platform/vfio_platform_common.c
index d78142830754..4c27f4be3c3d 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -364,36 +364,21 @@ static long vfio_platform_ioctl(void *device_data,
 		struct vfio_irq_set hdr;
 		u8 *data = NULL;
 		int ret = 0;
+		size_t data_size = 0;
 
 		minsz = offsetofend(struct vfio_irq_set, count);
 
 		if (copy_from_user(&hdr, (void __user *)arg, minsz))
 			return -EFAULT;
 
-		if (hdr.argsz < minsz)
-			return -EINVAL;
-
-		if (hdr.index >= vdev->num_irqs)
-			return -EINVAL;
-
-		if (hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
-				  VFIO_IRQ_SET_ACTION_TYPE_MASK))
-			return -EINVAL;
-
-		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
-			size_t size;
-
-			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
-				size = sizeof(uint8_t);
-			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
-				size = sizeof(int32_t);
-			else
-				return -EINVAL;
-
-			if (hdr.argsz - minsz < size)
-				return -EINVAL;
+		ret = vfio_set_irqs_validate_and_prepare(&hdr, vdev->num_irqs,
+						 vdev->num_irqs, &data_size);
+		if (ret)
+			return ret;
 
-			data = memdup_user((void __user *)(arg + minsz), size);
+		if (data_size) {
+			data = memdup_user((void __user *)(arg + minsz),
+					    data_size);
 			if (IS_ERR(data))
 				return PTR_ERR(data);
 		}
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 10/12] vfio: Add function to get device_api string from vfio_device_info.flags
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
                   ` (8 preceding siblings ...)
  2016-10-17 21:22 ` [PATCH v9 09/12] vfio_platform: " Kirti Wankhede
@ 2016-10-17 21:22 ` Kirti Wankhede
  2016-10-20 19:34   ` Alex Williamson
  2016-10-17 21:22 ` [PATCH v9 11/12] docs: Add Documentation for Mediated devices Kirti Wankhede
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-17 21:22 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, linux-kernel,
	Kirti Wankhede

Function vfio_device_api_string() returns a device API string based on the
flags set in vfio_device_info.flags. Vendor drivers should use it to get the
string for the device_api attribute.
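
A hypothetical sketch of how a vendor driver's device_api sysfs show handler
could use this helper (the flag choice is illustrative and the surrounding
attribute plumbing is omitted):

	u32 flags = VFIO_DEVICE_FLAGS_PCI;	/* vendor-specific choice */

	return sprintf(buf, "%s\n", vfio_device_api_string(flags));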

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I42d29f475f02a7132ce13297fbf2b48f1da10995
---
 drivers/vfio/vfio.c  | 15 +++++++++++++++
 include/linux/vfio.h |  2 ++
 2 files changed, 17 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 10ef1c5fa762..aec470454a13 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1917,6 +1917,21 @@ int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int num_irqs,
 }
 EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
 
+const char *vfio_device_api_string(u32 flags)
+{
+	if (flags & VFIO_DEVICE_FLAGS_PCI)
+		return "vfio-pci";
+
+	if (flags & VFIO_DEVICE_FLAGS_PLATFORM)
+		return "vfio-platform";
+
+	if (flags & VFIO_DEVICE_FLAGS_AMBA)
+		return "vfio-amba";
+
+	return "";
+}
+EXPORT_SYMBOL(vfio_device_api_string);
+
 /*
  * Pin a set of guest PFNs and return their associated host PFNs for local
  * domain only.
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 31d059f1649b..fca2bf23c4f1 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -116,6 +116,8 @@ extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
 					      int num_irqs, int max_irq_type,
 					      size_t *data_size);
 
+extern const char *vfio_device_api_string(u32 flags);
+
 struct pci_dev;
 #ifdef CONFIG_EEH
 extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 11/12] docs: Add Documentation for Mediated devices
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
                   ` (9 preceding siblings ...)
  2016-10-17 21:22 ` [PATCH v9 10/12] vfio: Add function to get device_api string from vfio_device_info.flags Kirti Wankhede
@ 2016-10-17 21:22 ` Kirti Wankhede
  2016-10-25 16:17   ` Alex Williamson
  2016-10-17 21:22 ` [PATCH v9 12/12] docs: Sample driver to demonstrate how to use Mediated device framework Kirti Wankhede
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-17 21:22 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, linux-kernel,
	Kirti Wankhede

Add the file Documentation/vfio-mdev/vfio-mediated-device.txt, which includes
details of the mediated device framework.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
---
 Documentation/vfio-mdev/vfio-mediated-device.txt | 289 +++++++++++++++++++++++
 1 file changed, 289 insertions(+)
 create mode 100644 Documentation/vfio-mdev/vfio-mediated-device.txt

diff --git a/Documentation/vfio-mdev/vfio-mediated-device.txt b/Documentation/vfio-mdev/vfio-mediated-device.txt
new file mode 100644
index 000000000000..8746e88dca4d
--- /dev/null
+++ b/Documentation/vfio-mdev/vfio-mediated-device.txt
@@ -0,0 +1,289 @@
+/*
+ * VFIO Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *             Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+Virtual Function I/O (VFIO) Mediated devices[1]
+===============================================
+
+The number of use cases for virtualizing DMA devices that do not have built-in
+SR-IOV capability is increasing. Previously, to virtualize such devices,
+developers had to create their own management interfaces and APIs, and then
+integrate them with user space software. To simplify integration with user space
+software, we have identified common requirements and a unified management
+interface for such devices.
+
+The VFIO driver framework provides unified APIs for direct device access. It is
+an IOMMU/device-agnostic framework for exposing direct device access to user
+space in a secure, IOMMU-protected environment. This framework is used for
+multiple devices, such as GPUs, network adapters, and compute accelerators. With
+direct device access, virtual machines or user space applications have direct
+access to the physical device. This framework is reused for mediated devices.
+
+The mediated core driver provides a common interface for mediated device
+management that can be used by drivers of different devices. This module
+provides a generic interface to perform these operations:
+
+* Create and destroy a mediated device
+* Add a mediated device to and remove it from a mediated bus driver
+* Add a mediated device to and remove it from an IOMMU group
+
+The mediated core driver also provides an interface to register a bus driver.
+For example, the mediated VFIO mdev driver is designed for mediated devices and
+supports VFIO APIs. The mediated bus driver adds a mediated device to and
+removes it from a VFIO group.
+
+The following high-level block diagram shows the main components and interfaces
+in the VFIO mediated driver framework. The diagram shows NVIDIA, Intel, and IBM
+devices as examples, as these devices are the first devices to use this module.
+
+     +---------------+
+     |               |
+     | +-----------+ |  mdev_register_driver() +--------------+
+     | |           | +<------------------------+              |
+     | |  mdev     | |                         |              |
+     | |  bus      | +------------------------>+ vfio_mdev.ko |<-> VFIO user
+     | |  driver   | |     probe()/remove()    |              |    APIs
+     | |           | |                         +--------------+
+     | +-----------+ |
+     |               |
+     |  MDEV CORE    |
+     |   MODULE      |
+     |   mdev.ko     |
+     | +-----------+ |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         |  nvidia.ko   |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | | Physical  | |
+     | |  device   | |  mdev_register_device() +--------------+
+     | | interface | |<------------------------+              |
+     | |           | |                         |  i915.ko     |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | |           | |
+     | |           | |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         | ccw_device.ko|<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | +-----------+ |
+     +---------------+
+
+
+Registration Interfaces
+=======================
+
+The mediated core driver provides the following types of registration
+interfaces:
+
+* Registration interface for a mediated bus driver
+* Physical device driver interface
+
+Registration Interface for a Mediated Bus Driver
+------------------------------------------------
+
+The registration interface for a mediated bus driver provides the following
+structure to represent a mediated device's driver:
+
+     /*
+      * struct mdev_driver [2] - Mediated device's driver
+      * @name: driver name
+      * @probe: called when new device created
+      * @remove: called when device removed
+      * @driver: device driver structure
+      */
+     struct mdev_driver {
+	     const char *name;
+	     int  (*probe)  (struct device *dev);
+	     void (*remove) (struct device *dev);
+	     struct device_driver    driver;
+     };
+
+A mediated bus driver for mdev should use this structure in the function calls
+to register and unregister itself with the core driver:
+
+* Register:
+
+  extern int  mdev_register_driver(struct mdev_driver *drv,
+				   struct module *owner);
+
+* Unregister:
+
+  extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+The mediated bus driver is responsible for adding mediated devices to the VFIO
+group when devices are bound to the driver and removing mediated devices from
+the VFIO group when devices are unbound from the driver.
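+
+A minimal registration sketch (the driver name and the callback bodies are
+illustrative only, not taken from an in-tree driver):
+
+	static int my_mdev_probe(struct device *dev)
+	{
+		/* set up per-device state, add the device to a VFIO group */
+		return 0;
+	}
+
+	static void my_mdev_remove(struct device *dev)
+	{
+		/* undo whatever probe() set up */
+	}
+
+	static struct mdev_driver my_mdev_driver = {
+		.name   = "my_mdev_driver",
+		.probe  = my_mdev_probe,
+		.remove = my_mdev_remove,
+	};
+
+	/* typically called from the bus driver's module init/exit paths */
+	mdev_register_driver(&my_mdev_driver, THIS_MODULE);
+	mdev_unregister_driver(&my_mdev_driver);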
+
+
+Physical Device Driver Interface
+--------------------------------
+
+The physical device driver interface provides the parent_ops[3] structure to
+define the APIs to manage work in the mediated core driver that is related to
+the physical device.
+
+The structures in the parent_ops structure are as follows:
+
+* dev_attr_groups: attributes of the parent device
+* mdev_attr_groups: attributes of the mediated device
+* supported_config: attributes to define supported configurations
+
+The functions in the parent_ops structure are as follows:
+
+* create: allocate basic resources in a driver for a mediated device
+* remove: free resources in a driver when a mediated device is destroyed
+
+The callbacks in the parent_ops structure are as follows:
+
+* open: open callback of mediated device
+* close: close callback of mediated device
+* ioctl: ioctl callback of mediated device
+* read : read emulation callback
+* write: write emulation callback
+* mmap: mmap emulation callback
+
+A driver should use the parent_ops structure in the function call to register
+itself with the mdev core driver:
+
+extern int  mdev_register_device(struct device *dev,
+                                 const struct parent_ops *ops);
+
+However, the parent_ops structure is not required in the function call that a
+driver should use to unregister itself with the mdev core driver:
+
+extern void mdev_unregister_device(struct device *dev);
+
+
+Mediated Device Management Interface Through sysfs
+==================================================
+
+The management interface through sysfs enables user space software, such as
+libvirt, to query and configure mediated devices in a hardware-agnostic fashion.
+This management interface provides flexibility to the underlying physical
+device's driver to support features such as:
+
+* Mediated device hot plug
+* Multiple mediated devices in a single virtual machine
+* Multiple mediated devices from different physical devices
+
+Links in the mdev_bus Class Directory
+-------------------------------------
+The /sys/class/mdev_bus/ directory contains links to devices that are registered
+with the mdev core driver.
+
+Directories and files under the sysfs for Each Physical Device
+--------------------------------------------------------------
+
+|- [parent physical device]
+|--- Vendor-specific-attributes [optional]
+|--- [mdev_supported_types]
+|     |--- [<type-id>]
+|     |   |--- create
+|     |   |--- name
+|     |   |--- available_instances
+|     |   |--- device_api
+|     |   |--- description
+|     |   |--- [devices]
+|     |--- [<type-id>]
+|     |   |--- create
+|     |   |--- name
+|     |   |--- available_instances
+|     |   |--- device_api
+|     |   |--- description
+|     |   |--- [devices]
+|     |--- [<type-id>]
+|          |--- create
+|          |--- name
+|          |--- available_instances
+|          |--- device_api
+|          |--- description
+|          |--- [devices]
+
+* [mdev_supported_types]
+
+  The list of currently supported mediated device types and their details.
+
+  [<type-id>], device_api, and available_instances are mandatory attributes
+  that should be provided by vendor driver.
+
+* [<type-id>]
+
+  The [<type-id>] name is created by adding the device driver string as a
+  prefix to the string provided by the vendor driver. The format of this name
+  is as follows:
+
+	sprintf(buf, "%s-%s", dev_driver_string(parent->dev), group->name);
+
+* device_api
+
+  This attribute should show which device API is being created, for example,
+  "vfio-pci" for a PCI device.
+
+* available_instances
+
+  This attribute should show the number of devices of type <type-id> that can be
+  created.
+
+* [devices]
+
+  This directory contains links to the devices of type <type-id> that have
+  been created.
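+
+  Example: a mediated device is created by writing a UUID to the 'create'
+  attribute of the desired type (the path below is illustrative; use the
+  actual parent device path on the system):
+
+	# echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" > \
+	  /sys/devices/[parent physical device]/mdev_supported_types/[<type-id>]/create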
+
+Directories and Files Under the sysfs for Each mdev Device
+----------------------------------------------------------
+
+|- [parent phy device]
+|--- [$MDEV_UUID]
+         |--- remove
+         |--- mdev_type {link to its type}
+         |--- vendor-specific-attributes [optional]
+
+* remove (write only)
+Writing '1' to the 'remove' file destroys the mdev device. The vendor driver can
+fail the remove() callback if that device is active and the vendor driver
+doesn't support hot unplug.
+
+Example:
+	# echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove
+
+Mediated Device Hot Plug
+------------------------
+
+Mediated devices can be created and assigned at runtime. The procedure to hot
+plug a mediated device is the same as the procedure to hot plug a PCI device.
+
+Translation APIs for Mediated Devices
+=====================================
+
+The following APIs are provided for translating user pfn to host pfn in a VFIO
+driver:
+
+ extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+                            long npage, int prot, unsigned long *phys_pfn);
+
+ extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
+			      long npage);
+
+These functions call back into the back-end IOMMU module by using the pin_pages
+and unpin_pages callbacks of the struct vfio_iommu_driver_ops[4]. Currently
+these callbacks are supported in the TYPE1 IOMMU module. To enable them for
+other IOMMU backend modules, such as the PPC64 sPAPR module, those modules
+need to provide these two callback functions.
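+
+A minimal usage sketch from a vendor driver (error handling trimmed; 'dev' and
+'gpa' are placeholders for the mediated device's struct device and a guest
+physical address):
+
+	unsigned long user_pfn = gpa >> PAGE_SHIFT;
+	unsigned long host_pfn;
+	long ret;
+
+	ret = vfio_pin_pages(dev, &user_pfn, 1, IOMMU_READ | IOMMU_WRITE,
+			     &host_pfn);
+	if (ret <= 0)
+		return ret;
+
+	/* access the page at host_pfn, for example to emulate DMA */
+
+	vfio_unpin_pages(dev, &host_pfn, 1);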
+
+References
+----------
+
+[1] See Documentation/vfio.txt for more information on VFIO.
+[2] struct mdev_driver in include/linux/mdev.h
+[3] struct parent_ops in include/linux/mdev.h
+[4] struct vfio_iommu_driver_ops in include/linux/vfio.h
+
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v9 12/12] docs: Sample driver to demonstrate how to use Mediated device framework.
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
                   ` (10 preceding siblings ...)
  2016-10-17 21:22 ` [PATCH v9 11/12] docs: Add Documentation for Mediated devices Kirti Wankhede
@ 2016-10-17 21:22 ` Kirti Wankhede
       [not found]   ` <20161018025411.GA22572@bjsdjshi@linux.vnet.ibm.com>
  2016-10-17 21:41 ` [PATCH v9 00/12] Add Mediated device support Alex Williamson
  2016-10-24  7:07 ` Jike Song
  13 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-17 21:22 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, linux-kernel,
	Kirti Wankhede

The sample driver creates an mdev device that simulates a serial port over a
PCI card.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I857f8f12f8b275f2498dfe8c628a5cdc7193b1b2
---
 Documentation/vfio-mdev/Makefile                 |   13 +
 Documentation/vfio-mdev/mtty.c                   | 1429 ++++++++++++++++++++++
 Documentation/vfio-mdev/vfio-mediated-device.txt |  104 +-
 3 files changed, 1544 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/vfio-mdev/Makefile
 create mode 100644 Documentation/vfio-mdev/mtty.c

diff --git a/Documentation/vfio-mdev/Makefile b/Documentation/vfio-mdev/Makefile
new file mode 100644
index 000000000000..a932edbe38eb
--- /dev/null
+++ b/Documentation/vfio-mdev/Makefile
@@ -0,0 +1,13 @@
+#
+# Makefile for mtty.c file
+#
+KERNEL_DIR:=/lib/modules/$(shell uname -r)/build
+
+obj-m:=mtty.o
+
+modules clean modules_install:
+	$(MAKE) -C $(KERNEL_DIR) SUBDIRS=$(PWD) $@
+
+default: modules
+
+module: modules
diff --git a/Documentation/vfio-mdev/mtty.c b/Documentation/vfio-mdev/mtty.c
new file mode 100644
index 000000000000..8ac321c4c8f1
--- /dev/null
+++ b/Documentation/vfio-mdev/mtty.c
@@ -0,0 +1,1429 @@
+/*
+ * Mediated virtual PCI serial host device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *             Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Sample driver that creates an mdev device that simulates a serial port
+ * over a PCI card.
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/file.h>
+#include <linux/mdev.h>
+#include <linux/pci.h>
+#include <linux/serial.h>
+#include <uapi/linux/serial_reg.h>
+/*
+ * #defines
+ */
+
+#define VERSION_STRING  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+
+#define MTTY_CLASS_NAME "mtty"
+
+#define MTTY_NAME       "mtty"
+
+#define MTTY_STRING_LEN		16
+
+#define MTTY_CONFIG_SPACE_SIZE  0xff
+#define MTTY_IO_BAR_SIZE        0x8
+#define MTTY_MMIO_BAR_SIZE      0x100000
+
+#define STORE_LE16(addr, val)   (*(u16 *)addr = val)
+#define STORE_LE32(addr, val)   (*(u32 *)addr = val)
+
+#define MAX_FIFO_SIZE   16
+
+#define CIRCULAR_BUF_INC_IDX(idx)    (idx = (idx + 1) & (MAX_FIFO_SIZE - 1))
+
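+/*
+ * The VFIO region index is encoded in the upper bits (above bit 40) of the
+ * access offset so that mdev_access() can demultiplex config space and BAR
+ * accesses from a single read/write offset.
+ */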
+#define MTTY_VFIO_PCI_OFFSET_SHIFT   40
+
+#define MTTY_VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> MTTY_VFIO_PCI_OFFSET_SHIFT)
+#define MTTY_VFIO_PCI_INDEX_TO_OFFSET(index) \
+				((u64)(index) << MTTY_VFIO_PCI_OFFSET_SHIFT)
+#define MTTY_VFIO_PCI_OFFSET_MASK    \
+				(((u64)(1) << MTTY_VFIO_PCI_OFFSET_SHIFT) - 1)
+#define MAX_MTTYS	24
+
+/*
+ * Global Structures
+ */
+
+struct mtty_dev {
+	dev_t		vd_devt;
+	struct class	*vd_class;
+	struct cdev	vd_cdev;
+	struct idr	vd_idr;
+	struct device	dev;
+} mtty_dev;
+
+struct mdev_region_info {
+	u64 start;
+	u64 phys_start;
+	u32 size;
+	u64 vfio_offset;
+};
+
+#if defined(DEBUG_REGS)
+const char *wr_reg[] = {
+	"TX",
+	"IER",
+	"FCR",
+	"LCR",
+	"MCR",
+	"LSR",
+	"MSR",
+	"SCR"
+};
+
+const char *rd_reg[] = {
+	"RX",
+	"IER",
+	"IIR",
+	"LCR",
+	"MCR",
+	"LSR",
+	"MSR",
+	"SCR"
+};
+#endif
+
+/* loop back buffer */
+struct rxtx {
+	u8 fifo[MAX_FIFO_SIZE];
+	u8 head, tail;
+	u8 count;
+};
+
+struct serial_port {
+	u8 uart_reg[8];         /* 8 registers */
+	struct rxtx rxtx;       /* loop back buffer */
+	bool dlab;
+	bool overrun;
+	u16 divisor;
+	u8 fcr;                 /* FIFO control register */
+	u8 max_fifo_size;
+	u8 intr_trigger_level;  /* interrupt trigger level */
+};
+
+/* State of each mdev device */
+struct mdev_state {
+	int irq_fd;
+	struct file *intx_file;
+	struct file *msi_file;
+	int irq_index;
+	u8 *vconfig;
+	struct mutex ops_lock;
+	struct mdev_device *mdev;
+	struct mdev_region_info region_info[VFIO_PCI_NUM_REGIONS];
+	u32 bar_mask[VFIO_PCI_NUM_REGIONS];
+	struct list_head next;
+	struct serial_port s[2];
+	struct mutex rxtx_lock;
+	struct vfio_device_info dev_info;
+	int nr_ports;
+};
+
+struct mutex mdev_list_lock;
+struct list_head mdev_devices_list;
+
+static const struct file_operations vd_fops = {
+	.owner          = THIS_MODULE,
+};
+
+/* function prototypes */
+
+static int mtty_trigger_interrupt(uuid_le uuid);
+
+/* Helper functions */
+static struct mdev_state *find_mdev_state_by_uuid(uuid_le uuid)
+{
+	struct mdev_state *mds;
+
+	list_for_each_entry(mds, &mdev_devices_list, next) {
+		if (uuid_le_cmp(mds->mdev->uuid, uuid) == 0)
+			return mds;
+	}
+
+	return NULL;
+}
+
+void dump_buffer(char *buf, uint32_t count)
+{
+#if defined(DEBUG)
+	int i;
+
+	pr_info("Buffer:\n");
+	for (i = 0; i < count; i++) {
+		pr_info("%2x ", *(buf + i));
+		if ((i + 1) % 16 == 0)
+			pr_info("\n");
+	}
+#endif
+}
+
+static void mtty_create_config_space(struct mdev_state *mdev_state)
+{
+	/* PCI dev ID */
+	STORE_LE32((u32 *) &mdev_state->vconfig[0x0], 0x32534348);
+
+	/* Control: I/O+, Mem-, BusMaster- */
+	STORE_LE16((u16 *) &mdev_state->vconfig[0x4], 0x0001);
+
+	/* Status: capabilities list absent */
+	STORE_LE16((u16 *) &mdev_state->vconfig[0x6], 0x0200);
+
+	/* Rev ID */
+	mdev_state->vconfig[0x8] =  0x10;
+
+	/* programming interface class : 16550-compatible serial controller */
+	mdev_state->vconfig[0x9] =  0x02;
+
+	/* Sub class : 00 */
+	mdev_state->vconfig[0xa] =  0x00;
+
+	/* Base class : Simple Communication controllers */
+	mdev_state->vconfig[0xb] =  0x07;
+
+	/* base address registers */
+	/* BAR0: IO space */
+	STORE_LE32((u32 *) &mdev_state->vconfig[0x10], 0x000001);
+	mdev_state->bar_mask[0] = ~(MTTY_IO_BAR_SIZE) + 1;
+
+	if (mdev_state->nr_ports == 2) {
+		/* BAR1: IO space */
+		STORE_LE32((u32 *) &mdev_state->vconfig[0x14], 0x000001);
+		mdev_state->bar_mask[1] = ~(MTTY_IO_BAR_SIZE) + 1;
+	}
+
+	/* Subsystem ID */
+	STORE_LE32((u32 *) &mdev_state->vconfig[0x2c], 0x32534348);
+
+	mdev_state->vconfig[0x34] =  0x00;   /* Cap Ptr */
+	mdev_state->vconfig[0x3d] =  0x01;   /* interrupt pin (INTA#) */
+
+	/* Vendor specific data */
+	mdev_state->vconfig[0x40] =  0x23;
+	mdev_state->vconfig[0x43] =  0x80;
+	mdev_state->vconfig[0x44] =  0x23;
+	mdev_state->vconfig[0x48] =  0x23;
+	mdev_state->vconfig[0x4c] =  0x23;
+
+	mdev_state->vconfig[0x60] =  0x50;
+	mdev_state->vconfig[0x61] =  0x43;
+	mdev_state->vconfig[0x62] =  0x49;
+	mdev_state->vconfig[0x63] =  0x20;
+	mdev_state->vconfig[0x64] =  0x53;
+	mdev_state->vconfig[0x65] =  0x65;
+	mdev_state->vconfig[0x66] =  0x72;
+	mdev_state->vconfig[0x67] =  0x69;
+	mdev_state->vconfig[0x68] =  0x61;
+	mdev_state->vconfig[0x69] =  0x6c;
+	mdev_state->vconfig[0x6a] =  0x2f;
+	mdev_state->vconfig[0x6b] =  0x55;
+	mdev_state->vconfig[0x6c] =  0x41;
+	mdev_state->vconfig[0x6d] =  0x52;
+	mdev_state->vconfig[0x6e] =  0x54;
+}
+
+static void handle_pci_cfg_write(struct mdev_state *mdev_state, u16 offset,
+				 char *buf, u32 count)
+{
+	u32 cfg_addr, bar_mask, bar_index = 0;
+
+	switch (offset) {
+	case 0x04: /* device control */
+	case 0x06: /* device status */
+		/* do nothing */
+		break;
+	case 0x3c:  /* interrupt line */
+		mdev_state->vconfig[0x3c] = buf[0];
+		break;
+	case 0x3d:
+		/*
+		 * Interrupt Pin is hardwired to INTA.
+		 * This field is write protected by hardware
+		 */
+		break;
+	case 0x10:  /* BAR0 */
+	case 0x14:  /* BAR1 */
+		if (offset == 0x10)
+			bar_index = 0;
+		else if (offset == 0x14)
+			bar_index = 1;
+
+		if ((mdev_state->nr_ports == 1) && (bar_index == 1)) {
+			STORE_LE32(&mdev_state->vconfig[offset], 0);
+			break;
+		}
+
+		cfg_addr = *(u32 *)buf;
+		pr_info("BAR%d addr 0x%x\n", bar_index, cfg_addr);
+
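+		/*
+		 * BAR sizing: when the guest writes all 1s it expects to
+		 * read back the size mask, so apply the BAR mask here.
+		 */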
+		if (cfg_addr == 0xffffffff) {
+			bar_mask = mdev_state->bar_mask[bar_index];
+			cfg_addr = (cfg_addr & bar_mask);
+		}
+
+		cfg_addr |= (mdev_state->vconfig[offset] & 0x3ul);
+		STORE_LE32(&mdev_state->vconfig[offset], cfg_addr);
+		break;
+	case 0x18:  /* BAR2 */
+	case 0x1c:  /* BAR3 */
+	case 0x20:  /* BAR4 */
+		STORE_LE32(&mdev_state->vconfig[offset], 0);
+		break;
+	default:
+		pr_info("PCI config write @0x%x of %d bytes not handled\n",
+			offset, count);
+		break;
+	}
+}
+
+static void handle_bar_write(unsigned int index, struct mdev_state *mdev_state,
+				u16 offset, char *buf, u32 count)
+{
+	u8 data = *buf;
+
+	/* Handle data written by guest */
+	switch (offset) {
+	case UART_TX:
+		/* if DLAB set, data is LSB of divisor */
+		if (mdev_state->s[index].dlab) {
+			mdev_state->s[index].divisor |= data;
+			break;
+		}
+
+		mutex_lock(&mdev_state->rxtx_lock);
+
+		/* save in TX buffer */
+		if (mdev_state->s[index].rxtx.count <
+				mdev_state->s[index].max_fifo_size) {
+			mdev_state->s[index].rxtx.fifo[
+					mdev_state->s[index].rxtx.head] = data;
+			mdev_state->s[index].rxtx.count++;
+			CIRCULAR_BUF_INC_IDX(mdev_state->s[index].rxtx.head);
+			mdev_state->s[index].overrun = false;
+
+			/*
+			 * Trigger interrupt if receive data interrupt is
+			 * enabled and fifo reached trigger level
+			 */
+			if ((mdev_state->s[index].uart_reg[UART_IER] &
+						UART_IER_RDI) &&
+			   (mdev_state->s[index].rxtx.count ==
+				    mdev_state->s[index].intr_trigger_level)) {
+				/* trigger interrupt */
+#if defined(DEBUG_INTR)
+				pr_err("Serial port %d: Fifo level trigger\n",
+					index);
+#endif
+				mtty_trigger_interrupt(mdev_state->mdev->uuid);
+			}
+		} else {
+#if defined(DEBUG_INTR)
+			pr_err("Serial port %d: Buffer Overflow\n", index);
+#endif
+			mdev_state->s[index].overrun = true;
+
+			/*
+			 * Trigger interrupt if receiver line status interrupt
+			 * is enabled
+			 */
+			if (mdev_state->s[index].uart_reg[UART_IER] &
+								UART_IER_RLSI)
+				mtty_trigger_interrupt(mdev_state->mdev->uuid);
+		}
+		mutex_unlock(&mdev_state->rxtx_lock);
+		break;
+
+	case UART_IER:
+		/* if DLAB set, data is MSB of divisor */
+		if (mdev_state->s[index].dlab)
+			mdev_state->s[index].divisor |= (u16)data << 8;
+		else {
+			mdev_state->s[index].uart_reg[offset] = data;
+			mutex_lock(&mdev_state->rxtx_lock);
+			if ((data & UART_IER_THRI) &&
+			    (mdev_state->s[index].rxtx.head ==
+					mdev_state->s[index].rxtx.tail)) {
+#if defined(DEBUG_INTR)
+				pr_err("Serial port %d: IER_THRI write\n",
+					index);
+#endif
+				mtty_trigger_interrupt(mdev_state->mdev->uuid);
+			}
+
+			mutex_unlock(&mdev_state->rxtx_lock);
+		}
+
+		break;
+
+	case UART_FCR:
+		mdev_state->s[index].fcr = data;
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		if (data & (UART_FCR_CLEAR_RCVR | UART_FCR_CLEAR_XMIT)) {
+			/* clear loop back FIFO */
+			mdev_state->s[index].rxtx.count = 0;
+			mdev_state->s[index].rxtx.head = 0;
+			mdev_state->s[index].rxtx.tail = 0;
+		}
+		mutex_unlock(&mdev_state->rxtx_lock);
+
+		switch (data & UART_FCR_TRIGGER_MASK) {
+		case UART_FCR_TRIGGER_1:
+			mdev_state->s[index].intr_trigger_level = 1;
+			break;
+
+		case UART_FCR_TRIGGER_4:
+			mdev_state->s[index].intr_trigger_level = 4;
+			break;
+
+		case UART_FCR_TRIGGER_8:
+			mdev_state->s[index].intr_trigger_level = 8;
+			break;
+
+		case UART_FCR_TRIGGER_14:
+			mdev_state->s[index].intr_trigger_level = 14;
+			break;
+		}
+
+		/*
+		 * Otherwise set the trigger level to 1, or implement a timer
+		 * with a timeout of 4 characters and, when it expires, set
+		 * the Receive data timeout in the IIR register
+		 */
+		mdev_state->s[index].intr_trigger_level = 1;
+		if (data & UART_FCR_ENABLE_FIFO)
+			mdev_state->s[index].max_fifo_size = MAX_FIFO_SIZE;
+		else {
+			mdev_state->s[index].max_fifo_size = 1;
+			mdev_state->s[index].intr_trigger_level = 1;
+		}
+
+		break;
+
+	case UART_LCR:
+		if (data & UART_LCR_DLAB) {
+			mdev_state->s[index].dlab = true;
+			mdev_state->s[index].divisor = 0;
+		} else
+			mdev_state->s[index].dlab = false;
+
+		mdev_state->s[index].uart_reg[offset] = data;
+		break;
+
+	case UART_MCR:
+		mdev_state->s[index].uart_reg[offset] = data;
+
+		if ((mdev_state->s[index].uart_reg[UART_IER] & UART_IER_MSI) &&
+				(data & UART_MCR_OUT2)) {
+#if defined(DEBUG_INTR)
+			pr_err("Serial port %d: MCR_OUT2 write\n", index);
+#endif
+			mtty_trigger_interrupt(mdev_state->mdev->uuid);
+		}
+
+		if ((mdev_state->s[index].uart_reg[UART_IER] & UART_IER_MSI) &&
+				(data & (UART_MCR_RTS | UART_MCR_DTR))) {
+#if defined(DEBUG_INTR)
+			pr_err("Serial port %d: MCR RTS/DTR write\n", index);
+#endif
+			mtty_trigger_interrupt(mdev_state->mdev->uuid);
+		}
+		break;
+
+	case UART_LSR:
+	case UART_MSR:
+		/* do nothing */
+		break;
+
+	case UART_SCR:
+		mdev_state->s[index].uart_reg[offset] = data;
+		break;
+
+	default:
+		break;
+	}
+}
+
+static void handle_bar_read(unsigned int index, struct mdev_state *mdev_state,
+			    u16 offset, char *buf, u32 count)
+{
+	/* Handle read requests by guest */
+	switch (offset) {
+	case UART_RX:
+		/* if DLAB set, data is LSB of divisor */
+		if (mdev_state->s[index].dlab) {
+			*buf  = (u8)mdev_state->s[index].divisor;
+			break;
+		}
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		/* return data in tx buffer */
+		if (mdev_state->s[index].rxtx.head !=
+				 mdev_state->s[index].rxtx.tail) {
+			*buf = mdev_state->s[index].rxtx.fifo[
+						mdev_state->s[index].rxtx.tail];
+			mdev_state->s[index].rxtx.count--;
+			CIRCULAR_BUF_INC_IDX(mdev_state->s[index].rxtx.tail);
+		}
+
+		if (mdev_state->s[index].rxtx.head ==
+				mdev_state->s[index].rxtx.tail) {
+			/*
+			 * Trigger interrupt if tx buffer empty interrupt is
+			 * enabled and fifo is empty
+			 */
+#if defined(DEBUG_INTR)
+			pr_err("Serial port %d: Buffer Empty\n", index);
+#endif
+			if (mdev_state->s[index].uart_reg[UART_IER] &
+							 UART_IER_THRI)
+				mtty_trigger_interrupt(mdev_state->mdev->uuid);
+		}
+		mutex_unlock(&mdev_state->rxtx_lock);
+
+		break;
+
+	case UART_IER:
+		if (mdev_state->s[index].dlab) {
+			*buf = (u8)(mdev_state->s[index].divisor >> 8);
+			break;
+		}
+		*buf = mdev_state->s[index].uart_reg[offset] & 0x0f;
+		break;
+
+	case UART_IIR:
+	{
+		u8 ier = mdev_state->s[index].uart_reg[UART_IER];
+		*buf = 0;
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		/* Interrupt priority 1: Parity, overrun, framing or break */
+		if ((ier & UART_IER_RLSI) && mdev_state->s[index].overrun)
+			*buf |= UART_IIR_RLSI;
+
+		/* Interrupt priority 2: Fifo trigger level reached */
+		if ((ier & UART_IER_RDI) &&
+		    (mdev_state->s[index].rxtx.count ==
+		      mdev_state->s[index].intr_trigger_level))
+			*buf |= UART_IIR_RDI;
+
+		/* Interrupt priority 3: transmitter holding register empty */
+		if ((ier & UART_IER_THRI) &&
+		    (mdev_state->s[index].rxtx.head ==
+				mdev_state->s[index].rxtx.tail))
+			*buf |= UART_IIR_THRI;
+
+		/* Interrupt priority 4: Modem status: CTS, DSR, RI or DCD */
+		if ((ier & UART_IER_MSI) &&
+		    (mdev_state->s[index].uart_reg[UART_MCR] &
+				 (UART_MCR_RTS | UART_MCR_DTR)))
+			*buf |= UART_IIR_MSI;
+
+		/* bit0: 0=> interrupt pending, 1=> no interrupt is pending */
+		if (*buf == 0)
+			*buf = UART_IIR_NO_INT;
+
+		/* set bit 6 & 7 to be 16550 compatible */
+		*buf |= 0xC0;
+		mutex_unlock(&mdev_state->rxtx_lock);
+	}
+	break;
+
+	case UART_LCR:
+	case UART_MCR:
+		*buf = mdev_state->s[index].uart_reg[offset];
+		break;
+
+	case UART_LSR:
+	{
+		u8 lsr = 0;
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		/* at least one char in FIFO */
+		if (mdev_state->s[index].rxtx.head !=
+				 mdev_state->s[index].rxtx.tail)
+			lsr |= UART_LSR_DR;
+
+		/* if FIFO overrun */
+		if (mdev_state->s[index].overrun)
+			lsr |= UART_LSR_OE;
+
+		/* transmit FIFO empty and transmitter empty */
+		if (mdev_state->s[index].rxtx.head ==
+				 mdev_state->s[index].rxtx.tail)
+			lsr |= UART_LSR_TEMT | UART_LSR_THRE;
+
+		mutex_unlock(&mdev_state->rxtx_lock);
+		*buf = lsr;
+		break;
+	}
+	case UART_MSR:
+		*buf = UART_MSR_DSR | UART_MSR_DDSR | UART_MSR_DCD;
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		/* if AFE is 1 and FIFO has space, set CTS bit */
+		if (mdev_state->s[index].uart_reg[UART_MCR] &
+						 UART_MCR_AFE) {
+			if (mdev_state->s[index].rxtx.count <
+					mdev_state->s[index].max_fifo_size)
+				*buf |= UART_MSR_CTS | UART_MSR_DCTS;
+		} else
+			*buf |= UART_MSR_CTS | UART_MSR_DCTS;
+		mutex_unlock(&mdev_state->rxtx_lock);
+
+		break;
+
+	case UART_SCR:
+		*buf = mdev_state->s[index].uart_reg[offset];
+		break;
+
+	default:
+		break;
+	}
+}
+
+static void mdev_read_base(struct mdev_state *mdev_state)
+{
+	int index, pos;
+	u32 start_lo, start_hi;
+	u32 mem_type;
+
+	pos = PCI_BASE_ADDRESS_0;
+
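+	/*
+	 * Read the guest-programmed BAR values back from vconfig to record
+	 * the start address of each region.
+	 */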
+	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+		if (!mdev_state->region_info[index].size)
+			continue;
+
+		start_lo = (*(u32 *)(mdev_state->vconfig + pos)) &
+			PCI_BASE_ADDRESS_MEM_MASK;
+		mem_type = (*(u32 *)(mdev_state->vconfig + pos)) &
+			PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+		switch (mem_type) {
+		case PCI_BASE_ADDRESS_MEM_TYPE_64:
+			start_hi = (*(u32 *)(mdev_state->vconfig + pos + 4));
+			pos += 4;
+			break;
+		case PCI_BASE_ADDRESS_MEM_TYPE_32:
+		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+			/* 1M mem BAR treated as 32-bit BAR */
+		default:
+			/* unknown mem type treated as 32-bit BAR */
+			start_hi = 0;
+			break;
+		}
+		pos += 4;
+		mdev_state->region_info[index].start = ((u64)start_hi << 32) |
+							start_lo;
+	}
+}
+
+static ssize_t mdev_access(struct mdev_device *mdev, char *buf,
+		size_t count, loff_t pos, bool is_write)
+{
+	struct mdev_state *mdev_state;
+	unsigned int index;
+	loff_t offset;
+	int ret = 0;
+
+	if (!mdev || !buf)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state) {
+		pr_err("%s mdev_state not found\n", __func__);
+		return -EINVAL;
+	}
+
+	mutex_lock(&mdev_state->ops_lock);
+
+	index = MTTY_VFIO_PCI_OFFSET_TO_INDEX(pos);
+	offset = pos & MTTY_VFIO_PCI_OFFSET_MASK;
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+
+#if defined(DEBUG)
+		pr_info("%s: PCI config space %s at offset 0x%llx\n",
+			 __func__, is_write ? "write" : "read", offset);
+#endif
+		if (is_write) {
+			dump_buffer(buf, count);
+			handle_pci_cfg_write(mdev_state, offset, buf, count);
+		} else {
+			memcpy(buf, (mdev_state->vconfig + offset), count);
+			dump_buffer(buf, count);
+		}
+
+		break;
+
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+		if (!mdev_state->region_info[index].start)
+			mdev_read_base(mdev_state);
+
+		if (is_write) {
+			dump_buffer(buf, count);
+
+#if defined(DEBUG_REGS)
+			pr_info("%s: BAR%d  WR @0x%llx %s val:0x%02x dlab:%d\n",
+				__func__, index, offset, wr_reg[offset],
+				(u8)*buf, mdev_state->s[index].dlab);
+#endif
+			handle_bar_write(index, mdev_state, offset, buf, count);
+		} else {
+			handle_bar_read(index, mdev_state, offset, buf, count);
+			dump_buffer(buf, count);
+
+#if defined(DEBUG_REGS)
+			pr_info("%s: BAR%d  RD @0x%llx %s val:0x%02x dlab:%d\n",
+				__func__, index, offset, rd_reg[offset],
+				(u8)*buf, mdev_state->s[index].dlab);
+#endif
+		}
+		break;
+
+	default:
+		ret = -1;
+		goto accessfailed;
+	}
+
+	ret = count;
+
+
+accessfailed:
+	mutex_unlock(&mdev_state->ops_lock);
+
+	return ret;
+}
+
+int mtty_create(struct kobject *kobj, struct mdev_device *mdev)
+{
+	struct mdev_state *mdev_state;
+	char name[MTTY_STRING_LEN];
+	int nr_ports = 0, i;
+
+	if (!mdev)
+		return -EINVAL;
+
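+	/*
+	 * The supported type name is "<driver>-1" or "<driver>-2"; derive
+	 * the number of ports from the type this device was created with.
+	 */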
+	for (i = 0; i < 2; i++) {
+		snprintf(name, MTTY_STRING_LEN, "%s-%d",
+			dev_driver_string(mdev->parent->dev), i + 1);
+		if (!strcmp(kobj->name, name)) {
+			nr_ports = i + 1;
+			break;
+		}
+	}
+
+	if (!nr_ports)
+		return -EINVAL;
+
+	mdev_state = kzalloc(sizeof(struct mdev_state), GFP_KERNEL);
+	if (mdev_state == NULL)
+		return -ENOMEM;
+
+	mdev_state->nr_ports = nr_ports;
+	mdev_state->irq_index = -1;
+	mdev_state->s[0].max_fifo_size = MAX_FIFO_SIZE;
+	mdev_state->s[1].max_fifo_size = MAX_FIFO_SIZE;
+	mutex_init(&mdev_state->rxtx_lock);
+	mdev_state->vconfig = kzalloc(MTTY_CONFIG_SPACE_SIZE, GFP_KERNEL);
+
+	if (mdev_state->vconfig == NULL) {
+		kfree(mdev_state);
+		return -ENOMEM;
+	}
+
+	mutex_init(&mdev_state->ops_lock);
+	mdev_state->mdev = mdev;
+	mdev_set_drvdata(mdev, mdev_state);
+
+	mtty_create_config_space(mdev_state);
+
+	mutex_lock(&mdev_list_lock);
+	list_add(&mdev_state->next, &mdev_devices_list);
+	mutex_unlock(&mdev_list_lock);
+
+	return 0;
+}
+
+int mtty_remove(struct mdev_device *mdev)
+{
+	struct mdev_state *mds, *tmp_mds;
+	struct mdev_state *mdev_state = mdev_get_drvdata(mdev);
+	int ret = -EINVAL;
+
+	mutex_lock(&mdev_list_lock);
+	list_for_each_entry_safe(mds, tmp_mds, &mdev_devices_list, next) {
+		if (mdev_state == mds) {
+			list_del(&mdev_state->next);
+			mdev_set_drvdata(mdev, NULL);
+			kfree(mdev_state->vconfig);
+			kfree(mdev_state);
+			ret = 0;
+			break;
+		}
+	}
+	mutex_unlock(&mdev_list_lock);
+
+	return ret;
+}
+
+int mtty_reset(struct mdev_device *mdev)
+{
+	struct mdev_state *mdev_state;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state)
+		return -EINVAL;
+
+	pr_info("%s: called\n", __func__);
+
+	return 0;
+}
+
+ssize_t mtty_read(struct mdev_device *mdev, char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	return mdev_access(mdev, buf, count, *ppos, false);
+}
+
+ssize_t mtty_write(struct mdev_device *mdev, const char __user *buf,
+			 size_t count, loff_t *ppos)
+{
+	return mdev_access(mdev, (char *)buf, count, *ppos, true);
+}
+
+static int mtty_set_irqs(struct mdev_device *mdev, uint32_t flags,
+			 unsigned int index, unsigned int start,
+			 unsigned int count, void *data)
+{
+	int ret = 0;
+	struct mdev_state *mdev_state;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state)
+		return -EINVAL;
+
+	mutex_lock(&mdev_state->ops_lock);
+	switch (index) {
+	case VFIO_PCI_INTX_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+		{
+			if (flags & VFIO_IRQ_SET_DATA_NONE) {
+				pr_info("%s: disable INTx\n", __func__);
+				break;
+			}
+
+			if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+				int fd = *(int *)data;
+
+				if (fd > 0) {
+					struct fd irqfd;
+
+					irqfd = fdget(fd);
+					if (!irqfd.file) {
+						ret = -EBADF;
+						break;
+					}
+					mdev_state->intx_file = irqfd.file;
+					fdput(irqfd);
+					mdev_state->irq_fd = fd;
+					mdev_state->irq_index = index;
+					break;
+				}
+			}
+			break;
+		}
+		}
+		break;
+	case VFIO_PCI_MSI_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			if (flags & VFIO_IRQ_SET_DATA_NONE) {
+				pr_info("%s: disable MSI\n", __func__);
+				mdev_state->irq_index = VFIO_PCI_INTX_IRQ_INDEX;
+				break;
+			}
+			if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+				int fd = *(int *)data;
+				struct fd irqfd;
+
+				if (fd <= 0)
+					break;
+
+				if (mdev_state->msi_file)
+					break;
+
+				irqfd = fdget(fd);
+				if (!irqfd.file) {
+					ret = -EBADF;
+					break;
+				}
+
+				mdev_state->msi_file = irqfd.file;
+				fdput(irqfd);
+				mdev_state->irq_fd = fd;
+				mdev_state->irq_index = index;
+			}
+			break;
+	}
+	break;
+	case VFIO_PCI_MSIX_IRQ_INDEX:
+		pr_info("%s: MSIX_IRQ\n", __func__);
+		break;
+	case VFIO_PCI_ERR_IRQ_INDEX:
+		pr_info("%s: ERR_IRQ\n", __func__);
+		break;
+	case VFIO_PCI_REQ_IRQ_INDEX:
+		pr_info("%s: REQ_IRQ\n", __func__);
+		break;
+	}
+
+	mutex_unlock(&mdev_state->ops_lock);
+	return ret;
+}
+
+static int mtty_trigger_interrupt(uuid_le uuid)
+{
+	mm_segment_t old_fs;
+	u64 val = 1;
+	loff_t offset = 0;
+	int ret = -1;
+	struct file *pfile = NULL;
+	struct mdev_state *mdev_state;
+
+	mdev_state = find_mdev_state_by_uuid(uuid);
+
+	if (!mdev_state) {
+		pr_info("%s: mdev not found\n", __func__);
+		return -EINVAL;
+	}
+
+	if ((mdev_state->irq_index == VFIO_PCI_MSI_IRQ_INDEX) &&
+			(mdev_state->msi_file == NULL))
+		return -EINVAL;
+	else if ((mdev_state->irq_index == VFIO_PCI_INTX_IRQ_INDEX) &&
+			(mdev_state->intx_file == NULL)) {
+		pr_info("%s: Intr file not found\n", __func__);
+		return -EINVAL;
+	}
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+
+	if (mdev_state->irq_index == VFIO_PCI_MSI_IRQ_INDEX)
+		pfile = mdev_state->msi_file;
+	else
+		pfile = mdev_state->intx_file;
+
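+	/* signal the user's eventfd by writing a count of 1 to it */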
+	if (pfile && pfile->f_op && pfile->f_op->write) {
+		ret = pfile->f_op->write(pfile, (char *)&val, sizeof(val),
+					 &offset);
+#if defined(DEBUG_INTR)
+		pr_info("Intx triggered\n");
+#endif
+	} else
+		pr_err("%s: pfile not valid, intr_type = %d\n", __func__,
+				mdev_state->irq_index);
+
+	set_fs(old_fs);
+
+	if (ret < 0)
+		pr_err("%s: eventfd write failed (%d)\n", __func__, ret);
+
+	return ret;
+}
+
+int mtty_get_region_info(struct mdev_device *mdev,
+			 struct vfio_region_info *region_info,
+			 u16 *cap_type_id, void **cap_type)
+{
+	unsigned int size = 0;
+	struct mdev_state *mdev_state;
+	int bar_index;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state)
+		return -EINVAL;
+
+	mutex_lock(&mdev_state->ops_lock);
+	bar_index = region_info->index;
+
+	switch (bar_index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		size = MTTY_CONFIG_SPACE_SIZE;
+		break;
+	case VFIO_PCI_BAR0_REGION_INDEX:
+		size = MTTY_IO_BAR_SIZE;
+		break;
+	case VFIO_PCI_BAR1_REGION_INDEX:
+		if (mdev_state->nr_ports == 2)
+			size = MTTY_IO_BAR_SIZE;
+		break;
+	default:
+		size = 0;
+		break;
+	}
+
+	mdev_state->region_info[bar_index].size = size;
+	mdev_state->region_info[bar_index].vfio_offset =
+		MTTY_VFIO_PCI_INDEX_TO_OFFSET(bar_index);
+
+	region_info->size = size;
+	region_info->offset = MTTY_VFIO_PCI_INDEX_TO_OFFSET(bar_index);
+	region_info->flags = VFIO_REGION_INFO_FLAG_READ |
+		VFIO_REGION_INFO_FLAG_WRITE;
+	mutex_unlock(&mdev_state->ops_lock);
+	return 0;
+}
+
+int mtty_get_irq_info(struct mdev_device *mdev, struct vfio_irq_info *irq_info)
+{
+	switch (irq_info->index) {
+	case VFIO_PCI_INTX_IRQ_INDEX:
+	case VFIO_PCI_MSI_IRQ_INDEX:
+	case VFIO_PCI_REQ_IRQ_INDEX:
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	irq_info->flags = VFIO_IRQ_INFO_EVENTFD;
+	irq_info->count = 1;
+
+	if (irq_info->index == VFIO_PCI_INTX_IRQ_INDEX)
+		irq_info->flags |= (VFIO_IRQ_INFO_MASKABLE |
+				VFIO_IRQ_INFO_AUTOMASKED);
+	else
+		irq_info->flags |= VFIO_IRQ_INFO_NORESIZE;
+
+	return 0;
+}
+
+int mtty_get_device_info(struct mdev_device *mdev,
+			 struct vfio_device_info *dev_info)
+{
+	dev_info->flags = VFIO_DEVICE_FLAGS_PCI;
+	dev_info->num_regions = VFIO_PCI_NUM_REGIONS;
+	dev_info->num_irqs = VFIO_PCI_NUM_IRQS;
+
+	return 0;
+}
+
+static long mtty_ioctl(struct mdev_device *mdev, unsigned int cmd,
+			unsigned long arg)
+{
+	int ret = 0;
+	unsigned long minsz;
+	struct mdev_state *mdev_state;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state)
+		return -ENODEV;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		ret = mtty_get_device_info(mdev, &info);
+		if (ret)
+			return ret;
+
+		memcpy(&mdev_state->dev_info, &info, sizeof(info));
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+		u16 cap_type_id = 0;
+		void *cap_type = NULL;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		ret = mtty_get_region_info(mdev, &info, &cap_type_id,
+					   &cap_type);
+		if (ret)
+			return ret;
+
+		ret = vfio_info_add_capability(&info, &caps, cap_type_id,
+						cap_type);
+		if (ret)
+			return ret;
+
+		if (info.cap_offset) {
+			if (copy_to_user((void __user *)arg + info.cap_offset,
+						caps.buf, caps.size)) {
+				kfree(caps.buf);
+				return -EFAULT;
+			}
+			kfree(caps.buf);
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	{
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if ((info.argsz < minsz) ||
+		    (info.index >= mdev_state->dev_info.num_irqs))
+			return -EINVAL;
+
+		ret = mtty_get_irq_info(mdev, &info);
+		if (ret)
+			return ret;
+
+		if (info.count == -1)
+			return -EINVAL;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_DEVICE_SET_IRQS:
+	{
+		struct vfio_irq_set hdr;
+		u8 *data = NULL, *ptr = NULL;
+		size_t data_size = 0;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		ret = vfio_set_irqs_validate_and_prepare(&hdr,
+				mdev_state->dev_info.num_irqs,
+				VFIO_PCI_NUM_IRQS,
+				&data_size);
+		if (ret)
+			return ret;
+
+		if (data_size) {
+			ptr = data = memdup_user((void __user *)(arg + minsz),
+					data_size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+		}
+
+		ret = mtty_set_irqs(mdev, hdr.flags, hdr.index, hdr.start,
+				hdr.count, data);
+
+		kfree(ptr);
+		return ret;
+	}
+	case VFIO_DEVICE_RESET:
+		return mtty_reset(mdev);
+	}
+	return -ENOTTY;
+}
+
+int mtty_open(struct mdev_device *mdev)
+{
+	pr_info("%s\n", __func__);
+	return 0;
+}
+
+void mtty_close(struct mdev_device *mdev)
+{
+	pr_info("%s\n", __func__);
+}
+
+static ssize_t
+sample_mtty_dev_show(struct device *dev, struct device_attribute *attr,
+		     char *buf)
+{
+	return sprintf(buf, "This is phy device\n");
+}
+
+static DEVICE_ATTR_RO(sample_mtty_dev);
+
+static struct attribute *mtty_dev_attrs[] = {
+	&dev_attr_sample_mtty_dev.attr,
+	NULL,
+};
+
+static const struct attribute_group mtty_dev_group = {
+	.name  = "mtty_dev",
+	.attrs = mtty_dev_attrs,
+};
+
+const struct attribute_group *mtty_dev_groups[] = {
+	&mtty_dev_group,
+	NULL,
+};
+
+static ssize_t
+sample_mdev_dev_show(struct device *dev, struct device_attribute *attr,
+		     char *buf)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	if (mdev)
+		return sprintf(buf, "This is MDEV %s\n", dev_name(&mdev->dev));
+
+	return sprintf(buf, "\n");
+}
+
+static DEVICE_ATTR_RO(sample_mdev_dev);
+
+static struct attribute *mdev_dev_attrs[] = {
+	&dev_attr_sample_mdev_dev.attr,
+	NULL,
+};
+
+static const struct attribute_group mdev_dev_group = {
+	.name  = "vendor",
+	.attrs = mdev_dev_attrs,
+};
+
+const struct attribute_group *mdev_dev_groups[] = {
+	&mdev_dev_group,
+	NULL,
+};
+
+static ssize_t
+name_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	char name[MTTY_STRING_LEN];
+	int i;
+	const char *name_str[2] = {"Single port serial", "Dual port serial"};
+
+	for (i = 0; i < 2; i++) {
+		snprintf(name, MTTY_STRING_LEN, "%s-%d",
+			 dev_driver_string(dev), i + 1);
+		if (!strcmp(kobj->name, name))
+			return sprintf(buf, "%s\n", name_str[i]);
+	}
+
+	return -EINVAL;
+}
+
+MDEV_TYPE_ATTR_RO(name);
+
+static ssize_t
+available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	char name[MTTY_STRING_LEN];
+	int i;
+	struct mdev_state *mds;
+	int ports = 0, used = 0;
+
+	for (i = 0; i < 2; i++) {
+		snprintf(name, MTTY_STRING_LEN, "%s-%d",
+			 dev_driver_string(dev), i + 1);
+		if (!strcmp(kobj->name, name)) {
+			ports = i + 1;
+			break;
+		}
+	}
+
+	if (!ports)
+		return -EINVAL;
+
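+	/*
+	 * Each existing mdev consumes nr_ports of the MAX_MTTYS ports, so
+	 * the remaining number of instances of this type is the free ports
+	 * divided by the ports needed per instance.
+	 */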
+	list_for_each_entry(mds, &mdev_devices_list, next)
+		used += mds->nr_ports;
+
+	return sprintf(buf, "%d\n", (MAX_MTTYS - used)/ports);
+}
+
+MDEV_TYPE_ATTR_RO(available_instances);
+
+
+static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
+			       char *buf)
+{
+	return sprintf(buf, "%s\n",
+		       vfio_device_api_string(VFIO_DEVICE_FLAGS_PCI));
+}
+
+MDEV_TYPE_ATTR_RO(device_api);
+
+static struct attribute *mdev_types_attrs[] = {
+	&mdev_type_attr_name.attr,
+	&mdev_type_attr_device_api.attr,
+	&mdev_type_attr_available_instances.attr,
+	NULL,
+};
+
+static struct attribute_group mdev_type_group1 = {
+	.name  = "1",
+	.attrs = mdev_types_attrs,
+};
+
+static struct attribute_group mdev_type_group2 = {
+	.name  = "2",
+	.attrs = mdev_types_attrs,
+};
+
+struct attribute_group *mdev_type_groups[] = {
+	&mdev_type_group1,
+	&mdev_type_group2,
+	NULL,
+};
+
+struct parent_ops mdev_fops = {
+	.owner                  = THIS_MODULE,
+	.dev_attr_groups        = mtty_dev_groups,
+	.mdev_attr_groups       = mdev_dev_groups,
+	.supported_type_groups  = mdev_type_groups,
+	.create                 = mtty_create,
+	.remove			= mtty_remove,
+	.open                   = mtty_open,
+	.release                = mtty_close,
+	.read                   = mtty_read,
+	.write                  = mtty_write,
+	.ioctl		        = mtty_ioctl,
+};
+
+static void mtty_device_release(struct device *dev)
+{
+	dev_dbg(dev, "mtty: released\n");
+}
+
+static int __init mtty_dev_init(void)
+{
+	int ret = 0;
+
+	pr_info("mtty_dev: %s\n", __func__);
+
+	memset(&mtty_dev, 0, sizeof(mtty_dev));
+
+	idr_init(&mtty_dev.vd_idr);
+
+	ret = alloc_chrdev_region(&mtty_dev.vd_devt, 0, MINORMASK, MTTY_NAME);
+
+	if (ret < 0) {
+		pr_err("Error: failed to register mtty_dev, err:%d\n", ret);
+		return ret;
+	}
+
+	cdev_init(&mtty_dev.vd_cdev, &vd_fops);
+	cdev_add(&mtty_dev.vd_cdev, mtty_dev.vd_devt, MINORMASK);
+
+	pr_info("major_number:%d\n", MAJOR(mtty_dev.vd_devt));
+
+	mtty_dev.vd_class = class_create(THIS_MODULE, MTTY_CLASS_NAME);
+
+	if (IS_ERR(mtty_dev.vd_class)) {
+		pr_err("Error: failed to register mtty_dev class\n");
+		goto failed1;
+	}
+
+	mtty_dev.dev.class = mtty_dev.vd_class;
+	mtty_dev.dev.release = mtty_device_release;
+	dev_set_name(&mtty_dev.dev, "%s", MTTY_NAME);
+
+	ret = device_register(&mtty_dev.dev);
+	if (ret)
+		goto failed2;
+
+	if (mdev_register_device(&mtty_dev.dev, &mdev_fops) != 0)
+		goto failed3;
+
+	mutex_init(&mdev_list_lock);
+	INIT_LIST_HEAD(&mdev_devices_list);
+
+	goto all_done;
+
+failed3:
+
+	device_unregister(&mtty_dev.dev);
+failed2:
+	class_destroy(mtty_dev.vd_class);
+
+failed1:
+	cdev_del(&mtty_dev.vd_cdev);
+	unregister_chrdev_region(mtty_dev.vd_devt, MINORMASK);
+
+all_done:
+	return ret;
+}
+
+static void __exit mtty_dev_exit(void)
+{
+	mtty_dev.dev.bus = NULL;
+	mdev_unregister_device(&mtty_dev.dev);
+
+	device_unregister(&mtty_dev.dev);
+	idr_destroy(&mtty_dev.vd_idr);
+	cdev_del(&mtty_dev.vd_cdev);
+	unregister_chrdev_region(mtty_dev.vd_devt, MINORMASK);
+	class_destroy(mtty_dev.vd_class);
+	mtty_dev.vd_class = NULL;
+	pr_info("mtty_dev: Unloaded!\n");
+}
+
+module_init(mtty_dev_init)
+module_exit(mtty_dev_exit)
+
+MODULE_LICENSE("GPL");
+MODULE_INFO(supported, "Test driver that simulates a serial port over PCI");
+MODULE_VERSION(VERSION_STRING);
+MODULE_AUTHOR(DRIVER_AUTHOR);
diff --git a/Documentation/vfio-mdev/vfio-mediated-device.txt b/Documentation/vfio-mdev/vfio-mediated-device.txt
index 8746e88dca4d..1b99db97a6eb 100644
--- a/Documentation/vfio-mdev/vfio-mediated-device.txt
+++ b/Documentation/vfio-mdev/vfio-mediated-device.txt
@@ -279,11 +279,111 @@ these callbacks are supported in the TYPE1 IOMMU module. To enable them for
 other IOMMU backend modules, such as the PPC64 sPAPR module, those modules
 need to provide these two callback functions.
 
+Using the Sample Code
+=====================
+
+mtty.c in this folder is a sample driver program to demonstrate how to use the
+mediated device framework.
+
+The sample driver creates an mdev device that simulates a serial port over a PCI
+card.
+
+1. Build and load the mtty.ko module.
+
+   This step creates a dummy device, /sys/devices/virtual/mtty/mtty/
+
+   Files in this device directory in sysfs are similar to the following:
+
+   # tree /sys/devices/virtual/mtty/mtty/
+      /sys/devices/virtual/mtty/mtty/
+      |-- mdev_supported_types
+      |   |-- mtty-1
+      |   |   |-- available_instances
+      |   |   |-- create
+      |   |   |-- device_api
+      |   |   |-- devices
+      |   |   `-- name
+      |   `-- mtty-2
+      |       |-- available_instances
+      |       |-- create
+      |       |-- device_api
+      |       |-- devices
+      |       `-- name
+      |-- mtty_dev
+      |   `-- sample_mtty_dev
+      |-- power
+      |   |-- autosuspend_delay_ms
+      |   |-- control
+      |   |-- runtime_active_time
+      |   |-- runtime_status
+      |   `-- runtime_suspended_time
+      |-- subsystem -> ../../../../class/mtty
+      `-- uevent
+
+2. Create a mediated device by using the dummy device that you created in the
+   previous step.
+
+   # echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" >	\
+              /sys/devices/virtual/mtty/mtty/mdev_supported_types/mtty-2/create
+
+3. Add parameters to qemu-kvm.
+
+   -device vfio-pci,\
+    sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
+
+4. Boot the VM.
+
+   In the Linux guest VM, with no hardware on the host, the device appears
+   as follows:
+
+   # lspci -s 00:05.0 -xxvv
+   00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550])
+           Subsystem: Device 4348:3253
+           Physical Slot: 5
+           Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
+   Stepping- SERR- FastB2B- DisINTx-
+           Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
+   <TAbort- <MAbort- >SERR- <PERR- INTx-
+           Interrupt: pin A routed to IRQ 10
+           Region 0: I/O ports at c150 [size=8]
+           Region 1: I/O ports at c158 [size=8]
+           Kernel driver in use: serial
+   00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00
+   10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00
+   20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32
+   30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00
+
+   In the Linux guest VM, dmesg output for the device is as follows:
+
+   serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ
+10
+   0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
+   0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A
+
+
+5. In the Linux guest VM, check the serial ports.
+
+   # setserial -g /dev/ttyS*
+   /dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
+   /dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10
+   /dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10
+
+6. Using minicom or any terminal emulation program, open port /dev/ttyS1 or
+   /dev/ttyS2 with hardware flow control disabled.
+
+7. Type data on the minicom terminal or send data to the terminal emulation
+   program and read the data.
+
+   Data is looped back from the host's mtty driver.
+
+8. Destroy the mediated device that you created.
+
+   # echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001/remove
+
 References
-----------
+==========
 
 [1] See Documentation/vfio.txt for more information on VFIO.
 [2] struct mdev_driver in include/linux/mdev.h
 [3] struct parent_ops in include/linux/mdev.h
 [4] struct vfio_iommu_driver_ops in include/linux/vfio.h
-
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 00/12] Add Mediated device support
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
                   ` (11 preceding siblings ...)
  2016-10-17 21:22 ` [PATCH v9 12/12] docs: Sample driver to demonstrate how to use Mediated device framework Kirti Wankhede
@ 2016-10-17 21:41 ` Alex Williamson
  2016-10-24  7:07 ` Jike Song
  13 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-17 21:41 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Tue, 18 Oct 2016 02:52:00 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> This series adds Mediated device support to Linux host kernel. Purpose
> of this series is to provide a common interface for mediated device
> management that can be used by different devices. This series introduces
> Mdev core module that creates and manages mediated devices, VFIO based
> driver for mediated devices that are created by mdev core module and
> update VFIO type1 IOMMU module to support pinning & unpinning for mediated
> devices.
> 
> What changed in v9?
> mdev-core:
> - added class named 'mdev_bus' that contains links to devices that are
>   registered with the mdev core driver.
> - The [<type-id>] name is created by adding the the device driver string as a
>   prefix to the string provided by the vendor driver.
> - 'device_api' attribute should be provided by vendor driver and should show
>    which device API is being created, for example, "vfio-pci" for a PCI device.
> - Renamed link to its type in mdev device directory to 'mdev_type'
> 
> vfio:
> - Split commits in multple individual commits
> - Added function to get device_api string based on vfio_device_info.flags.
> 
> vfio_iommu_type1:
> - Handled the case if all devices attached to the normal IOMMU API domain
>   go away and mdev device still exist in domain. Updated page accounting
>   for local domain.
> - Similarly if device is attached to normal IOMMU API domain, mappings are
>   establised and page accounting is updated accordingly.
> - Tested hot-plug and hot-unplug of vGPU and GPU pass through device with
>   Linux VM.

Hi,

I also commented that there must be an invalidation mechanism for pages
pinned by the vendor driver.  This is where pfn pinning was adjusting
accounting after a DMA_MAP, where the pfn should have been invalidated
on user unmap.  Userspace is in control of page mappings, the vendor
driver cannot maintain references to pages unmapped by the user.  I
would suggest that minimally some sort of callback needs to be
registered for every set of pinned pages to be called when the user
unmaps those IOVAs.  Thanks,

Alex

> 
> Documentation:
> - Updated Documentation and sample driver, mtty.c, accordingly.
> 
> Kirti Wankhede (12):
>   vfio: Mediated device Core driver
>   vfio: VFIO based driver for Mediated devices
>   vfio: Rearrange functions to get vfio_group from dev
>   vfio iommu: Add support for mediated devices
>   vfio: Introduce common function to add capabilities
>   vfio_pci: Update vfio_pci to use vfio_info_add_capability()
>   vfio: Introduce vfio_set_irqs_validate_and_prepare()
>   vfio_pci: Updated to use vfio_set_irqs_validate_and_prepare()
>   vfio_platform: Updated to use vfio_set_irqs_validate_and_prepare()
>   vfio: Add function to get device_api string from
>     vfio_device_info.flags
>   docs: Add Documentation for Mediated devices
>   docs: Sample driver to demonstrate how to use Mediated device
>     framework.
> 
>  Documentation/vfio-mdev/Makefile                 |   13 +
>  Documentation/vfio-mdev/mtty.c                   | 1429 ++++++++++++++++++++++
>  Documentation/vfio-mdev/vfio-mediated-device.txt |  389 ++++++
>  drivers/vfio/Kconfig                             |    1 +
>  drivers/vfio/Makefile                            |    1 +
>  drivers/vfio/mdev/Kconfig                        |   18 +
>  drivers/vfio/mdev/Makefile                       |    5 +
>  drivers/vfio/mdev/mdev_core.c                    |  372 ++++++
>  drivers/vfio/mdev/mdev_driver.c                  |  128 ++
>  drivers/vfio/mdev/mdev_private.h                 |   41 +
>  drivers/vfio/mdev/mdev_sysfs.c                   |  296 +++++
>  drivers/vfio/mdev/vfio_mdev.c                    |  148 +++
>  drivers/vfio/pci/vfio_pci.c                      |  101 +-
>  drivers/vfio/platform/vfio_platform_common.c     |   31 +-
>  drivers/vfio/vfio.c                              |  287 ++++-
>  drivers/vfio/vfio_iommu_type1.c                  |  692 +++++++++--
>  include/linux/mdev.h                             |  177 +++
>  include/linux/vfio.h                             |   23 +-
>  18 files changed, 3948 insertions(+), 204 deletions(-)
>  create mode 100644 Documentation/vfio-mdev/Makefile
>  create mode 100644 Documentation/vfio-mdev/mtty.c
>  create mode 100644 Documentation/vfio-mdev/vfio-mediated-device.txt
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev_core.c
>  create mode 100644 drivers/vfio/mdev/mdev_driver.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>  create mode 100644 drivers/vfio/mdev/vfio_mdev.c
>  create mode 100644 include/linux/mdev.h
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 12/12] docs: Sample driver to demonstrate how to use Mediated device framework.
       [not found]   ` <20161018025411.GA22572@bjsdjshi@linux.vnet.ibm.com>
@ 2016-10-18 17:17     ` Alex Williamson
  2016-10-19 19:19       ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2016-10-18 17:17 UTC (permalink / raw)
  To: Dong Jia Shi
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, jike.song, linux-kernel

On Tue, 18 Oct 2016 10:54:11 +0800
Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com> wrote:

> * Kirti Wankhede <kwankhede@nvidia.com> [2016-10-18 02:52:12 +0530]:
> 
> ...snip...
> 
> > +static ssize_t mdev_access(struct mdev_device *mdev, char *buf,
> > +		size_t count, loff_t pos, bool is_write)
> > +{
> > +	struct mdev_state *mdev_state;
> > +	unsigned int index;
> > +	loff_t offset;
> > +	int ret = 0;
> > +
> > +	if (!mdev || !buf)
> > +		return -EINVAL;
> > +
> > +	mdev_state = mdev_get_drvdata(mdev);
> > +	if (!mdev_state) {
> > +		pr_err("%s mdev_state not found\n", __func__);
> > +		return -EINVAL;
> > +	}
> > +
> > +	mutex_lock(&mdev_state->ops_lock);
> > +
> > +	index = MTTY_VFIO_PCI_OFFSET_TO_INDEX(pos);
> > +	offset = pos & MTTY_VFIO_PCI_OFFSET_MASK;
> > +	switch (index) {
> > +	case VFIO_PCI_CONFIG_REGION_INDEX:
> > +
> > +#if defined(DEBUG)
> > +		pr_info("%s: PCI config space %s at offset 0x%llx\n",
> > +			 __func__, is_write ? "write" : "read", offset);
> > +#endif
> > +		if (is_write) {
> > +			dump_buffer(buf, count);
> > +			handle_pci_cfg_write(mdev_state, offset, buf, count);
> > +		} else {
> > +			memcpy(buf, (mdev_state->vconfig + offset), count);
> > +			dump_buffer(buf, count);  
> Dear Kirti:
> 
> Shouldn't we use copy_from_user instead of memcpy on @buf here? And I'm
> wondering if dump_buffer could really work since it tries to dereference
> a *__user* marked pointer.

I agree, the __user attribute is getting lost here and we're operating
on user buffers as if they were kernel buffers.  That's a bug.  Thanks,

Alex
 
> Otherwise, this is a good example driver. Thanks!
> 
> > +		}
> > +
> > +		break;
> > +
> > +	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> > +		if (!mdev_state->region_info[index].start)
> > +			mdev_read_base(mdev_state);
> > +
> > +		if (is_write) {
> > +			dump_buffer(buf, count);
> > +
> > +#if defined(DEBUG_REGS)
> > +			pr_info("%s: BAR%d  WR @0x%llx %s val:0x%02x dlab:%d\n",
> > +				__func__, index, offset, wr_reg[offset],
> > +				(u8)*buf, mdev_state->s[index].dlab);
> > +#endif
> > +			handle_bar_write(index, mdev_state, offset, buf, count);
> > +		} else {
> > +			handle_bar_read(index, mdev_state, offset, buf, count);
> > +			dump_buffer(buf, count);
> > +
> > +#if defined(DEBUG_REGS)
> > +			pr_info("%s: BAR%d  RD @0x%llx %s val:0x%02x dlab:%d\n",
> > +				__func__, index, offset, rd_reg[offset],
> > +				(u8)*buf, mdev_state->s[index].dlab);
> > +#endif
> > +		}
> > +		break;
> > +
> > +	default:
> > +		ret = -1;
> > +		goto accessfailed;
> > +	}
> > +
> > +	ret = count;
> > +
> > +
> > +accessfailed:
> > +	mutex_unlock(&mdev_state->ops_lock);
> > +
> > +	return ret;
> > +}
> > +  
> ...snip...
> 
> > +ssize_t mtty_read(struct mdev_device *mdev, char __user *buf,
> > +			size_t count, loff_t *ppos)
> > +{
> > +	return mdev_access(mdev, buf, count, *ppos, false);
> > +}
> > +
> > +ssize_t mtty_write(struct mdev_device *mdev, const char __user *buf,
> > +			 size_t count, loff_t *ppos)
> > +{
> > +	return mdev_access(mdev, (char *)buf, count, *ppos, true);
> > +}
> > +  
> ...snip...
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 01/12] vfio: Mediated device Core driver
  2016-10-17 21:22 ` [PATCH v9 01/12] vfio: Mediated device Core driver Kirti Wankhede
@ 2016-10-18 23:16   ` Alex Williamson
  2016-10-19 19:16     ` Kirti Wankhede
  2016-10-20  7:23   ` Jike Song
  2016-10-26  6:52   ` Tian, Kevin
  2 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2016-10-18 23:16 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Tue, 18 Oct 2016 02:52:01 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Design for Mediated Device Driver:
> Main purpose of this driver is to provide a common interface for mediated
> device management that can be used by different drivers of different
> devices.
> 
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
> 
> Below is the high Level block diagram, with Nvidia, Intel and IBM devices
> as example, since these are the devices which are going to actively use
> this module as of now.
> 
>  +---------------+
>  |               |
>  | +-----------+ |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |  mdev     | |                         |              |
>  | |  bus      | +------------------------>+              |<-> VFIO user
>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>  | |           | |                         |              |
>  | +-----------+ |                         +--------------+
>  |               |
>  |  MDEV CORE    |
>  |   MODULE      |
>  |   mdev.ko     |
>  | +-----------+ |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         |  nvidia.ko   |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | | Physical  | |
>  | |  device   | |  mdev_register_device() +--------------+
>  | | interface | |<------------------------+              |
>  | |           | |                         |  i915.ko     |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | |           | |
>  | |           | |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         | ccw_device.ko|<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | +-----------+ |
>  +---------------+
> 
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
> 
> /**
>   * struct mdev_driver - Mediated device's driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove:called when device removed
>   * @driver:device driver structure
>   *
>   **/
> struct mdev_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
>          struct device_driver    driver;
> };
> 
> Mediated bus driver for mdev device should use this interface to register
> and unregister with core driver respectively:
> 
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
> 
> Mediated bus driver is responsible for adding/deleting mediated devices
> to/from the VFIO group when devices are bound and unbound to the driver.
> 
> 2. Physical device driver interface
> This interface provides vendor driver the set APIs to manage physical
> device related work in its driver. APIs are :
> 
> * dev_attr_groups: attributes of the parent device.
> * mdev_attr_groups: attributes of the mediated device.
> * supported_type_groups: attributes to define supported type. This is
> 			 mandatory field.
> * create: to allocate basic resources in driver for a mediated device.
> * remove: to free resources in driver when mediated device is destroyed.
> * open: open callback of mediated device
> * release: release callback of mediated device
> * read : read emulation callback.
> * write: write emulation callback.
> * mmap: mmap emulation callback.
> * ioctl: ioctl callback.
> 
> Drivers should use these interfaces to register and unregister device to
> mdev core driver respectively:
> 
> extern int  mdev_register_device(struct device *dev,
>                                  const struct parent_ops *ops);
> extern void mdev_unregister_device(struct device *dev);
> 
> There are no locks to serialize above callbacks in mdev driver and
> vfio_mdev driver. If required, vendor driver can have locks to serialize
> above APIs in their driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I73a5084574270b14541c529461ea2f03c292d510
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/Makefile            |   1 +
>  drivers/vfio/mdev/Kconfig        |  11 ++
>  drivers/vfio/mdev/Makefile       |   4 +
>  drivers/vfio/mdev/mdev_core.c    | 372 +++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev_driver.c  | 128 ++++++++++++++
>  drivers/vfio/mdev/mdev_private.h |  41 +++++
>  drivers/vfio/mdev/mdev_sysfs.c   | 296 +++++++++++++++++++++++++++++++
>  include/linux/mdev.h             | 177 +++++++++++++++++++
>  9 files changed, 1031 insertions(+)
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev_core.c
>  create mode 100644 drivers/vfio/mdev/mdev_driver.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>  create mode 100644 include/linux/mdev.h
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
>  
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>  source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..4a23c13b6be4 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_VFIO_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..93addace9a67
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,11 @@
> +
> +config VFIO_MDEV
> +    tristate "Mediated device driver framework"
> +    depends on VFIO
> +    default n
> +    help
> +        Provides a framework to virtualize devices which don't have SR_IOV
> +	capability built-in.
> +	See Documentation/vfio-mdev/vfio-mediated-device.txt for more details.
> +
> +        If you don't know what to do here, say N.
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..31bc04801d94
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,4 @@
> +
> +mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
> +
> +obj-$(CONFIG_VFIO_MDEV) += mdev.o
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> new file mode 100644
> index 000000000000..7db5ec164aeb
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -0,0 +1,372 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +static LIST_HEAD(parent_list);
> +static DEFINE_MUTEX(parent_list_lock);
> +static struct class_compat *mdev_bus_compat_class;
> +
> +static int _find_mdev_device(struct device *dev, void *data)
> +{
> +	struct mdev_device *mdev;
> +
> +	if (!dev_is_mdev(dev))
> +		return 0;
> +
> +	mdev = to_mdev_device(dev);
> +
> +	if (uuid_le_cmp(mdev->uuid, *(uuid_le *)data) == 0)
> +		return 1;
> +
> +	return 0;
> +}
> +
> +static struct mdev_device *__find_mdev_device(struct parent_device *parent,
> +					      uuid_le uuid)
> +{
> +	struct device *dev;
> +
> +	dev = device_find_child(parent->dev, &uuid, _find_mdev_device);
> +	if (!dev)
> +		return NULL;
> +
> +	put_device(dev);
> +
> +	return to_mdev_device(dev);
> +}

This function is only used by mdev_device_create() for the purpose of
checking whether a given uuid for a parent already exists, so the
returned device is not actually used.  However, at the point where
we're using to_mdev_device() here, we don't actually hold a reference to
the device, so that function call and any possible use of the returned
pointer by the caller is invalid.  I would either turn this into a
"get" function where the caller holds a device reference and needs to do
a "put" on it, or change this to an "exists" test where true/false is
returned and the function cannot later be mis-used to do a device
lookup where the reference isn't actually valid.
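
For instance, the "exists" variant could look something like this
(untested sketch, reusing the existing _find_mdev_device() match
callback):

static bool mdev_device_exist(struct parent_device *parent, uuid_le uuid)
{
	struct device *dev;

	dev = device_find_child(parent->dev, &uuid, _find_mdev_device);
	if (dev) {
		/* device_find_child() took a reference, drop it */
		put_device(dev);
		return true;
	}

	return false;
}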

> +
> +/* Should be called holding parent_list_lock */
> +static struct parent_device *__find_parent_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +
> +	list_for_each_entry(parent, &parent_list, next) {
> +		if (parent->dev == dev)
> +			return parent;
> +	}
> +	return NULL;
> +}
> +
> +static void mdev_release_parent(struct kref *kref)
> +{
> +	struct parent_device *parent = container_of(kref, struct parent_device,
> +						    ref);
> +	struct device *dev = parent->dev;
> +
> +	kfree(parent);
> +	put_device(dev);
> +}
> +
> +static
> +inline struct parent_device *mdev_get_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_get(&parent->ref);
> +
> +	return parent;
> +}
> +
> +static inline void mdev_put_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_put(&parent->ref, mdev_release_parent);
> +}
> +
> +static int mdev_device_create_ops(struct kobject *kobj,
> +				  struct mdev_device *mdev)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	ret = parent->ops->create(kobj, mdev);
> +	if (ret)
> +		return ret;
> +
> +	ret = sysfs_create_groups(&mdev->dev.kobj,
> +				  parent->ops->mdev_attr_groups);
> +	if (ret)
> +		parent->ops->remove(mdev);
> +
> +	return ret;
> +}
> +
> +static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	/*
> +	 * Vendor driver can return error if VMM or userspace application is
> +	 * using this mdev device.
> +	 */
> +	ret = parent->ops->remove(mdev);
> +	if (ret && !force_remove)
> +		return -EBUSY;
> +
> +	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
> +	return 0;
> +}
> +
> +static int mdev_device_remove_cb(struct device *dev, void *data)
> +{
> +	return mdev_device_remove(dev, data ? *(bool *)data : true);
> +}
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	/* check for mandatory ops */
> +	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
> +		return -EINVAL;
> +
> +	dev = get_device(dev);
> +	if (!dev)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_list_lock);
> +
> +	/* Check for duplicate */
> +	parent = __find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +
> +	ret = parent_create_sysfs_files(parent);
> +	if (ret) {
> +		mutex_unlock(&parent_list_lock);
> +		mdev_put_parent(parent);
> +		return ret;
> +	}
> +
> +	ret = class_compat_create_link(mdev_bus_compat_class, dev, NULL);
> +	if (ret)
> +		dev_warn(dev, "Failed to create compatibility class link\n");
> +
> +	list_add(&parent->next, &parent_list);
> +	mutex_unlock(&parent_list_lock);
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_list_lock);
> +	put_device(dev);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a parent device
> + * @dev: device structure representing parent device.
> + *
> + * Remove device from list of registered parent devices. Give a chance to free
> + * existing mediated devices for given device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +	bool force_remove = true;
> +
> +	mutex_lock(&parent_list_lock);
> +	parent = __find_parent_device(dev);
> +
> +	if (!parent) {
> +		mutex_unlock(&parent_list_lock);
> +		return;
> +	}
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	/*
> +	 * Remove parent from the list and remove "mdev_supported_types"
> +	 * sysfs files so that no new mediated device could be
> +	 * created for this parent
> +	 */
> +	list_del(&parent->next);
> +	parent_remove_sysfs_files(parent);
> +
> +	mutex_unlock(&parent_list_lock);
> +
> +	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> +
> +	device_for_each_child(dev, (void *)&force_remove,
> +			      mdev_device_remove_cb);
> +	mdev_put_parent(parent);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev);
> +}
> +
> +int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
> +{
> +	int ret;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	struct mdev_type *type = to_mdev_type(kobj);
> +
> +	parent = mdev_get_parent(type->parent);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	/* Check for duplicate */
> +	mdev = __find_mdev_device(parent, uuid);
> +	if (mdev) {
> +		ret = -EEXIST;
> +		goto create_err;
> +	}

We check here whether the {parent,uuid} already exists, but what
prevents us racing with another create call with the same uuid?  ie.
neither exists at this point.  Will device_register() fail if the
device name already exists?  If so, should we just rely on the error
there and skip this duplicate check?  If not, we need a mutex to avoid
the race.
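
If the duplicate check is kept, a mutex around the lookup and the
registration would close the race, roughly like this (sketch;
mdev_list_lock is a hypothetical mutex and mdev_device_exist() is the
"exists" helper sketched above):

	mutex_lock(&mdev_list_lock);

	if (mdev_device_exist(parent, uuid)) {
		mutex_unlock(&mdev_list_lock);
		ret = -EEXIST;
		goto create_err;
	}

	/* ... allocate and initialize mdev as above ... */

	ret = device_register(&mdev->dev);

	mutex_unlock(&mdev_list_lock);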

> +
> +	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> +	if (!mdev) {
> +		ret = -ENOMEM;
> +		goto create_err;
> +	}
> +
> +	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> +	mdev->parent = parent;
> +	kref_init(&mdev->ref);
> +
> +	mdev->dev.parent  = dev;
> +	mdev->dev.bus     = &mdev_bus_type;
> +	mdev->dev.release = mdev_device_release;
> +	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> +
> +	ret = device_register(&mdev->dev);
> +	if (ret) {
> +		put_device(&mdev->dev);
> +		goto create_err;
> +	}
> +
> +	ret = mdev_device_create_ops(kobj, mdev);
> +	if (ret)
> +		goto create_failed;
> +
> +	ret = mdev_create_sysfs_files(&mdev->dev, type);
> +	if (ret) {
> +		mdev_device_remove_ops(mdev, true);
> +		goto create_failed;
> +	}
> +
> +	mdev->type_kobj = kobj;
> +	dev_dbg(&mdev->dev, "MDEV: created\n");
> +
> +	return ret;
> +
> +create_failed:
> +	device_unregister(&mdev->dev);
> +
> +create_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_remove(struct device *dev, bool force_remove)
> +{
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	struct mdev_type *type;
> +	int ret = 0;
> +
> +	if (!dev_is_mdev(dev))
> +		return 0;
> +
> +	mdev = to_mdev_device(dev);
> +	parent = mdev->parent;
> +	type = to_mdev_type(mdev->type_kobj);
> +
> +	ret = mdev_device_remove_ops(mdev, force_remove);
> +	if (ret)
> +		return ret;
> +
> +	mdev_remove_sysfs_files(dev, type);
> +	device_unregister(dev);
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		return ret;
> +	}
> +
> +	mdev_bus_compat_class = class_compat_register("mdev_bus");
> +	if (!mdev_bus_compat_class) {
> +		mdev_bus_unregister();
> +		return -ENOMEM;
> +	}
> +
> +	/*
> +	 * Attempt to load known vfio_mdev.  This gives us a working environment
> +	 * without the user needing to explicitly load vfio_mdev driver.
> +	 */
> +	request_module_nowait("vfio_mdev");
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	class_compat_unregister(mdev_bus_compat_class);
> +	mdev_bus_unregister();
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
> new file mode 100644
> index 000000000000..7768ef87f528
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_driver.c
> @@ -0,0 +1,128 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdev_attach_iommu(struct mdev_device *mdev)
> +{
> +	int ret;
> +	struct iommu_group *group;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	ret = iommu_group_add_device(group, &mdev->dev);
> +	if (ret) {
> +		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return ret;
> +}
> +
> +static void mdev_detach_iommu(struct mdev_device *mdev)
> +{
> +	iommu_group_remove_device(&mdev->dev);
> +	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdev_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	ret = mdev_attach_iommu(mdev);
> +	if (ret) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return ret;
> +	}
> +
> +	if (drv && drv->probe)
> +		ret = drv->probe(dev);
> +
> +	if (ret)
> +		mdev_detach_iommu(mdev);
> +
> +	return ret;
> +}
> +
> +static int mdev_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdev_detach_iommu(mdev);
> +
> +	return 0;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.probe		= mdev_probe,
> +	.remove		= mdev_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/*
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: module owner of driver to be registered
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/*
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..000c93fcfdbd
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,41 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +struct mdev_type {
> +	struct kobject kobj;
> +	struct kobject *devices_kobj;
> +	struct parent_device *parent;
> +	struct list_head next;
> +	struct attribute_group *group;
> +};
> +
> +#define to_mdev_type_attr(_attr)	\
> +	container_of(_attr, struct mdev_type_attribute, attr)
> +#define to_mdev_type(_kobj)		\
> +	container_of(_kobj, struct mdev_type, kobj)
> +
> +int  parent_create_sysfs_files(struct parent_device *parent);
> +void parent_remove_sysfs_files(struct parent_device *parent);
> +
> +int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type);
> +void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
> +
> +int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid);
> +int  mdev_device_remove(struct device *dev, bool force_remove);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> new file mode 100644
> index 000000000000..426e35cf79d0
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -0,0 +1,296 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Static functions */
> +
> +static ssize_t mdev_type_attr_show(struct kobject *kobj,
> +				     struct attribute *__attr, char *buf)
> +{
> +	struct mdev_type_attribute *attr = to_mdev_type_attr(__attr);
> +	struct mdev_type *type = to_mdev_type(kobj);
> +	ssize_t ret = -EIO;
> +
> +	if (attr->show)
> +		ret = attr->show(kobj, type->parent->dev, buf);
> +	return ret;
> +}
> +
> +static ssize_t mdev_type_attr_store(struct kobject *kobj,
> +				      struct attribute *__attr,
> +				      const char *buf, size_t count)
> +{
> +	struct mdev_type_attribute *attr = to_mdev_type_attr(__attr);
> +	struct mdev_type *type = to_mdev_type(kobj);
> +	ssize_t ret = -EIO;
> +
> +	if (attr->store)
> +		ret = attr->store(&type->kobj, type->parent->dev, buf, count);
> +	return ret;
> +}
> +
> +static const struct sysfs_ops mdev_type_sysfs_ops = {
> +	.show = mdev_type_attr_show,
> +	.store = mdev_type_attr_store,
> +};
> +
> +static ssize_t create_store(struct kobject *kobj, struct device *dev,
> +			    const char *buf, size_t count)
> +{
> +	char *str;
> +	uuid_le uuid;
> +	int ret;
> +
> +	if (count < UUID_STRING_LEN)
> +		return -EINVAL;


Can't we also test for something unreasonably large?
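
For example (sketch; UUID_STRING_LEN is 36, so anything longer than the
UUID string plus a trailing newline looks suspect):

	if (count < UUID_STRING_LEN || count > UUID_STRING_LEN + 1)
		return -EINVAL;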


> +
> +	str = kstrndup(buf, count, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	ret = uuid_le_to_bin(str, &uuid);

nit, we can kfree(str) here regardless of the return.
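
i.e. roughly:

	ret = uuid_le_to_bin(str, &uuid);
	kfree(str);
	if (ret)
		return ret;
	...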

> +	if (!ret) {
> +
> +		ret = mdev_device_create(kobj, dev, uuid);
> +		if (ret)
> +			pr_err("mdev_create: Failed to create mdev device\n");

What value does this pr_err add?  It doesn't tell us why it failed and
the user will already know it failed by the return value of their write.

> +		else
> +			ret = count;
> +	}
> +
> +	kfree(str);
> +	return ret;
> +}
> +
> +MDEV_TYPE_ATTR_WO(create);
> +
> +static void mdev_type_release(struct kobject *kobj)
> +{
> +	struct mdev_type *type = to_mdev_type(kobj);
> +
> +	pr_debug("Releasing group %s\n", kobj->name);
> +	kfree(type);
> +}
> +
> +static struct kobj_type mdev_type_ktype = {
> +	.sysfs_ops = &mdev_type_sysfs_ops,
> +	.release = mdev_type_release,
> +};
> +
> +struct mdev_type *add_mdev_supported_type(struct parent_device *parent,
> +					  struct attribute_group *group)
> +{
> +	struct mdev_type *type;
> +	int ret;
> +
> +	if (!group->name) {
> +		pr_err("%s: Type name empty!\n", __func__);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	type = kzalloc(sizeof(*type), GFP_KERNEL);
> +	if (!type)
> +		return ERR_PTR(-ENOMEM);
> +
> +	type->kobj.kset = parent->mdev_types_kset;
> +
> +	ret = kobject_init_and_add(&type->kobj, &mdev_type_ktype, NULL,
> +				   "%s-%s", dev_driver_string(parent->dev),
> +				   group->name);
> +	if (ret) {
> +		kfree(type);
> +		return ERR_PTR(ret);
> +	}
> +
> +	ret = sysfs_create_file(&type->kobj, &mdev_type_attr_create.attr);
> +	if (ret)
> +		goto attr_create_failed;
> +
> +	type->devices_kobj = kobject_create_and_add("devices", &type->kobj);
> +	if (!type->devices_kobj) {
> +		ret = -ENOMEM;
> +		goto attr_devices_failed;
> +	}
> +
> +	ret = sysfs_create_files(&type->kobj,
> +				 (const struct attribute **)group->attrs);
> +	if (ret) {
> +		ret = -ENOMEM;
> +		goto attrs_failed;
> +	}
> +
> +	type->group = group;
> +	type->parent = parent;
> +	return type;
> +
> +attrs_failed:
> +	kobject_put(type->devices_kobj);
> +attr_devices_failed:
> +	sysfs_remove_file(&type->kobj, &mdev_type_attr_create.attr);
> +attr_create_failed:
> +	kobject_del(&type->kobj);
> +	kobject_put(&type->kobj);
> +	return ERR_PTR(ret);
> +}
> +
> +static void remove_mdev_supported_type(struct mdev_type *type)
> +{
> +	sysfs_remove_files(&type->kobj,
> +			   (const struct attribute **)type->group->attrs);
> +	kobject_put(type->devices_kobj);
> +	sysfs_remove_file(&type->kobj, &mdev_type_attr_create.attr);
> +	kobject_del(&type->kobj);
> +	kobject_put(&type->kobj);
> +}
> +
> +static int add_mdev_supported_type_groups(struct parent_device *parent)
> +{
> +	int i;
> +
> +	for (i = 0; parent->ops->supported_type_groups[i]; i++) {
> +		struct mdev_type *type;
> +
> +		type = add_mdev_supported_type(parent,
> +					parent->ops->supported_type_groups[i]);
> +		if (IS_ERR(type)) {
> +			struct mdev_type *ltype, *tmp;
> +
> +			list_for_each_entry_safe(ltype, tmp, &parent->type_list,
> +						  next) {
> +				list_del(&ltype->next);
> +				remove_mdev_supported_type(ltype);
> +			}
> +			return PTR_ERR(type);
> +		}
> +		list_add(&type->next, &parent->type_list);
> +	}
> +	return 0;
> +}
> +
> +/* mdev sysfs Functions */
> +
> +void parent_remove_sysfs_files(struct parent_device *parent)
> +{
> +	struct mdev_type *type, *tmp;
> +
> +	list_for_each_entry_safe(type, tmp, &parent->type_list, next) {
> +		list_del(&type->next);
> +		remove_mdev_supported_type(type);
> +	}
> +
> +	sysfs_remove_groups(&parent->dev->kobj, parent->ops->dev_attr_groups);
> +	kset_unregister(parent->mdev_types_kset);
> +}
> +
> +int parent_create_sysfs_files(struct parent_device *parent)
> +{
> +	int ret;
> +
> +	parent->mdev_types_kset = kset_create_and_add("mdev_supported_types",
> +					       NULL, &parent->dev->kobj);
> +
> +	if (!parent->mdev_types_kset)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&parent->type_list);
> +
> +	ret = sysfs_create_groups(&parent->dev->kobj,
> +				  parent->ops->dev_attr_groups);
> +	if (ret)
> +		goto create_err;
> +
> +	ret = add_mdev_supported_type_groups(parent);
> +	if (ret)
> +		sysfs_remove_groups(&parent->dev->kobj,
> +				    parent->ops->dev_attr_groups);
> +	else
> +		return ret;
> +
> +create_err:
> +	kset_unregister(parent->mdev_types_kset);
> +	return ret;
> +}
> +
> +static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	unsigned long val;
> +
> +	if (kstrtoul(buf, 0, &val) < 0)
> +		return -EINVAL;
> +
> +	if (val && device_remove_file_self(dev, attr)) {
> +		int ret;
> +
> +		ret = mdev_device_remove(dev, false);
> +		if (ret) {
> +			device_create_file(dev, attr);
> +			return ret;
> +		}
> +	}
> +
> +	return count;
> +}
> +
> +static DEVICE_ATTR_WO(remove);
> +
> +static const struct attribute *mdev_device_attrs[] = {
> +	&dev_attr_remove.attr,
> +	NULL,
> +};
> +
> +int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type)
> +{
> +	int ret;
> +
> +	ret = sysfs_create_files(&dev->kobj, mdev_device_attrs);
> +	if (ret) {
> +		pr_err("Failed to create remove sysfs entry\n");
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_link(type->devices_kobj, &dev->kobj, dev_name(dev));
> +	if (ret) {
> +		pr_err("Failed to create symlink in types\n");
> +		goto device_link_failed;
> +	}
> +
> +	ret = sysfs_create_link(&dev->kobj, &type->kobj, "mdev_type");
> +	if (ret) {
> +		pr_err("Failed to create symlink in device directory\n");
> +		goto type_link_failed;
> +	}
> +
> +	return ret;
> +
> +type_link_failed:
> +	sysfs_remove_link(type->devices_kobj, dev_name(dev));
> +device_link_failed:
> +	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
> +	return ret;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type)
> +{
> +	sysfs_remove_link(&dev->kobj, "mdev_type");
> +	sysfs_remove_link(type->devices_kobj, dev_name(dev));
> +	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
> +
> +}
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..727209b2a67f
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,177 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +#include <uapi/linux/vfio.h>
> +
> +struct parent_device;
> +
> +/* Mediated device */
> +struct mdev_device {
> +	struct device		dev;
> +	struct parent_device	*parent;
> +	uuid_le			uuid;
> +	void			*driver_data;
> +
> +	/* internal only */
> +	struct kref		ref;
> +	struct list_head	next;
> +	struct kobject		*type_kobj;
> +};
> +
> +
> +/**
> + * struct parent_ops - Structure to be registered for each parent device to
> + * register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Attributes of the parent device.
> + * @mdev_attr_groups:	Attributes of the mediated device.
> + * @supported_type_groups: Attributes to define supported types. It is mandatory
> + *			to provide supported types.
> + * @create:		Called to allocate basic resources in parent device's
> + *			driver for a particular mediated device. It is
> + *			mandatory to provide create ops.
> + *			@kobj: kobject of type for which 'create' is called.
> + *			@mdev: mdev_device structure of the mediated device
> + *			      that is being created
> + *			Returns integer: success (0) or error (< 0)
> + * @remove:		Called to free resources in parent device's driver for
> + *			a mediated device. It is mandatory to provide 'remove'
> + *			ops.
> + *			@mdev: mdev_device device structure which is being
> + *			       destroyed
> + *			Returns integer: success (0) or error (< 0)
> + * @open:		Open mediated device.
> + *			@mdev: mediated device.
> + *			Returns integer: success (0) or error (< 0)
> + * @release:		release mediated device
> + *			@mdev: mediated device.
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number of bytes to read
> + *			@ppos: address.
> + *			Returns number of bytes read on success or error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number of bytes to be written
> + *			@ppos: address.
> + *			Returns number of bytes written on success or error.
> + * @ioctl:		IOCTL callback
> + *			@mdev: mediated device structure
> + *			@cmd: ioctl command
> + *			@arg: arguments to ioctl
> + * @mmap:		mmap callback
> + * A parent device that supports mediated devices should be registered with
> + * the mdev module along with the parent_ops structure.
> + */
> +
> +struct parent_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +	struct attribute_group **supported_type_groups;
> +
> +	int     (*create)(struct kobject *kobj, struct mdev_device *mdev);
> +	int     (*remove)(struct mdev_device *mdev);
> +	int     (*open)(struct mdev_device *mdev);
> +	void    (*release)(struct mdev_device *mdev);
> +	ssize_t (*read)(struct mdev_device *mdev, char __user *buf,
> +			size_t count, loff_t *ppos);
> +	ssize_t (*write)(struct mdev_device *mdev, const char __user *buf,
> +			 size_t count, loff_t *ppos);
> +	ssize_t (*ioctl)(struct mdev_device *mdev, unsigned int cmd,
> +			 unsigned long arg);
> +	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
> +};
> +
> +/* Parent Device */
> +struct parent_device {
> +	struct device		*dev;
> +	const struct parent_ops	*ops;
> +
> +	/* internal */
> +	struct kref		ref;
> +	struct list_head	next;
> +	struct kset *mdev_types_kset;
> +	struct list_head	type_list;
> +};
> +
> +/* interface for exporting mdev supported type attributes */
> +struct mdev_type_attribute {
> +	struct attribute attr;
> +	ssize_t (*show)(struct kobject *kobj, struct device *dev, char *buf);
> +	ssize_t (*store)(struct kobject *kobj, struct device *dev,
> +			 const char *buf, size_t count);
> +};
> +
> +#define MDEV_TYPE_ATTR(_name, _mode, _show, _store)		\
> +struct mdev_type_attribute mdev_type_attr_##_name =		\
> +	__ATTR(_name, _mode, _show, _store)
> +#define MDEV_TYPE_ATTR_RW(_name) \
> +	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_RW(_name)
> +#define MDEV_TYPE_ATTR_RO(_name) \
> +	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_RO(_name)
> +#define MDEV_TYPE_ATTR_WO(_name) \
> +	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_WO(_name)
> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;


Do we really need this NULL dev/drv behavior?  I don't see that any of
the callers can pass NULL into these.  The PCI equivalents don't
support this behavior and it doesn't seem they need to.  Thanks,

Alex


> +}
> +
> +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
> +{
> +	return mdev->driver_data;
> +}
> +
> +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
> +{
> +	mdev->driver_data = data;
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct parent_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +#endif /* MDEV_H */


* Re: [PATCH v9 03/12] vfio: Rearrange functions to get vfio_group from dev
  2016-10-17 21:22 ` [PATCH v9 03/12] vfio: Rearrange functions to get vfio_group from dev Kirti Wankhede
@ 2016-10-19 17:26   ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-19 17:26 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Tue, 18 Oct 2016 02:52:03 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Rearrange functions to have common function to increment container_users.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I1f93262bdbab75094bc24b087b29da35ba70c4c6
> ---
>  drivers/vfio/vfio.c | 57 ++++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 37 insertions(+), 20 deletions(-)

Ideally these would be two separate patches, one pulling
vfio_group_get_from_dev() out of vfio_device_get_from_dev() and
replacing the existing use, the other doing the same for
vfio_group_add_container_user().  Otherwise it looks good.  Thanks,

Alex

> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index d1d70e0b011b..2e83bdf007fe 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -480,6 +480,21 @@ static struct vfio_group *vfio_group_get_from_minor(int minor)
>  	return group;
>  }
>  
> +static struct vfio_group *vfio_group_get_from_dev(struct device *dev)
> +{
> +	struct iommu_group *iommu_group;
> +	struct vfio_group *group;
> +
> +	iommu_group = iommu_group_get(dev);
> +	if (!iommu_group)
> +		return NULL;
> +
> +	group = vfio_group_get_from_iommu(iommu_group);
> +	iommu_group_put(iommu_group);
> +
> +	return group;
> +}
> +
>  /**
>   * Device objects - create, release, get, put, search
>   */
> @@ -811,16 +826,10 @@ EXPORT_SYMBOL_GPL(vfio_add_group_dev);
>   */
>  struct vfio_device *vfio_device_get_from_dev(struct device *dev)
>  {
> -	struct iommu_group *iommu_group;
>  	struct vfio_group *group;
>  	struct vfio_device *device;
>  
> -	iommu_group = iommu_group_get(dev);
> -	if (!iommu_group)
> -		return NULL;
> -
> -	group = vfio_group_get_from_iommu(iommu_group);
> -	iommu_group_put(iommu_group);
> +	group = vfio_group_get_from_dev(dev);
>  	if (!group)
>  		return NULL;
>  
> @@ -1376,6 +1385,23 @@ static bool vfio_group_viable(struct vfio_group *group)
>  					 group, vfio_dev_viable) == 0);
>  }
>  
> +static int vfio_group_add_container_user(struct vfio_group *group)
> +{
> +	if (!atomic_inc_not_zero(&group->container_users))
> +		return -EINVAL;
> +
> +	if (group->noiommu) {
> +		atomic_dec(&group->container_users);
> +		return -EPERM;
> +	}
> +	if (!group->container->iommu_driver || !vfio_group_viable(group)) {
> +		atomic_dec(&group->container_users);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
>  static const struct file_operations vfio_device_fops;
>  
>  static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
> @@ -1685,23 +1711,14 @@ static const struct file_operations vfio_device_fops = {
>  struct vfio_group *vfio_group_get_external_user(struct file *filep)
>  {
>  	struct vfio_group *group = filep->private_data;
> +	int ret;
>  
>  	if (filep->f_op != &vfio_group_fops)
>  		return ERR_PTR(-EINVAL);
>  
> -	if (!atomic_inc_not_zero(&group->container_users))
> -		return ERR_PTR(-EINVAL);
> -
> -	if (group->noiommu) {
> -		atomic_dec(&group->container_users);
> -		return ERR_PTR(-EPERM);
> -	}
> -
> -	if (!group->container->iommu_driver ||
> -			!vfio_group_viable(group)) {
> -		atomic_dec(&group->container_users);
> -		return ERR_PTR(-EINVAL);
> -	}
> +	ret = vfio_group_add_container_user(group);
> +	if (ret)
> +		return ERR_PTR(ret);
>  
>  	vfio_group_get(group);
>  


* Re: [PATCH v9 01/12] vfio: Mediated device Core driver
  2016-10-18 23:16   ` Alex Williamson
@ 2016-10-19 19:16     ` Kirti Wankhede
  2016-10-19 22:20       ` Alex Williamson
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-19 19:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel



On 10/19/2016 4:46 AM, Alex Williamson wrote:
> On Tue, 18 Oct 2016 02:52:01 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
...
>> +static struct mdev_device *__find_mdev_device(struct parent_device *parent,
>> +					      uuid_le uuid)
>> +{
>> +	struct device *dev;
>> +
>> +	dev = device_find_child(parent->dev, &uuid, _find_mdev_device);
>> +	if (!dev)
>> +		return NULL;
>> +
>> +	put_device(dev);
>> +
>> +	return to_mdev_device(dev);
>> +}
> 
> This function is only used by mdev_device_create() for the purpose of
> checking whether a given uuid for a parent already exists, so the
> returned device is not actually used.  However, at the point where
> we're using to_mdev_device() here, we don't actually hold a reference to
> the device, so that function call and any possible use of the returned
> pointer by the caller is invalid.  I would either turn this into a
> "get" function where the caller holds a device reference and needs to do
> a "put" on it, or change this to an "exists" test where true/false is
> returned and the function cannot later be mis-used to do a device
> lookup where the reference isn't actually valid.
> 

I'll change it to return 0 if not found and -EEXIST if found.


>> +int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
>> +{
>> +	int ret;
>> +	struct mdev_device *mdev;
>> +	struct parent_device *parent;
>> +	struct mdev_type *type = to_mdev_type(kobj);
>> +
>> +	parent = mdev_get_parent(type->parent);
>> +	if (!parent)
>> +		return -EINVAL;
>> +
>> +	/* Check for duplicate */
>> +	mdev = __find_mdev_device(parent, uuid);
>> +	if (mdev) {
>> +		ret = -EEXIST;
>> +		goto create_err;
>> +	}
> 
> We check here whether the {parent,uuid} already exists, but what
> prevents us racing with another create call with the same uuid?  ie.
> neither exists at this point.  Will device_register() fail if the
> device name already exists?  If so, should we just rely on the error
> there and skip this duplicate check?  If not, we need a mutex to avoid
> the race.
>

Yes, device_register() fails if the device already exists, with the warning
below. Is it OK to dump such a warning? I think this should be fine, right?
Then we can remove the duplicate check.

If we want to avoid such a warning, we should keep the duplicate check.

[  610.847958] ------------[ cut here ]------------
[  610.855377] WARNING: CPU: 15 PID: 19839 at fs/sysfs/dir.c:31
sysfs_warn_dup+0x64/0x80
[  610.865798] sysfs: cannot create duplicate filename
'/devices/pci0000:80/0000:80:02.0/0000:83:00.0/0000:84:08.0/0000:85:00.0/83b8f4f2-509f-382f-3c1e-e6bfe0fa1234'
[  610.885101] Modules linked in:[  610.888039]  nvidia(POE)
vfio_iommu_type1 vfio_mdev mdev vfio nfsv4 dns_resolver nfs fscache
sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp
kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel
ghash_clmulni_intel aesni_intel glue_helper lrw gf128mul ablk_helper
cryptd nfsd auth_rpcgss nfs_acl lockd mei_me grace iTCO_wdt
iTCO_vendor_support mei ipmi_si pcspkr ioatdma i2c_i801 lpc_ich shpchp
i2c_smbus mfd_core ipmi_msghandler acpi_pad uinput sunrpc xfs libcrc32c
sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops ttm drm igb ahci libahci ptp libata pps_core dca
i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last
unloaded: mdev]
[  610.963835] CPU: 15 PID: 19839 Comm: bash Tainted: P           OE
4.8.0-next-20161013+ #0
[  610.973779] Hardware name: Supermicro
SYS-2027GR-5204A-NC024/X9DRG-HF, BIOS 1.0c 02/28/2013
[  610.983769]  ffffc90009323ae0 ffffffff813568bf ffffc90009323b30
0000000000000000
[  610.992867]  ffffc90009323b20 ffffffff81085511 0000001f00001000
ffff8808839ef000
[  611.001954]  ffff88108b30f900 ffff88109ae368e8 ffff88109ae580b0
ffff881099cc0818
[  611.011055] Call Trace:
[  611.015087]  [<ffffffff813568bf>] dump_stack+0x63/0x84
[  611.021784]  [<ffffffff81085511>] __warn+0xd1/0xf0
[  611.028115]  [<ffffffff8108558f>] warn_slowpath_fmt+0x5f/0x80
[  611.035379]  [<ffffffff812a4e80>] ? kernfs_path_from_node+0x50/0x60
[  611.043148]  [<ffffffff812a86c4>] sysfs_warn_dup+0x64/0x80
[  611.050109]  [<ffffffff812a87ae>] sysfs_create_dir_ns+0x7e/0x90
[  611.057481]  [<ffffffff81359891>] kobject_add_internal+0xc1/0x340
[  611.065018]  [<ffffffff81359d45>] kobject_add+0x75/0xd0
[  611.071635]  [<ffffffff81483829>] device_add+0x119/0x610
[  611.078314]  [<ffffffff81483d3a>] device_register+0x1a/0x20
[  611.085261]  [<ffffffffa03c748d>] mdev_device_create+0xdd/0x200 [mdev]
[  611.093143]  [<ffffffffa03c7768>] create_store+0xa8/0xe0 [mdev]
[  611.100385]  [<ffffffffa03c76ab>] mdev_type_attr_store+0x1b/0x30 [mdev]
[  611.108309]  [<ffffffff812a7d8a>] sysfs_kf_write+0x3a/0x50
[  611.115096]  [<ffffffff812a78bb>] kernfs_fop_write+0x10b/0x190
[  611.122231]  [<ffffffff81224e97>] __vfs_write+0x37/0x140
[  611.128817]  [<ffffffff811cea84>] ? handle_mm_fault+0x724/0xd80
[  611.135976]  [<ffffffff81225da2>] vfs_write+0xb2/0x1b0
[  611.142354]  [<ffffffff81003510>] ? syscall_trace_enter+0x1d0/0x2b0
[  611.149836]  [<ffffffff812271f5>] SyS_write+0x55/0xc0
[  611.156065]  [<ffffffff81003a47>] do_syscall_64+0x67/0x180
[  611.162734]  [<ffffffff816d41eb>] entry_SYSCALL64_slow_path+0x25/0x25
[  611.170345] ---[ end trace b05a73599da2ba3f ]---
[  611.175940] ------------[ cut here ]------------



>> +static ssize_t create_store(struct kobject *kobj, struct device *dev,
>> +			    const char *buf, size_t count)
>> +{
>> +	char *str;
>> +	uuid_le uuid;
>> +	int ret;
>> +
>> +	if (count < UUID_STRING_LEN)
>> +		return -EINVAL;
> 
> 
> Can't we also test for something unreasonably large?
> 

Ok. I'll add that check.

> 
>> +
>> +	str = kstrndup(buf, count, GFP_KERNEL);
>> +	if (!str)
>> +		return -ENOMEM;
>> +
>> +	ret = uuid_le_to_bin(str, &uuid);
> 
> nit, we can kfree(str) here regardless of the return.
> 
>> +	if (!ret) {
>> +
>> +		ret = mdev_device_create(kobj, dev, uuid);
>> +		if (ret)
>> +			pr_err("mdev_create: Failed to create mdev device\n");
> 
> What value does this pr_err add?  It doesn't tell us why it failed and
> the user will already know if failed by the return value of their write.
> 

Ok, will remove it.

>> +		else
>> +			ret = count;
>> +	}
>> +
>> +	kfree(str);
>> +	return ret;
>> +}

...

>> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
>> +{
>> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
>> +}
>> +
>> +static inline struct mdev_device *to_mdev_device(struct device *dev)
>> +{
>> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> 
> 
> Do we really need this NULL dev/drv behavior?  I don't see that any of
> the callers can pass NULL into these.  The PCI equivalents don't
> support this behavior and it doesn't seem they need to.  Thanks,
>

Ok, I'll update that.

Kirti


* Re: [PATCH v9 12/12] docs: Sample driver to demonstrate how to use Mediated device framework.
  2016-10-18 17:17     ` Alex Williamson
@ 2016-10-19 19:19       ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-19 19:19 UTC (permalink / raw)
  To: Alex Williamson, Dong Jia Shi
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	linux-kernel



On 10/18/2016 10:47 PM, Alex Williamson wrote:
> On Tue, 18 Oct 2016 10:54:11 +0800
> Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com> wrote:
> 
>> * Kirti Wankhede <kwankhede@nvidia.com> [2016-10-18 02:52:12 +0530]:
>>
>> ...snip...
>>
>>> +static ssize_t mdev_access(struct mdev_device *mdev, char *buf,
>>> +		size_t count, loff_t pos, bool is_write)
>>> +{
>>> +	struct mdev_state *mdev_state;
>>> +	unsigned int index;
>>> +	loff_t offset;
>>> +	int ret = 0;
>>> +
>>> +	if (!mdev || !buf)
>>> +		return -EINVAL;
>>> +
>>> +	mdev_state = mdev_get_drvdata(mdev);
>>> +	if (!mdev_state) {
>>> +		pr_err("%s mdev_state not found\n", __func__);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	mutex_lock(&mdev_state->ops_lock);
>>> +
>>> +	index = MTTY_VFIO_PCI_OFFSET_TO_INDEX(pos);
>>> +	offset = pos & MTTY_VFIO_PCI_OFFSET_MASK;
>>> +	switch (index) {
>>> +	case VFIO_PCI_CONFIG_REGION_INDEX:
>>> +
>>> +#if defined(DEBUG)
>>> +		pr_info("%s: PCI config space %s at offset 0x%llx\n",
>>> +			 __func__, is_write ? "write" : "read", offset);
>>> +#endif
>>> +		if (is_write) {
>>> +			dump_buffer(buf, count);
>>> +			handle_pci_cfg_write(mdev_state, offset, buf, count);
>>> +		} else {
>>> +			memcpy(buf, (mdev_state->vconfig + offset), count);
>>> +			dump_buffer(buf, count);  
>> Dear Kirti:
>>
>> Shouldn't we use copy_from_user instead of memcpy on @buf here? And I'm
>> wondering if dump_buffer could really work since it tries to dereference
>> a *__user* marked pointer.
> 
> I agree, the __user attribute is getting lost here and we're operating
> on user buffers as if they were kernel buffers.  That's a bug.  Thanks,
> 

Oh, yes. Thanks for catching that. While porting all the changes from v7
to v8 I missed updating these. I'll have this corrected.
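
e.g. the write path should bounce the data through a kernel buffer before
calling mdev_access(), something along these lines (untested sketch, only
showing the 1-byte case; mtty_write stands in for the sample's write
callback):

static ssize_t mtty_write(struct mdev_device *mdev, const char __user *buf,
			  size_t count, loff_t *ppos)
{
	u8 val;

	/* copy user data into a kernel buffer first */
	if (copy_from_user(&val, buf, sizeof(val)))
		return -EFAULT;

	return mdev_access(mdev, (char *)&val, sizeof(val), *ppos, true);
}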

Kirti.


* Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-17 21:22 ` [PATCH v9 04/12] vfio iommu: Add support for mediated devices Kirti Wankhede
@ 2016-10-19 21:02   ` Alex Williamson
  2016-10-20 20:17     ` Kirti Wankhede
                       ` (2 more replies)
  2016-10-21  7:49   ` Jike Song
  2016-10-27  7:20   ` [Qemu-devel] " Alexey Kardashevskiy
  2 siblings, 3 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-19 21:02 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Tue, 18 Oct 2016 02:52:04 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
> Mediated device only uses IOMMU APIs, the underlying hardware can be
> managed by an IOMMU domain.
> 
> Aim of this change is:
> - To use most of the code of TYPE1 IOMMU driver for mediated devices
> - To support direct assigned device and mediated device in single module
> 
> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
> IOMMU module that supports pinning and unpinning pages for mdev devices
> should provide these functions.
> Added APIs for pinning and unpinning pages to the VFIO module. These call back
> into backend iommu module to actually pin and unpin pages.
> 
> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
> backend module. More details:
> - When iommu_group of mediated devices is attached, task structure is
>   cached which is used later to pin pages and page accounting.
> - It keeps track of pinned pages for mediated domain. This data is used to
>   verify unpinning request and to unpin remaining pages while detaching, if
>   there are any.
> - Used existing mechanism for page accounting. If iommu capable domain
>   exists in the container then all pages are already pinned and accounted.
>   Accounting for mdev device is only done if there is no iommu capable
>   domain in the container.
> - Page accounting is updated on hot plug and unplug of mdev device and pass
>   through device.
> 
> Tested by assigning below combinations of devices to a single VM:
> - GPU pass through only
> - vGPU device only
> - One GPU pass through and one vGPU device
> - Linux VM hot plug and unplug vGPU device while GPU pass through device
>   exist
> - Linux VM hot plug and unplug GPU pass through device while vGPU device
>   exist

Were you able to do these with the locked memory limit of the user set
to the minimum required for existing GPU assignment?

> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
> ---
>  drivers/vfio/vfio.c             |  98 ++++++
>  drivers/vfio/vfio_iommu_type1.c | 692 ++++++++++++++++++++++++++++++++++------
>  include/linux/vfio.h            |  13 +-
>  3 files changed, 707 insertions(+), 96 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 2e83bdf007fe..a5a210005b65 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1799,6 +1799,104 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
>  }
>  EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
>  
> +
> +/*
> + * Pin a set of guest PFNs and return their associated host PFNs for local
> + * domain only.
> + * @dev [in] : device
> + * @user_pfn [in]: array of user/guest PFNs
> + * @npage [in]: count of array elements
> + * @prot [in] : protection flags
> + * @phys_pfn[out] : array of host PFNs
> + */
> +long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
> +		    long npage, int prot, unsigned long *phys_pfn)
> +{
> +	struct vfio_container *container;
> +	struct vfio_group *group;
> +	struct vfio_iommu_driver *driver;
> +	ssize_t ret = -EINVAL;

Unused initialization.

> +
> +	if (!dev || !user_pfn || !phys_pfn)
> +		return -EINVAL;
> +
> +	group = vfio_group_get_from_dev(dev);
> +	if (IS_ERR(group))
> +		return PTR_ERR(group);
> +
> +	ret = vfio_group_add_container_user(group);
> +	if (ret)
> +		goto err_pin_pages;
> +
> +	container = group->container;
> +	if (IS_ERR(container)) {

I don't see that we ever use an ERR_PTR to set group->container, it
should either be NULL or valid and the fact that we added ourselves to
container_users should mean that it's valid.  The paranoia test here
would be if container is NULL, but IS_ERR() doesn't check NULL.  If we
need that paranoia test, maybe we should just:

if (WARN_ON(!container)) {

I'm not fully convinced it's needed though.

> +		ret = PTR_ERR(container);
> +		goto err_pin_pages;
> +	}
> +
> +	down_read(&container->group_lock);
> +
> +	driver = container->iommu_driver;
> +	if (likely(driver && driver->ops->pin_pages))
> +		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
> +					     npage, prot, phys_pfn);

The caller is going to need to provide some means for us to callback to
invalidate pinned pages.

ret has already been used, so it's zero at this point.  I expect the
original intention was to let the initialization above fall through
here so that the caller gets an errno if the driver doesn't support
pin_pages.  Returning zero without actually doing anything seems like
an unexpected return value.
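
i.e. something like (sketch):

	driver = container->iommu_driver;
	if (likely(driver && driver->ops->pin_pages))
		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
					     npage, prot, phys_pfn);
	else
		ret = -ENOTTY;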

> +
> +	up_read(&container->group_lock);
> +	vfio_group_try_dissolve_container(group);
> +
> +err_pin_pages:
> +	vfio_group_put(group);
> +	return ret;
> +
> +}
> +EXPORT_SYMBOL(vfio_pin_pages);
> +
> +/*
> + * Unpin set of host PFNs for local domain only.
> + * @dev [in] : device
> + * @pfn [in] : array of host PFNs to be unpinned.
> + * @npage [in] :count of elements in array, that is number of pages.
> + */
> +long vfio_unpin_pages(struct device *dev, unsigned long *pfn, long npage)
> +{
> +	struct vfio_container *container;
> +	struct vfio_group *group;
> +	struct vfio_iommu_driver *driver;
> +	ssize_t ret = -EINVAL;

Same unused initialization.

> +
> +	if (!dev || !pfn)
> +		return -EINVAL;
> +
> +	group = vfio_group_get_from_dev(dev);
> +	if (IS_ERR(group))
> +		return PTR_ERR(group);
> +
> +	ret = vfio_group_add_container_user(group);
> +	if (ret)
> +		goto err_unpin_pages;
> +
> +	container = group->container;
> +	if (IS_ERR(container)) {

Same container note as above.

> +		ret = PTR_ERR(container);
> +		goto err_unpin_pages;
> +	}
> +
> +	down_read(&container->group_lock);
> +
> +	driver = container->iommu_driver;
> +	if (likely(driver && driver->ops->unpin_pages))
> +		ret = driver->ops->unpin_pages(container->iommu_data, pfn,
> +					       npage);

Same fall through, zero return value as above.

> +
> +	up_read(&container->group_lock);
> +	vfio_group_try_dissolve_container(group);
> +
> +err_unpin_pages:
> +	vfio_group_put(group);
> +	return ret;
> +}
> +EXPORT_SYMBOL(vfio_unpin_pages);
> +
>  /**
>   * Module/class support
>   */
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 2ba19424e4a1..5d67058a611d 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -55,16 +55,24 @@ MODULE_PARM_DESC(disable_hugepages,
>  
>  struct vfio_iommu {
>  	struct list_head	domain_list;
> +	struct vfio_domain	*local_domain;
>  	struct mutex		lock;
>  	struct rb_root		dma_list;
>  	bool			v2;
>  	bool			nesting;
>  };
>  
> +struct local_addr_space {
> +	struct task_struct	*task;
> +	struct rb_root		pfn_list;	/* pinned Host pfn list */
> +	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
> +};
> +
>  struct vfio_domain {
>  	struct iommu_domain	*domain;
>  	struct list_head	next;
>  	struct list_head	group_list;
> +	struct local_addr_space	*local_addr_space;
>  	int			prot;		/* IOMMU_CACHE */
>  	bool			fgsp;		/* Fine-grained super pages */
>  };
> @@ -75,6 +83,7 @@ struct vfio_dma {
>  	unsigned long		vaddr;		/* Process virtual addr */
>  	size_t			size;		/* Map size (bytes) */
>  	int			prot;		/* IOMMU_READ/WRITE */
> +	bool			iommu_mapped;
>  };
>  
>  struct vfio_group {
> @@ -83,6 +92,21 @@ struct vfio_group {
>  };
>  
>  /*
> + * Guest RAM pinning working set or DMA target
> + */
> +struct vfio_pfn {
> +	struct rb_node		node;
> +	unsigned long		vaddr;		/* virtual addr */
> +	dma_addr_t		iova;		/* IOVA */
> +	unsigned long		pfn;		/* Host pfn */
> +	int			prot;
> +	atomic_t		ref_count;
> +};

Somehow we're going to need to fit an invalidation callback here too.
How would we handle a case where there are multiple mdev devices, from
different vendor drivers, that all have the same pfn pinned?  I'm
already concerned about the per pfn overhead we're introducing here so
clearly we cannot store an invalidation callback per pinned page, per
vendor driver.  Perhaps invalidations should be done using a notifier
chain per vfio_iommu, the vendor drivers are required to register on
that chain (fail pinning with empty notifier list) user unmapping
will be broadcast to the notifier chain, the vendor driver will be
responsible for deciding if each unmap is relevant to them (potentially
it's for a pinning from another driver).

I expect we also need to enforce that vendors perform a synchronous
unmap such that after returning from the notifier list call, the
vfio_pfn should no longer exist.  If it does we might need to BUG_ON.
Also be careful to pay attention to the locking of the notifier vs
unpin callbacks to avoid deadlocks.
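
Roughly what I have in mind (hypothetical sketch; the vfio_* names below
are invented, only the standard kernel notifier infrastructure is assumed):

	/* per vfio_iommu */
	struct blocking_notifier_head notifier;	/* DMA unmap notifications */

	/* vendor driver registers before it is allowed to pin pages */
	int vfio_register_notifier(struct device *dev, struct notifier_block *nb);

	/* on DMA unmap, broadcast the iova range being invalidated */
	blocking_notifier_call_chain(&iommu->notifier,
				     VFIO_IOMMU_NOTIFY_DMA_UNMAP, &unmap);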

> +
> +#define IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu)	\
> +					(!list_empty(&iommu->domain_list))
> +
> +/*
>   * This code handles mapping and unmapping of user data buffers
>   * into DMA'ble space using the IOMMU
>   */
> @@ -130,6 +154,101 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>  	rb_erase(&old->node, &iommu->dma_list);
>  }
>  
> +/*
> + * Helper Functions for host pfn list
> + */
> +
> +static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
> +				      unsigned long pfn)
> +{
> +	struct rb_node *node;
> +	struct vfio_pfn *vpfn;
> +
> +	node = domain->local_addr_space->pfn_list.rb_node;
> +
> +	while (node) {
> +		vpfn = rb_entry(node, struct vfio_pfn, node);
> +
> +		if (pfn < vpfn->pfn)
> +			node = node->rb_left;
> +		else if (pfn > vpfn->pfn)
> +			node = node->rb_right;
> +		else
> +			return vpfn;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
> +{
> +	struct rb_node **link, *parent = NULL;
> +	struct vfio_pfn *vpfn;
> +
> +	link = &domain->local_addr_space->pfn_list.rb_node;
> +	while (*link) {
> +		parent = *link;
> +		vpfn = rb_entry(parent, struct vfio_pfn, node);
> +
> +		if (new->pfn < vpfn->pfn)
> +			link = &(*link)->rb_left;
> +		else
> +			link = &(*link)->rb_right;
> +	}
> +
> +	rb_link_node(&new->node, parent, link);
> +	rb_insert_color(&new->node, &domain->local_addr_space->pfn_list);
> +}
> +
> +static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
> +{
> +	rb_erase(&old->node, &domain->local_addr_space->pfn_list);
> +}
> +
> +static int vfio_add_to_pfn_list(struct vfio_domain *domain, unsigned long vaddr,
> +				dma_addr_t iova, unsigned long pfn, int prot)
> +{
> +	struct vfio_pfn *vpfn;
> +
> +	vpfn = kzalloc(sizeof(*vpfn), GFP_KERNEL);
> +	if (!vpfn)
> +		return -ENOMEM;
> +
> +	vpfn->vaddr = vaddr;
> +	vpfn->iova = iova;
> +	vpfn->pfn = pfn;
> +	vpfn->prot = prot;
> +	atomic_set(&vpfn->ref_count, 1);
> +	vfio_link_pfn(domain, vpfn);
> +	return 0;
> +}
> +
> +static void vfio_remove_from_pfn_list(struct vfio_domain *domain,
> +				      struct vfio_pfn *vpfn)
> +{
> +	vfio_unlink_pfn(domain, vpfn);
> +	kfree(vpfn);
> +}
> +
> +static int vfio_pfn_account(struct vfio_iommu *iommu, unsigned long pfn)
> +{
> +	struct vfio_pfn *p;
> +	struct vfio_domain *domain = iommu->local_domain;
> +	int ret = 1;
> +
> +	if (!domain)
> +		return 1;
> +
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +	p = vfio_find_pfn(domain, pfn);
> +	if (p)
> +		ret = 0;
> +
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +	return ret;
> +}

So if the vfio_pfn for a given pfn exists, return 0, else return 1.
But do we know that the vfio_pfn exists at the point where we actually
do that accounting?

> +
>  struct vwork {
>  	struct mm_struct	*mm;
>  	long			npage;
> @@ -150,17 +269,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
>  	kfree(vwork);
>  }
>  
> -static void vfio_lock_acct(long npage)
> +static void vfio_lock_acct(struct task_struct *task, long npage)
>  {
>  	struct vwork *vwork;
>  	struct mm_struct *mm;
>  
> -	if (!current->mm || !npage)
> +	if (!task->mm || !npage)
>  		return; /* process exited or nothing to do */
>  
> -	if (down_write_trylock(&current->mm->mmap_sem)) {
> -		current->mm->locked_vm += npage;
> -		up_write(&current->mm->mmap_sem);
> +	if (down_write_trylock(&task->mm->mmap_sem)) {
> +		task->mm->locked_vm += npage;
> +		up_write(&task->mm->mmap_sem);
>  		return;
>  	}
>  
> @@ -172,7 +291,7 @@ static void vfio_lock_acct(long npage)
>  	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
>  	if (!vwork)
>  		return;
> -	mm = get_task_mm(current);
> +	mm = get_task_mm(task);
>  	if (!mm) {
>  		kfree(vwork);
>  		return;
> @@ -228,20 +347,31 @@ static int put_pfn(unsigned long pfn, int prot)
>  	return 0;
>  }

This conversion of vfio_lock_acct() to pass a task_struct and updating
existing callers to pass current would be a great separate, easily
review-able patch.

>  
> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> +			 int prot, unsigned long *pfn)
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct mm_struct *local_mm = (mm ? mm : current->mm);
>  	int ret = -EFAULT;
>  
> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> +	if (mm) {
> +		down_read(&local_mm->mmap_sem);
> +		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
> +					!!(prot & IOMMU_WRITE), 0, page, NULL);
> +		up_read(&local_mm->mmap_sem);
> +	} else
> +		ret = get_user_pages_fast(vaddr, 1,
> +					  !!(prot & IOMMU_WRITE), page);
> +
> +	if (ret == 1) {
>  		*pfn = page_to_pfn(page[0]);
>  		return 0;
>  	}
>  
> -	down_read(&current->mm->mmap_sem);
> +	down_read(&local_mm->mmap_sem);
>  
> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
>  
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> @@ -249,7 +379,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>  			ret = 0;
>  	}
>  
> -	up_read(&current->mm->mmap_sem);
> +	up_read(&local_mm->mmap_sem);
>  
>  	return ret;
>  }

This would also be a great separate patch.  Have you considered
renaming the mm_struct function arg to "remote_mm" and making the local
variable simply "mm"?  It seems like it would tie nicely with the
remote_mm path using get_user_pages_remote() while passing NULL for
remote_mm uses current->mm and the existing path (and avoid the general
oddness of passing local_mm to a "remote" function).
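
For illustration, with the suggested naming the function might read as
below (same logic as the hunk above with only the names swapped; the
VM_PFNMAP fallback is elided):

static int vaddr_get_pfn(struct mm_struct *remote_mm, unsigned long vaddr,
			 int prot, unsigned long *pfn)
{
	struct page *page[1];
	struct mm_struct *mm = remote_mm ? remote_mm : current->mm;
	int ret;

	if (remote_mm) {
		down_read(&mm->mmap_sem);
		ret = get_user_pages_remote(NULL, mm, vaddr, 1,
					    !!(prot & IOMMU_WRITE), 0,
					    page, NULL);
		up_read(&mm->mmap_sem);
	} else {
		ret = get_user_pages_fast(vaddr, 1,
					  !!(prot & IOMMU_WRITE), page);
	}

	if (ret == 1) {
		*pfn = page_to_pfn(page[0]);
		return 0;
	}

	/* ... VM_PFNMAP fallback as in the hunk above, using mm ... */
	return -EFAULT;
}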

> @@ -259,33 +389,37 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>   * first page and all consecutive pages with the same locking.
>   */
> -static long vfio_pin_pages(unsigned long vaddr, long npage,
> -			   int prot, unsigned long *pfn_base)
> +static long __vfio_pin_pages_remote(struct vfio_iommu *iommu,
> +				    unsigned long vaddr, long npage,
> +				    int prot, unsigned long *pfn_base)
>  {
>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  	bool lock_cap = capable(CAP_IPC_LOCK);
> -	long ret, i;
> +	long ret, i, lock_acct = 0;
>  	bool rsvd;
>  
>  	if (!current->mm)
>  		return -ENODEV;
>  
> -	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> +	ret = vaddr_get_pfn(NULL, vaddr, prot, pfn_base);
>  	if (ret)
>  		return ret;
>  
> +	lock_acct = vfio_pfn_account(iommu, *pfn_base);
> +
>  	rsvd = is_invalid_reserved_pfn(*pfn_base);
>  
> -	if (!rsvd && !lock_cap && current->mm->locked_vm + 1 > limit) {
> +	if (!rsvd && !lock_cap && current->mm->locked_vm + lock_acct > limit) {
>  		put_pfn(*pfn_base, prot);
>  		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
>  			limit << PAGE_SHIFT);
>  		return -ENOMEM;
>  	}
>  
> +

Extra whitespace

>  	if (unlikely(disable_hugepages)) {
>  		if (!rsvd)
> -			vfio_lock_acct(1);
> +			vfio_lock_acct(current, lock_acct);
>  		return 1;
>  	}
>  
> @@ -293,7 +427,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
>  		unsigned long pfn = 0;
>  
> -		ret = vaddr_get_pfn(vaddr, prot, &pfn);
> +		ret = vaddr_get_pfn(NULL, vaddr, prot, &pfn);
>  		if (ret)
>  			break;
>  
> @@ -303,8 +437,10 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  			break;
>  		}
>  
> +		lock_acct += vfio_pfn_account(iommu, pfn);
> +

I take it that this is the new technique for keeping the accounting
accurate, we only increment the locked accounting by the amount not
already pinned in a vfio_pfn.

>  		if (!rsvd && !lock_cap &&
> -		    current->mm->locked_vm + i + 1 > limit) {
> +		    current->mm->locked_vm + lock_acct > limit) {
>  			put_pfn(pfn, prot);
>  			pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
>  				__func__, limit << PAGE_SHIFT);
> @@ -313,23 +449,216 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	}
>  
>  	if (!rsvd)
> -		vfio_lock_acct(i);
> +		vfio_lock_acct(current, lock_acct);
>  
>  	return i;
>  }
>  
> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> -			     int prot, bool do_accounting)
> +static long __vfio_unpin_pages_remote(struct vfio_iommu *iommu,
> +				      unsigned long pfn, long npage, int prot,
> +				      bool do_accounting)

Have you noticed that it's kind of confusing that
__vfio_{un}pin_pages_remote() uses current and does a
get_user_pages_fast(), while "local" uses a provided task_struct and
get_user_pages_*remote*()?  And also, what was effectively local
(ie. we're pinning for our own use here) is now "remote", and pinning
for a remote, vendor driver consumer is now "local".  It's not very
intuitive.

>  {
> -	unsigned long unlocked = 0;
> +	unsigned long unlocked = 0, unlock_acct = 0;
>  	long i;
>  
> -	for (i = 0; i < npage; i++)
> +	for (i = 0; i < npage; i++) {
> +		if (do_accounting)
> +			unlock_acct += vfio_pfn_account(iommu, pfn);
> +
>  		unlocked += put_pfn(pfn++, prot);
> +	}
>  
>  	if (do_accounting)
> -		vfio_lock_acct(-unlocked);
> +		vfio_lock_acct(current, -unlock_acct);
> +
> +	return unlocked;
> +}
> +
> +static long __vfio_pin_page_local(struct vfio_domain *domain,
> +				  unsigned long vaddr, int prot,
> +				  unsigned long *pfn_base,
> +				  bool do_accounting)
> +{
> +	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	bool lock_cap = capable(CAP_IPC_LOCK);
> +	long ret;
> +	bool rsvd;
> +	struct task_struct *task = domain->local_addr_space->task;
> +
> +	if (!task->mm)
> +		return -ENODEV;
> +
> +	ret = vaddr_get_pfn(task->mm, vaddr, prot, pfn_base);
> +	if (ret)
> +		return ret;
> +
> +	rsvd = is_invalid_reserved_pfn(*pfn_base);
> +
> +	if (!rsvd && !lock_cap && task->mm->locked_vm + 1 > limit) {
> +		put_pfn(*pfn_base, prot);
> +		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
> +			limit << PAGE_SHIFT);
> +		return -ENOMEM;
> +	}
> +
> +	if (!rsvd && do_accounting)
> +		vfio_lock_acct(task, 1);
> +
> +	return 1;
> +}
> +
> +static void __vfio_unpin_page_local(struct vfio_domain *domain,
> +				    unsigned long pfn, int prot,
> +				    bool do_accounting)
> +{
> +	put_pfn(pfn, prot);
> +
> +	if (do_accounting)
> +		vfio_lock_acct(domain->local_addr_space->task, -1);
> +}
> +
> +static int vfio_unpin_pfn(struct vfio_domain *domain,
> +			  struct vfio_pfn *vpfn, bool do_accounting)
> +{
> +	__vfio_unpin_page_local(domain, vpfn->pfn, vpfn->prot,
> +				do_accounting);
> +
> +	if (atomic_dec_and_test(&vpfn->ref_count))
> +		vfio_remove_from_pfn_list(domain, vpfn);
> +
> +	return 1;
> +}
> +
> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
> +				       unsigned long *user_pfn,
> +				       long npage, int prot,
> +				       unsigned long *phys_pfn)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain;
> +	int i, j, ret;
> +	long retpage;
> +	unsigned long remote_vaddr;
> +	unsigned long *pfn = phys_pfn;
> +	struct vfio_dma *dma;
> +	bool do_accounting;
> +
> +	if (!iommu || !user_pfn || !phys_pfn)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	if (!iommu->local_domain) {
> +		ret = -EINVAL;
> +		goto pin_done;
> +	}
> +
> +	domain = iommu->local_domain;
> +
> +	/*
> +	 * If an iommu capable domain exists in the container then all pages are
> +	 * already pinned and accounted. Accounting should be done if there is no
> +	 * iommu capable domain in the container.
> +	 */
> +	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +		dma_addr_t iova;
> +
> +		iova = user_pfn[i] << PAGE_SHIFT;
> +
> +		dma = vfio_find_dma(iommu, iova, 0);
> +		if (!dma) {
> +			ret = -EINVAL;
> +			goto pin_unwind;
> +		}
> +
> +		remote_vaddr = dma->vaddr + iova - dma->iova;
> +
> +		retpage = __vfio_pin_page_local(domain, remote_vaddr, prot,
> +						&pfn[i], do_accounting);
> +		if (retpage <= 0) {
> +			WARN_ON(!retpage);
> +			ret = (int)retpage;
> +			goto pin_unwind;
> +		}
> +
> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +		/* search if pfn exist */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p) {
> +			atomic_inc(&p->ref_count);
> +			mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +			continue;
> +		}
> +
> +		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
> +					   pfn[i], prot);
> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +
> +		if (ret) {
> +			__vfio_unpin_page_local(domain, pfn[i], prot,
> +						do_accounting);
> +			goto pin_unwind;
> +		}
> +	}
> +
> +	ret = i;
> +	goto pin_done;
> +
> +pin_unwind:
> +	pfn[i] = 0;
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +	for (j = 0; j < i; j++) {
> +		struct vfio_pfn *p;
> +
> +		p = vfio_find_pfn(domain, pfn[j]);
> +		if (p)
> +			vfio_unpin_pfn(domain, p, do_accounting);
> +
> +		pfn[j] = 0;
> +	}
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +
> +pin_done:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
> +					 long npage)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL;
> +	bool do_accounting;
> +	long unlocked = 0;
> +	int i;
> +
> +	if (!iommu || !pfn)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	domain = iommu->local_domain;
> +
> +	do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
> +
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
>  
> +		/* verify if pfn exist in pfn_list */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p)
> +			unlocked += vfio_unpin_pfn(domain, p, do_accounting);
> +
> +	}
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +
> +	mutex_unlock(&iommu->lock);
>  	return unlocked;
>  }
>  
> @@ -341,6 +670,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  
>  	if (!dma->size)
>  		return;
> +
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> +		return;
> +
>  	/*
>  	 * We use the IOMMU to track the physical addresses, otherwise we'd
>  	 * need a much more complicated tracking system.  Unfortunately that
> @@ -382,15 +715,16 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  		if (WARN_ON(!unmapped))
>  			break;
>  
> -		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
> -					     unmapped >> PAGE_SHIFT,
> -					     dma->prot, false);
> +		unlocked += __vfio_unpin_pages_remote(iommu, phys >> PAGE_SHIFT,
> +						      unmapped >> PAGE_SHIFT,
> +						      dma->prot, false);
>  		iova += unmapped;
>  
>  		cond_resched();
>  	}
>  
> -	vfio_lock_acct(-unlocked);
> +	dma->iommu_mapped = false;
> +	vfio_lock_acct(current, -unlocked);
>  }
>  
>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> @@ -558,17 +892,57 @@ unwind:
>  	return ret;
>  }
>  
> +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> +			    size_t map_size)
> +{
> +	dma_addr_t iova = dma->iova;
> +	unsigned long vaddr = dma->vaddr;
> +	size_t size = map_size;
> +	long npage;
> +	unsigned long pfn;
> +	int ret = 0;
> +
> +	while (size) {
> +		/* Pin a contiguous chunk of memory */
> +		npage = __vfio_pin_pages_remote(iommu, vaddr + dma->size,
> +						size >> PAGE_SHIFT, dma->prot,
> +						&pfn);
> +		if (npage <= 0) {
> +			WARN_ON(!npage);
> +			ret = (int)npage;
> +			break;
> +		}
> +
> +		/* Map it! */
> +		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage,
> +				     dma->prot);
> +		if (ret) {
> +			__vfio_unpin_pages_remote(iommu, pfn, npage, dma->prot,
> +						  true);
> +			break;
> +		}
> +
> +		size -= npage << PAGE_SHIFT;
> +		dma->size += npage << PAGE_SHIFT;
> +	}
> +
> +	dma->iommu_mapped = true;
> +
> +	if (ret)
> +		vfio_remove_dma(iommu, dma);
> +
> +	return ret;
> +}
> +
>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  			   struct vfio_iommu_type1_dma_map *map)
>  {
>  	dma_addr_t iova = map->iova;
>  	unsigned long vaddr = map->vaddr;
>  	size_t size = map->size;
> -	long npage;
>  	int ret = 0, prot = 0;
>  	uint64_t mask;
>  	struct vfio_dma *dma;
> -	unsigned long pfn;
>  
>  	/* Verify that none of our __u64 fields overflow */
>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> @@ -611,29 +985,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	/* Insert zero-sized and grow as we map chunks of it */
>  	vfio_link_dma(iommu, dma);
>  
> -	while (size) {
> -		/* Pin a contiguous chunk of memory */
> -		npage = vfio_pin_pages(vaddr + dma->size,
> -				       size >> PAGE_SHIFT, prot, &pfn);
> -		if (npage <= 0) {
> -			WARN_ON(!npage);
> -			ret = (int)npage;
> -			break;
> -		}
> -
> -		/* Map it! */
> -		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
> -		if (ret) {
> -			vfio_unpin_pages(pfn, npage, prot, true);
> -			break;
> -		}
> -
> -		size -= npage << PAGE_SHIFT;
> -		dma->size += npage << PAGE_SHIFT;
> -	}
> -
> -	if (ret)
> -		vfio_remove_dma(iommu, dma);
> +	/* Don't pin and map if container doesn't contain IOMMU capable domain*/
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> +		dma->size = size;
> +	else
> +		ret = vfio_pin_map_dma(iommu, dma, size);
>  
>  	mutex_unlock(&iommu->lock);
>  	return ret;
> @@ -662,10 +1018,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
>  	n = rb_first(&iommu->dma_list);
>  
> -	/* If there's not a domain, there better not be any mappings */
> -	if (WARN_ON(n && !d))
> -		return -EINVAL;
> -
>  	for (; n; n = rb_next(n)) {
>  		struct vfio_dma *dma;
>  		dma_addr_t iova;
> @@ -674,20 +1026,43 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  		iova = dma->iova;
>  
>  		while (iova < dma->iova + dma->size) {
> -			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
> +			phys_addr_t phys;
>  			size_t size;
>  
> -			if (WARN_ON(!phys)) {
> -				iova += PAGE_SIZE;
> -				continue;
> -			}
> +			if (dma->iommu_mapped) {
> +				phys = iommu_iova_to_phys(d->domain, iova);
> +
> +				if (WARN_ON(!phys)) {
> +					iova += PAGE_SIZE;
> +					continue;
> +				}
>  
> -			size = PAGE_SIZE;
> +				size = PAGE_SIZE;
>  
> -			while (iova + size < dma->iova + dma->size &&
> -			       phys + size == iommu_iova_to_phys(d->domain,
> +				while (iova + size < dma->iova + dma->size &&
> +				    phys + size == iommu_iova_to_phys(d->domain,
>  								 iova + size))
> -				size += PAGE_SIZE;
> +					size += PAGE_SIZE;
> +			} else {
> +				unsigned long pfn;
> +				unsigned long vaddr = dma->vaddr +
> +						     (iova - dma->iova);
> +				size_t n = dma->iova + dma->size - iova;
> +				long npage;
> +
> +				npage = __vfio_pin_pages_remote(iommu, vaddr,
> +								n >> PAGE_SHIFT,
> +								dma->prot,
> +								&pfn);
> +				if (npage <= 0) {
> +					WARN_ON(!npage);
> +					ret = (int)npage;
> +					return ret;
> +				}
> +
> +				phys = pfn << PAGE_SHIFT;
> +				size = npage << PAGE_SHIFT;
> +			}
>  
>  			ret = iommu_map(domain->domain, iova, phys,
>  					size, dma->prot | domain->prot);
> @@ -696,6 +1071,8 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  
>  			iova += size;
>  		}
> +
> +		dma->iommu_mapped = true;
>  	}
>  
>  	return 0;
> @@ -734,11 +1111,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>  	__free_pages(pages, order);
>  }
>  
> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
> +				   struct iommu_group *iommu_group)
> +{
> +	struct vfio_group *g;
> +
> +	list_for_each_entry(g, &domain->group_list, next) {
> +		if (g->iommu_group == iommu_group)
> +			return g;
> +	}
> +
> +	return NULL;
> +}
> +
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					 struct iommu_group *iommu_group)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
> -	struct vfio_group *group, *g;
> +	struct vfio_group *group;
>  	struct vfio_domain *domain, *d;
>  	struct bus_type *bus = NULL;
>  	int ret;
> @@ -746,10 +1136,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	mutex_lock(&iommu->lock);
>  
>  	list_for_each_entry(d, &iommu->domain_list, next) {
> -		list_for_each_entry(g, &d->group_list, next) {
> -			if (g->iommu_group != iommu_group)
> -				continue;
> +		if (find_iommu_group(d, iommu_group)) {
> +			mutex_unlock(&iommu->lock);
> +			return -EINVAL;
> +		}
> +	}

The find_iommu_group() conversion would also be an easy separate patch.

>  
> +	if (iommu->local_domain) {
> +		if (find_iommu_group(iommu->local_domain, iommu_group)) {
>  			mutex_unlock(&iommu->lock);
>  			return -EINVAL;
>  		}
> @@ -769,6 +1163,30 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_free;
>  
> +	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
> +	    (bus == &mdev_bus_type)) {
> +		if (!iommu->local_domain) {
> +			domain->local_addr_space =
> +				kzalloc(sizeof(*domain->local_addr_space),
> +						GFP_KERNEL);
> +			if (!domain->local_addr_space) {
> +				ret = -ENOMEM;
> +				goto out_free;
> +			}
> +
> +			domain->local_addr_space->task = current;
> +			INIT_LIST_HEAD(&domain->group_list);
> +			domain->local_addr_space->pfn_list = RB_ROOT;
> +			mutex_init(&domain->local_addr_space->pfn_list_lock);
> +			iommu->local_domain = domain;
> +		} else
> +			kfree(domain);
> +
> +		list_add(&group->next, &domain->group_list);

I think you mean s/domain/iommu->local_domain/ here; we just freed
domain in the else path.
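
For illustration, the corrected tail of that branch (assuming the
intent is to track the group on the saved local domain) would then
read:

		} else
			kfree(domain);

		list_add(&group->next, &iommu->local_domain->group_list);
		mutex_unlock(&iommu->lock);
		return 0;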

> +		mutex_unlock(&iommu->lock);
> +		return 0;
> +	}
> +
>  	domain->domain = iommu_domain_alloc(bus);
>  	if (!domain->domain) {
>  		ret = -EIO;
> @@ -859,6 +1277,41 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
>  		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
>  }
>  
> +static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
> +{
> +	struct vfio_domain *domain = iommu->local_domain;
> +	struct vfio_dma *dma, *tdma;
> +	struct rb_node *n;
> +	long locked = 0;
> +
> +	rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
> +					     node) {
> +		vfio_unmap_unpin(iommu, dma);
> +	}
> +
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +	n = rb_first(&domain->local_addr_space->pfn_list);
> +
> +	for (; n; n = rb_next(n))
> +		locked++;
> +
> +	vfio_lock_acct(domain->local_addr_space->task, locked);
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +}

Couldn't a properly timed mlock by the user allow them to lock more
memory than they're allowed here?  For instance, imagine the vendor
driver has pinned the entire VM memory and the user has exactly the
locked memory limit for that VM.  During the gap here between unpinning
the entire vfio_dma list and re-accounting for the pfn_list, the user
can mlock up to their limit again, and now they've doubled the locked
memory they're allowed.

> +
> +static void vfio_local_unpin_all(struct vfio_domain *domain)
> +{
> +	struct rb_node *node;
> +
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +	while ((node = rb_first(&domain->local_addr_space->pfn_list)))
> +		vfio_unpin_pfn(domain,
> +				rb_entry(node, struct vfio_pfn, node), false);
> +
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +}
> +
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
>  					  struct iommu_group *iommu_group)
>  {
> @@ -868,31 +1321,57 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  
>  	mutex_lock(&iommu->lock);
>  
> -	list_for_each_entry(domain, &iommu->domain_list, next) {
> -		list_for_each_entry(group, &domain->group_list, next) {
> -			if (group->iommu_group != iommu_group)
> -				continue;
> -
> -			iommu_detach_group(domain->domain, iommu_group);
> +	if (iommu->local_domain) {
> +		domain = iommu->local_domain;
> +		group = find_iommu_group(domain, iommu_group);
> +		if (group) {
>  			list_del(&group->next);
>  			kfree(group);
> -			/*
> -			 * Group ownership provides privilege, if the group
> -			 * list is empty, the domain goes away.  If it's the
> -			 * last domain, then all the mappings go away too.
> -			 */
> +
>  			if (list_empty(&domain->group_list)) {
> -				if (list_is_singular(&iommu->domain_list))
> +				vfio_local_unpin_all(domain);
> +				if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
>  					vfio_iommu_unmap_unpin_all(iommu);
> -				iommu_domain_free(domain->domain);
> -				list_del(&domain->next);
>  				kfree(domain);
> +				iommu->local_domain = NULL;
> +			}


I can't quite wrap my head around this: if we have mdev groups attached
and this iommu group matches an mdev group, remove it from the list and
free the group.  If there are now no more groups in the mdev group
list, then for each vfio_pfn, unpin the pfn, /without/ doing accounting
updates, and remove the vfio_pfn, but only if the ref_count is now
zero.  We free the domain, so if the ref_count was non-zero we've now
just leaked memory.  I think that means that if a vendor driver pins a
given page twice, that leak occurs.  Furthermore, if there is not an
iommu capable domain in the container, we remove all the vfio_dma
entries as well, ok.  Maybe the only issue is those leaked vfio_pfns.

> +			goto detach_group_done;
> +		}
> +	}
> +
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> +		goto detach_group_done;
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		group = find_iommu_group(domain, iommu_group);
> +		if (!group)
> +			continue;
> +
> +		iommu_detach_group(domain->domain, iommu_group);
> +		list_del(&group->next);
> +		kfree(group);
> +		/*
> +		 * Group ownership provides privilege, if the group list is
> +		 * empty, the domain goes away. If it's the last domain with
> +		 * iommu and local domain doesn't exist, then all the mappings
> +		 * go away too. If it's the last domain with iommu and local
> +		 * domain exist, update accounting
> +		 */
> +		if (list_empty(&domain->group_list)) {
> +			if (list_is_singular(&iommu->domain_list)) {
> +				if (!iommu->local_domain)
> +					vfio_iommu_unmap_unpin_all(iommu);
> +				else
> +					vfio_iommu_unmap_unpin_reaccount(iommu);
>  			}
> -			goto done;
> +			iommu_domain_free(domain->domain);
> +			list_del(&domain->next);
> +			kfree(domain);
>  		}
> +		break;
>  	}
>  
> -done:
> +detach_group_done:
>  	mutex_unlock(&iommu->lock);
>  }
>  
> @@ -924,27 +1403,48 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>  	return iommu;
>  }
>  
> +static void vfio_release_domain(struct vfio_domain *domain)
> +{
> +	struct vfio_group *group, *group_tmp;
> +
> +	list_for_each_entry_safe(group, group_tmp,
> +				 &domain->group_list, next) {
> +		if (!domain->local_addr_space)
> +			iommu_detach_group(domain->domain, group->iommu_group);
> +		list_del(&group->next);
> +		kfree(group);
> +	}
> +
> +	if (domain->local_addr_space)
> +		vfio_local_unpin_all(domain);
> +	else
> +		iommu_domain_free(domain->domain);
> +}
> +
>  static void vfio_iommu_type1_release(void *iommu_data)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
>  	struct vfio_domain *domain, *domain_tmp;
> -	struct vfio_group *group, *group_tmp;
> +
> +	if (iommu->local_domain) {
> +		vfio_release_domain(iommu->local_domain);
> +		kfree(iommu->local_domain);
> +		iommu->local_domain = NULL;
> +	}
>  
>  	vfio_iommu_unmap_unpin_all(iommu);
>  
> +	if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> +		goto release_exit;

This is a bit redundant, the below for_each should just have no entries
and we skip to there anyway.  Thanks,

Alex
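
Since domain_list is empty whenever there is no IOMMU capable domain in
the container, a sketch of the same release tail without the early exit
(illustrative only) would be:

	vfio_iommu_unmap_unpin_all(iommu);

	list_for_each_entry_safe(domain, domain_tmp,
				 &iommu->domain_list, next) {
		vfio_release_domain(domain);
		list_del(&domain->next);
		kfree(domain);
	}

	kfree(iommu);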

> +
>  	list_for_each_entry_safe(domain, domain_tmp,
>  				 &iommu->domain_list, next) {
> -		list_for_each_entry_safe(group, group_tmp,
> -					 &domain->group_list, next) {
> -			iommu_detach_group(domain->domain, group->iommu_group);
> -			list_del(&group->next);
> -			kfree(group);
> -		}
> -		iommu_domain_free(domain->domain);
> +		vfio_release_domain(domain);
>  		list_del(&domain->next);
>  		kfree(domain);
>  	}
>  
> +release_exit:
>  	kfree(iommu);
>  }
>  
> @@ -1048,6 +1548,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>  	.ioctl		= vfio_iommu_type1_ioctl,
>  	.attach_group	= vfio_iommu_type1_attach_group,
>  	.detach_group	= vfio_iommu_type1_detach_group,
> +	.pin_pages	= vfio_iommu_type1_pin_pages,
> +	.unpin_pages	= vfio_iommu_type1_unpin_pages,
>  };
>  
>  static int __init vfio_iommu_type1_init(void)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 0ecae0b1cd34..0bd25ba6223d 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -17,6 +17,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/poll.h>
>  #include <uapi/linux/vfio.h>
> +#include <linux/mdev.h>
>  
>  /**
>   * struct vfio_device_ops - VFIO bus driver device callbacks
> @@ -75,7 +76,11 @@ struct vfio_iommu_driver_ops {
>  					struct iommu_group *group);
>  	void		(*detach_group)(void *iommu_data,
>  					struct iommu_group *group);
> -
> +	long		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
> +				     long npage, int prot,
> +				     unsigned long *phys_pfn);
> +	long		(*unpin_pages)(void *iommu_data, unsigned long *pfn,
> +				       long npage);
>  };
>  
>  extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
> @@ -127,6 +132,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>  }
>  #endif /* CONFIG_EEH */
>  
> +extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
> +			   long npage, int prot, unsigned long *phys_pfn);
> +
> +extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
> +			     long npage);
> +
>  /*
>   * IRQfd - generic
>   */

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 01/12] vfio: Mediated device Core driver
  2016-10-19 19:16     ` Kirti Wankhede
@ 2016-10-19 22:20       ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-19 22:20 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Thu, 20 Oct 2016 00:46:48 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/19/2016 4:46 AM, Alex Williamson wrote:
> > On Tue, 18 Oct 2016 02:52:01 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> ...
> >> +static struct mdev_device *__find_mdev_device(struct parent_device *parent,
> >> +					      uuid_le uuid)
> >> +{
> >> +	struct device *dev;
> >> +
> >> +	dev = device_find_child(parent->dev, &uuid, _find_mdev_device);
> >> +	if (!dev)
> >> +		return NULL;
> >> +
> >> +	put_device(dev);
> >> +
> >> +	return to_mdev_device(dev);
> >> +}  
> > 
> > This function is only used by mdev_device_create() for the purpose of
> > checking whether a given uuid for a parent already exists, so the
> > returned device is not actually used.  However, at the point where
> > we're using to_mdev_device() here, we don't actually hold a reference to
> > the device, so that function call and any possible use of the returned
> > pointer by the callee is invalid.  I would either turn this into a
> > "get" function where the callee has a device reference and needs to do
> > a "put" on it or change this to a "exists" test where true/false is
> > returned and the function cannot be later mis-used to do a device
> > lookup where the reference isn't actually valid.
> >   
> 
> I'll change it to return 0 if not found and -EEXIST if found.
> 
> 
> >> +int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
> >> +{
> >> +	int ret;
> >> +	struct mdev_device *mdev;
> >> +	struct parent_device *parent;
> >> +	struct mdev_type *type = to_mdev_type(kobj);
> >> +
> >> +	parent = mdev_get_parent(type->parent);
> >> +	if (!parent)
> >> +		return -EINVAL;
> >> +
> >> +	/* Check for duplicate */
> >> +	mdev = __find_mdev_device(parent, uuid);
> >> +	if (mdev) {
> >> +		ret = -EEXIST;
> >> +		goto create_err;
> >> +	}  
> > 
> > We check here whether the {parent,uuid} already exists, but what
> > prevents us racing with another create call with the same uuid?  ie.
> > neither exists at this point.  Will device_register() fail if the
> > device name already exists?  If so, should we just rely on the error
> > there and skip this duplicate check?  If not, we need a mutex to avoid
> > the race.
> >  
> 
> Yes, device_register() fails if the device already exists, with the
> warning below. Is it ok to dump such a warning? I think this should be
> fine, right? Then we can remove the duplicate check.
> 
> If we want to avoid such a warning, we should keep the duplicate check.

We should avoid such warnings; bugs will get filed otherwise.  Thanks
for checking.  Thanks,

Alex
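
For illustration, one way to keep the duplicate check while closing the
create/create race would be to serialize it with device_register() (a
sketch only; "mdev_list_lock" is a hypothetical lock name):

	mutex_lock(&mdev_list_lock);	/* hypothetical lock, see note above */

	if (__find_mdev_device(parent, uuid)) {
		ret = -EEXIST;
		goto create_unlock;
	}

	/* ... allocate and initialize mdev ... */

	ret = device_register(&mdev->dev);

create_unlock:
	mutex_unlock(&mdev_list_lock);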
 
> [  610.847958] ------------[ cut here ]------------
> [  610.855377] WARNING: CPU: 15 PID: 19839 at fs/sysfs/dir.c:31
> sysfs_warn_dup+0x64/0x80
> [  610.865798] sysfs: cannot create duplicate filename
> '/devices/pci0000:80/0000:80:02.0/0000:83:00.0/0000:84:08.0/0000:85:00.0/83b8f4f2-509f-382f-3c1e-e6bfe0fa1234'
> [  610.885101] Modules linked in:[  610.888039]  nvidia(POE)
> vfio_iommu_type1 vfio_mdev mdev vfio nfsv4 dns_resolver nfs fscache
> sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp
> kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel
> ghash_clmulni_intel aesni_intel glue_helper lrw gf128mul ablk_helper
> cryptd nfsd auth_rpcgss nfs_acl lockd mei_me grace iTCO_wdt
> iTCO_vendor_support mei ipmi_si pcspkr ioatdma i2c_i801 lpc_ich shpchp
> i2c_smbus mfd_core ipmi_msghandler acpi_pad uinput sunrpc xfs libcrc32c
> sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt
> fb_sys_fops ttm drm igb ahci libahci ptp libata pps_core dca
> i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last
> unloaded: mdev]
> [  610.963835] CPU: 15 PID: 19839 Comm: bash Tainted: P           OE
> 4.8.0-next-20161013+ #0
> [  610.973779] Hardware name: Supermicro
> SYS-2027GR-5204A-NC024/X9DRG-HF, BIOS 1.0c 02/28/2013
> [  610.983769]  ffffc90009323ae0 ffffffff813568bf ffffc90009323b30
> 0000000000000000
> [  610.992867]  ffffc90009323b20 ffffffff81085511 0000001f00001000
> ffff8808839ef000
> [  611.001954]  ffff88108b30f900 ffff88109ae368e8 ffff88109ae580b0
> ffff881099cc0818
> [  611.011055] Call Trace:
> [  611.015087]  [<ffffffff813568bf>] dump_stack+0x63/0x84
> [  611.021784]  [<ffffffff81085511>] __warn+0xd1/0xf0
> [  611.028115]  [<ffffffff8108558f>] warn_slowpath_fmt+0x5f/0x80
> [  611.035379]  [<ffffffff812a4e80>] ? kernfs_path_from_node+0x50/0x60
> [  611.043148]  [<ffffffff812a86c4>] sysfs_warn_dup+0x64/0x80
> [  611.050109]  [<ffffffff812a87ae>] sysfs_create_dir_ns+0x7e/0x90
> [  611.057481]  [<ffffffff81359891>] kobject_add_internal+0xc1/0x340
> [  611.065018]  [<ffffffff81359d45>] kobject_add+0x75/0xd0
> [  611.071635]  [<ffffffff81483829>] device_add+0x119/0x610
> [  611.078314]  [<ffffffff81483d3a>] device_register+0x1a/0x20
> [  611.085261]  [<ffffffffa03c748d>] mdev_device_create+0xdd/0x200 [mdev]
> [  611.093143]  [<ffffffffa03c7768>] create_store+0xa8/0xe0 [mdev]
> [  611.100385]  [<ffffffffa03c76ab>] mdev_type_attr_store+0x1b/0x30 [mdev]
> [  611.108309]  [<ffffffff812a7d8a>] sysfs_kf_write+0x3a/0x50
> [  611.115096]  [<ffffffff812a78bb>] kernfs_fop_write+0x10b/0x190
> [  611.122231]  [<ffffffff81224e97>] __vfs_write+0x37/0x140
> [  611.128817]  [<ffffffff811cea84>] ? handle_mm_fault+0x724/0xd80
> [  611.135976]  [<ffffffff81225da2>] vfs_write+0xb2/0x1b0
> [  611.142354]  [<ffffffff81003510>] ? syscall_trace_enter+0x1d0/0x2b0
> [  611.149836]  [<ffffffff812271f5>] SyS_write+0x55/0xc0
> [  611.156065]  [<ffffffff81003a47>] do_syscall_64+0x67/0x180
> [  611.162734]  [<ffffffff816d41eb>] entry_SYSCALL64_slow_path+0x25/0x25
> [  611.170345] ---[ end trace b05a73599da2ba3f ]---
> [  611.175940] ------------[ cut here ]------------
> 
> 
> 
> >> +static ssize_t create_store(struct kobject *kobj, struct device *dev,
> >> +			    const char *buf, size_t count)
> >> +{
> >> +	char *str;
> >> +	uuid_le uuid;
> >> +	int ret;
> >> +
> >> +	if (count < UUID_STRING_LEN)
> >> +		return -EINVAL;  
> > 
> > 
> > Can't we also test for something unreasonably large?
> >   
> 
> Ok. I'll add that check.
> 
> >   
> >> +
> >> +	str = kstrndup(buf, count, GFP_KERNEL);
> >> +	if (!str)
> >> +		return -ENOMEM;
> >> +
> >> +	ret = uuid_le_to_bin(str, &uuid);  
> > 
> > nit, we can kfree(str) here regardless of the return.
> >   
> >> +	if (!ret) {
> >> +
> >> +		ret = mdev_device_create(kobj, dev, uuid);
> >> +		if (ret)
> >> +			pr_err("mdev_create: Failed to create mdev device\n");  
> > 
> > What value does this pr_err add?  It doesn't tell us why it failed and
> > the user will already know if failed by the return value of their write.
> >   
> 
> Ok, will remove it.
> 
> >> +		else
> >> +			ret = count;
> >> +	}
> >> +
> >> +	kfree(str);
> >> +	return ret;
> >> +}  
> 
> ...
> 
> >> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> >> +{
> >> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> >> +}
> >> +
> >> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> >> +{
> >> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;  
> > 
> > 
> > Do we really need this NULL dev/drv behavior?  I don't see that any of
> > the callers can pass NULL into these.  The PCI equivalents don't
> > support this behavior and it doesn't seem they need to.  Thanks,
> >  
> 
> Ok, I'll update that.
> 
> Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 01/12] vfio: Mediated device Core driver
  2016-10-17 21:22 ` [PATCH v9 01/12] vfio: Mediated device Core driver Kirti Wankhede
  2016-10-18 23:16   ` Alex Williamson
@ 2016-10-20  7:23   ` Jike Song
  2016-10-20 17:12     ` Alex Williamson
  2016-10-26  6:52   ` Tian, Kevin
  2 siblings, 1 reply; 73+ messages in thread
From: Jike Song @ 2016-10-20  7:23 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi, linux-kernel

On 10/18/2016 05:22 AM, Kirti Wankhede wrote:
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> new file mode 100644
> index 000000000000..7db5ec164aeb
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -0,0 +1,372 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +static LIST_HEAD(parent_list);
> +static DEFINE_MUTEX(parent_list_lock);
> +static struct class_compat *mdev_bus_compat_class;
> +

> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	/* check for mandatory ops */
> +	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
> +		return -EINVAL;
> +
> +	dev = get_device(dev);
> +	if (!dev)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_list_lock);
> +
> +	/* Check for duplicate */
> +	parent = __find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +
> +	ret = parent_create_sysfs_files(parent);
> +	if (ret) {
> +		mutex_unlock(&parent_list_lock);
> +		mdev_put_parent(parent);
> +		return ret;
> +	}
> +
> +	ret = class_compat_create_link(mdev_bus_compat_class, dev, NULL);
> +	if (ret)
> +		dev_warn(dev, "Failed to create compatibility class link\n");
> +
> +	list_add(&parent->next, &parent_list);
> +	mutex_unlock(&parent_list_lock);
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_list_lock);
> +	put_device(dev);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);

> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		return ret;
> +	}
> +
> +	mdev_bus_compat_class = class_compat_register("mdev_bus");
> +	if (!mdev_bus_compat_class) {
> +		mdev_bus_unregister();
> +		return -ENOMEM;
> +	}
> +
> +	/*
> +	 * Attempt to load known vfio_mdev.  This gives us a working environment
> +	 * without the user needing to explicitly load vfio_mdev driver.
> +	 */
> +	request_module_nowait("vfio_mdev");
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	class_compat_unregister(mdev_bus_compat_class);
> +	mdev_bus_unregister();
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)

Hi Kirti,

There is a possible issue: mdev_bus_register is called from mdev_init,
a module_init, which is equal to device_initcall if built into vmlinux;
however, the vendor driver, say i915.ko for the Intel case, has to call
mdev_register_device from its own module_init, and at that point
mdev_init has not been called yet.

Not sure if this issue exists with nvidia.ko. Though in most cases we
expect users to select mdev as a standalone module, we still shouldn't
break the builtin case.


Hi Alex, do you have any suggestion here?


--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 01/12] vfio: Mediated device Core driver
  2016-10-20  7:23   ` Jike Song
@ 2016-10-20 17:12     ` Alex Williamson
  2016-10-21  2:41       ` Jike Song
  2016-10-27  5:56       ` Jike Song
  0 siblings, 2 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-20 17:12 UTC (permalink / raw)
  To: Jike Song
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi, linux-kernel

On Thu, 20 Oct 2016 15:23:53 +0800
Jike Song <jike.song@intel.com> wrote:

> On 10/18/2016 05:22 AM, Kirti Wankhede wrote:
> > diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> > new file mode 100644
> > index 000000000000..7db5ec164aeb
> > --- /dev/null
> > +++ b/drivers/vfio/mdev/mdev_core.c
> > @@ -0,0 +1,372 @@
> > +/*
> > + * Mediated device Core Driver
> > + *
> > + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> > + *     Author: Neo Jia <cjia@nvidia.com>
> > + *	       Kirti Wankhede <kwankhede@nvidia.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#include <linux/module.h>
> > +#include <linux/device.h>
> > +#include <linux/slab.h>
> > +#include <linux/uuid.h>
> > +#include <linux/sysfs.h>
> > +#include <linux/mdev.h>
> > +
> > +#include "mdev_private.h"
> > +
> > +#define DRIVER_VERSION		"0.1"
> > +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> > +#define DRIVER_DESC		"Mediated device Core Driver"
> > +
> > +static LIST_HEAD(parent_list);
> > +static DEFINE_MUTEX(parent_list_lock);
> > +static struct class_compat *mdev_bus_compat_class;
> > +  
> 
> > +
> > +/*
> > + * mdev_register_device : Register a device
> > + * @dev: device structure representing parent device.
> > + * @ops: Parent device operation structure to be registered.
> > + *
> > + * Add device to list of registered parent devices.
> > + * Returns a negative value on error, otherwise 0.
> > + */
> > +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> > +{
> > +	int ret = 0;
> > +	struct parent_device *parent;
> > +
> > +	/* check for mandatory ops */
> > +	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
> > +		return -EINVAL;
> > +
> > +	dev = get_device(dev);
> > +	if (!dev)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&parent_list_lock);
> > +
> > +	/* Check for duplicate */
> > +	parent = __find_parent_device(dev);
> > +	if (parent) {
> > +		ret = -EEXIST;
> > +		goto add_dev_err;
> > +	}
> > +
> > +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> > +	if (!parent) {
> > +		ret = -ENOMEM;
> > +		goto add_dev_err;
> > +	}
> > +
> > +	kref_init(&parent->ref);
> > +
> > +	parent->dev = dev;
> > +	parent->ops = ops;
> > +
> > +	ret = parent_create_sysfs_files(parent);
> > +	if (ret) {
> > +		mutex_unlock(&parent_list_lock);
> > +		mdev_put_parent(parent);
> > +		return ret;
> > +	}
> > +
> > +	ret = class_compat_create_link(mdev_bus_compat_class, dev, NULL);
> > +	if (ret)
> > +		dev_warn(dev, "Failed to create compatibility class link\n");
> > +
> > +	list_add(&parent->next, &parent_list);
> > +	mutex_unlock(&parent_list_lock);
> > +
> > +	dev_info(dev, "MDEV: Registered\n");
> > +	return 0;
> > +
> > +add_dev_err:
> > +	mutex_unlock(&parent_list_lock);
> > +	put_device(dev);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(mdev_register_device);  
> 
> > +static int __init mdev_init(void)
> > +{
> > +	int ret;
> > +
> > +	ret = mdev_bus_register();
> > +	if (ret) {
> > +		pr_err("Failed to register mdev bus\n");
> > +		return ret;
> > +	}
> > +
> > +	mdev_bus_compat_class = class_compat_register("mdev_bus");
> > +	if (!mdev_bus_compat_class) {
> > +		mdev_bus_unregister();
> > +		return -ENOMEM;
> > +	}
> > +
> > +	/*
> > +	 * Attempt to load known vfio_mdev.  This gives us a working environment
> > +	 * without the user needing to explicitly load vfio_mdev driver.
> > +	 */
> > +	request_module_nowait("vfio_mdev");
> > +
> > +	return ret;
> > +}
> > +
> > +static void __exit mdev_exit(void)
> > +{
> > +	class_compat_unregister(mdev_bus_compat_class);
> > +	mdev_bus_unregister();
> > +}
> > +
> > +module_init(mdev_init)
> > +module_exit(mdev_exit)  
> 
> Hi Kirti,
> 
> There is a possible issue: mdev_bus_register is called from mdev_init,
> a module_init, equal to device_initcall if builtin to vmlinux; however,
> the vendor driver, say i915.ko for intel case, have to call
> mdev_register_device from its module_init: at that time, mdev_init
> is still not called.
> 
> Not sure if this issue exists with nvidia.ko. Though in most cases we
> are expecting users select mdev as a standalone module, we still won't
> break builtin case.
> 
> 
> Hi Alex, do you have any suggestion here?

To fully solve the problem of built-in drivers making use of the mdev
infrastructure we'd need to make mdev itself builtin and possibly a
subsystem that is initialized prior to device drivers.  Is that really
necessary?  Even though i915.ko is often loaded as part of an
initramfs, most systems still build it as a module.  I would expect
that standard module dependencies will pull in the necessary mdev and
vfio modules to make this work correctly.  I can't say that I'm
prepared to make mdev be a subsystem as would be necessary for builtin
drivers to make use of.  Perhaps if such a driver exists it could
somehow do late binding with mdev.  i915 should certainly be tested as
a builtin driver to make sure it doesn't fail with mdev support added.
The kvm-vfio device (virt/kvm/vfio.c) makes use of symbol tricks to
avoid hard dependencies between kvm and vfio, perhaps when builtin to
the kernel, i915 could use something like that.  Thanks,

Alex
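
For illustration, the symbol trick Alex refers to (modeled on
virt/kvm/vfio.c) might look roughly like the sketch below for a builtin
vendor driver; the ops structure name is an assumption:

	int (*register_fn)(struct device *dev, const struct parent_ops *ops);
	int ret;

	register_fn = symbol_get(mdev_register_device);
	if (!register_fn)
		return -ENODEV;	/* mdev not available (yet) */

	ret = register_fn(dev, &my_parent_ops);	/* my_parent_ops: illustrative */
	symbol_put(mdev_register_device);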

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 05/12] vfio: Introduce common function to add capabilities
  2016-10-17 21:22 ` [PATCH v9 05/12] vfio: Introduce common function to add capabilities Kirti Wankhede
@ 2016-10-20 19:24   ` Alex Williamson
  2016-10-24 21:27     ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2016-10-20 19:24 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Tue, 18 Oct 2016 02:52:05 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Vendor driver using mediated device framework should use
> vfio_info_add_capability() to add capabilities.
> Introduced this function to reduce code duplication in vendor drivers.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I6fca329fa2291f37a2c859d0bc97574d9e2ce1a6
> ---
>  drivers/vfio/vfio.c  | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/vfio.h |  4 +++
>  2 files changed, 82 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index a5a210005b65..e96cb3f7a23c 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1799,6 +1799,84 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
>  }
>  EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
>  
> +static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
> +{
> +	struct vfio_info_cap_header *header;
> +	struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
> +	size_t size;
> +
> +	size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
> +	header = vfio_info_cap_add(caps, size,
> +				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
> +	if (IS_ERR(header))
> +		return PTR_ERR(header);
> +
> +	sparse_cap = container_of(header,
> +			struct vfio_region_info_cap_sparse_mmap, header);
> +	sparse_cap->nr_areas = sparse->nr_areas;
> +	memcpy(sparse_cap->areas, sparse->areas,
> +	       sparse->nr_areas * sizeof(*sparse->areas));
> +	return 0;
> +}
> +
> +static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
> +{
> +	struct vfio_info_cap_header *header;
> +	struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
> +
> +	header = vfio_info_cap_add(caps, sizeof(*cap),
> +				   VFIO_REGION_INFO_CAP_TYPE, 1);
> +	if (IS_ERR(header))
> +		return PTR_ERR(header);
> +
> +	type_cap = container_of(header, struct vfio_region_info_cap_type,
> +				header);
> +	type_cap->type = cap->type;
> +	type_cap->subtype = cap->subtype;
> +	return 0;
> +}
> +
> +int vfio_info_add_capability(struct vfio_region_info *info,
> +			     struct vfio_info_cap *caps,
> +			     int cap_type_id,
> +			     void *cap_type)
> +{
> +	int ret;
> +
> +	if (!cap_type)
> +		return 0;
> +
> +	switch (cap_type_id) {
> +	case VFIO_REGION_INFO_CAP_SPARSE_MMAP:
> +		ret = sparse_mmap_cap(caps, cap_type);
> +		if (ret)
> +			return ret;
> +		break;
> +
> +	case VFIO_REGION_INFO_CAP_TYPE:
> +		ret = region_type_cap(caps, cap_type);
> +		if (ret)
> +			return ret;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	info->flags |= VFIO_REGION_INFO_FLAG_CAPS;
> +
> +	if (caps->size) {
> +		if (info->argsz < sizeof(*info) + caps->size) {
> +			info->argsz = sizeof(*info) + caps->size;
> +			info->cap_offset = 0;
> +		} else {
> +			vfio_info_cap_shift(caps, sizeof(*info));
> +			info->cap_offset = sizeof(*info);

This doesn't work.  We build the capability chain in a buffer and
vfio_info_cap_add() expects the chain to be zero-based as each
capability is added.  vfio_info_cap_shift() is meant to be called once
on that buffer immediately before copying it back to the user buffer to
adjust the chain offsets to account for the offset within the buffer.
vfio_info_cap_shift() cannot be called repeatedly on the buffer as we
do support multiple capabilities in a chain.
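
For reference, the existing vfio_pci flow (quoted later in this thread)
shifts the chain exactly once, immediately before copying it out:

	if (caps.size) {
		info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
		if (info.argsz < sizeof(info) + caps.size) {
			info.argsz = sizeof(info) + caps.size;
			info.cap_offset = 0;
		} else {
			vfio_info_cap_shift(&caps, sizeof(info));
			if (copy_to_user((void __user *)arg + sizeof(info),
					 caps.buf, caps.size)) {
				kfree(caps.buf);
				return -EFAULT;
			}
			info.cap_offset = sizeof(info);
		}

		kfree(caps.buf);
	}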

> +		}
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL(vfio_info_add_capability);
> +
>  
>  /*
>   * Pin a set of guest PFNs and return their associated host PFNs for local
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 0bd25ba6223d..854a4b40be02 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -108,6 +108,10 @@ extern struct vfio_info_cap_header *vfio_info_cap_add(
>  		struct vfio_info_cap *caps, size_t size, u16 id, u16 version);
>  extern void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset);
>  
> +extern int vfio_info_add_capability(struct vfio_region_info *info,
> +				    struct vfio_info_cap *caps,
> +				    int cap_type_id, void *cap_type);
> +
>  struct pci_dev;
>  #ifdef CONFIG_EEH
>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 06/12] vfio_pci: Update vfio_pci to use vfio_info_add_capability()
  2016-10-17 21:22 ` [PATCH v9 06/12] vfio_pci: Update vfio_pci to use vfio_info_add_capability() Kirti Wankhede
@ 2016-10-20 19:24   ` Alex Williamson
  2016-10-24 21:22     ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2016-10-20 19:24 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Tue, 18 Oct 2016 02:52:06 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Update msix_sparse_mmap_cap() to use vfio_info_add_capability()
> Update region type capability to use vfio_info_add_capability()
> Can't split this commit for MSIx and region_type cap since there is a
> common code which need to be updated for both the cases.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I52bb28c7875a6da5a79ddad1843e6088aff58a45
> ---
>  drivers/vfio/pci/vfio_pci.c | 72 +++++++++++++++++----------------------------
>  1 file changed, 27 insertions(+), 45 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index d624a527777f..1ec0565b48ea 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -556,12 +556,12 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
>  }
>  
>  static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
> +				struct vfio_region_info *info,
>  				struct vfio_info_cap *caps)
>  {
> -	struct vfio_info_cap_header *header;
>  	struct vfio_region_info_cap_sparse_mmap *sparse;
>  	size_t end, size;
> -	int nr_areas = 2, i = 0;
> +	int nr_areas = 2, i = 0, ret;
>  
>  	end = pci_resource_len(vdev->pdev, vdev->msix_bar);
>  
> @@ -572,13 +572,10 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
>  
>  	size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
>  
> -	header = vfio_info_cap_add(caps, size,
> -				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
> -	if (IS_ERR(header))
> -		return PTR_ERR(header);
> +	sparse = kzalloc(size, GFP_KERNEL);
> +	if (!sparse)
> +		return -ENOMEM;
>  
> -	sparse = container_of(header,
> -			      struct vfio_region_info_cap_sparse_mmap, header);
>  	sparse->nr_areas = nr_areas;
>  
>  	if (vdev->msix_offset & PAGE_MASK) {
> @@ -594,26 +591,11 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
>  		i++;
>  	}
>  
> -	return 0;
> -}
> -
> -static int region_type_cap(struct vfio_pci_device *vdev,
> -			   struct vfio_info_cap *caps,
> -			   unsigned int type, unsigned int subtype)
> -{
> -	struct vfio_info_cap_header *header;
> -	struct vfio_region_info_cap_type *cap;
> -
> -	header = vfio_info_cap_add(caps, sizeof(*cap),
> -				   VFIO_REGION_INFO_CAP_TYPE, 1);
> -	if (IS_ERR(header))
> -		return PTR_ERR(header);
> +	ret = vfio_info_add_capability(info, caps,
> +				      VFIO_REGION_INFO_CAP_SPARSE_MMAP, sparse);
> +	kfree(sparse);
>  
> -	cap = container_of(header, struct vfio_region_info_cap_type, header);
> -	cap->type = type;
> -	cap->subtype = subtype;
> -
> -	return 0;
> +	return ret;
>  }
>  
>  int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
> @@ -704,7 +686,8 @@ static long vfio_pci_ioctl(void *device_data,
>  			if (vdev->bar_mmap_supported[info.index]) {
>  				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
>  				if (info.index == vdev->msix_bar) {
> -					ret = msix_sparse_mmap_cap(vdev, &caps);
> +					ret = msix_sparse_mmap_cap(vdev, &info,
> +								   &caps);
>  					if (ret)
>  						return ret;
>  				}
> @@ -752,6 +735,9 @@ static long vfio_pci_ioctl(void *device_data,
>  
>  			break;
>  		default:
> +		{
> +			struct vfio_region_info_cap_type cap_type;
> +
>  			if (info.index >=
>  			    VFIO_PCI_NUM_REGIONS + vdev->num_regions)
>  				return -EINVAL;
> @@ -762,27 +748,23 @@ static long vfio_pci_ioctl(void *device_data,
>  			info.size = vdev->region[i].size;
>  			info.flags = vdev->region[i].flags;
>  
> -			ret = region_type_cap(vdev, &caps,
> -					      vdev->region[i].type,
> -					      vdev->region[i].subtype);
> +			cap_type.type = vdev->region[i].type;
> +			cap_type.subtype = vdev->region[i].subtype;
> +
> +			ret = vfio_info_add_capability(&info, &caps,
> +						      VFIO_REGION_INFO_CAP_TYPE,
> +						      &cap_type);
>  			if (ret)
>  				return ret;
> +
> +		}
>  		}
>  
> -		if (caps.size) {
> -			info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
> -			if (info.argsz < sizeof(info) + caps.size) {
> -				info.argsz = sizeof(info) + caps.size;
> -				info.cap_offset = 0;
> -			} else {
> -				vfio_info_cap_shift(&caps, sizeof(info));
> -				if (copy_to_user((void __user *)arg +
> -						  sizeof(info), caps.buf,
> -						  caps.size)) {
> -					kfree(caps.buf);
> -					return -EFAULT;
> -				}
> -				info.cap_offset = sizeof(info);

I prefer the case above, I'm fine with breaking out helpers to build a
buffer containing the capability chain, but I would rather have the
caller manage placing that back into the return structure.  That also
allows the helper to be independent of the structure we're operating
on, it could be a region_info, irq_info, device_info, etc.  It only
needs to know the layout of the capability type we're trying to add,
not the info structure itself.  Thanks,

Alex
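
For illustration, the split being suggested would make the helper
operate only on the capability chain, e.g. (the signature is a sketch,
not a final API):

	int vfio_info_add_capability(struct vfio_info_cap *caps,
				     int cap_type_id, void *cap_type);

The caller then sets info->flags and info->cap_offset and copies
caps.buf back to userspace itself, as in the vfio_pci code above.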

> +		if (info.cap_offset) {
> +			if (copy_to_user((void __user *)arg + info.cap_offset,
> +					 caps.buf, caps.size)) {
> +				kfree(caps.buf);
> +				return -EFAULT;
>  			}
>  
>  			kfree(caps.buf);

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/12] vfio: Add function to get device_api string from vfio_device_info.flags
  2016-10-17 21:22 ` [PATCH v9 10/12] vfio: Add function to get device_api string from vfio_device_info.flags Kirti Wankhede
@ 2016-10-20 19:34   ` Alex Williamson
  2016-10-20 20:29     ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2016-10-20 19:34 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Tue, 18 Oct 2016 02:52:10 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Function vfio_device_api_string() returns string based on flag set in
> vfio_device_info's flag. This should be used by vendor driver to get string
> based on flag for device_api attribute.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I42d29f475f02a7132ce13297fbf2b48f1da10995
> ---
>  drivers/vfio/vfio.c  | 15 +++++++++++++++
>  include/linux/vfio.h |  2 ++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 10ef1c5fa762..aec470454a13 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1917,6 +1917,21 @@ int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int num_irqs,
>  }
>  EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
>  
> +const char *vfio_device_api_string(u32 flags)
> +{
> +	if (flags & VFIO_DEVICE_FLAGS_PCI)
> +		return "vfio-pci";
> +
> +	if (flags & VFIO_DEVICE_FLAGS_PLATFORM)
> +		return "vfio-platform";
> +
> +	if (flags & VFIO_DEVICE_FLAGS_AMBA)
> +		return "vfio-amba";
> +
> +	return "";
> +}
> +EXPORT_SYMBOL(vfio_device_api_string);
> +
>  /*
>   * Pin a set of guest PFNs and return their associated host PFNs for local
>   * domain only.
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 31d059f1649b..fca2bf23c4f1 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -116,6 +116,8 @@ extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
>  					      int num_irqs, int max_irq_type,
>  					      size_t *data_size);
>  
> +extern const char *vfio_device_api_string(u32 flags);
> +
>  struct pci_dev;
>  #ifdef CONFIG_EEH
>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);

Couldn't this simply be a #define in the uapi header?

#define VFIO_DEVICE_PCI_API_STRING "vfio-pci"

I don't really see why we need a lookup function.
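
For illustration, that could be as simple as the sketch below (macro
names are assumptions only):

#define VFIO_DEVICE_API_PCI_STRING		"vfio-pci"
#define VFIO_DEVICE_API_PLATFORM_STRING		"vfio-platform"
#define VFIO_DEVICE_API_AMBA_STRING		"vfio-amba"

with the vendor driver choosing the string directly rather than mapping
it from vfio_device_info.flags.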

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-19 21:02   ` Alex Williamson
@ 2016-10-20 20:17     ` Kirti Wankhede
  2016-10-24  2:32       ` Alex Williamson
  2016-10-26  7:53     ` Tian, Kevin
  2016-10-26  7:54     ` Tian, Kevin
  2 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-20 20:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel


Alex,

Addressing your comments other than invalidation part.

On 10/20/2016 2:32 AM, Alex Williamson wrote:
> On Tue, 18 Oct 2016 02:52:04 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
...
>> Tested by assigning below combinations of devices to a single VM:
>> - GPU pass through only
>> - vGPU device only
>> - One GPU pass through and one vGPU device
>> - Linux VM hot plug and unplug vGPU device while GPU pass through device
>>   exist
>> - Linux VM hot plug and unplug GPU pass through device while vGPU device
>>   exist
> 
> Were you able to do these with the locked memory limit of the user set
> to the minimum required for existing GPU assignment?
> 

No. Is there a way to set the memory limit through libvirt so that it
matches the system memory assigned to the VM?

>>
...
>> +	container = group->container;
>> +	if (IS_ERR(container)) {
> 
> I don't see that we ever use an ERR_PTR to set group->container, it
> should either be NULL or valid and the fact that we added ourselves to
> container_users should mean that it's valid.  The paranoia test here
> would be if container is NULL, but IS_ERR() doesn't check NULL.  If we
> need that paranoia test, maybe we should just:
> 
> if (WARN_ON(!container)) {
> 
> I'm not fully convinced it's needed though.
> 

Ok removing this check.

>> +		ret = PTR_ERR(container);
>> +		goto err_pin_pages;
>> +	}
>> +
>> +	down_read(&container->group_lock);
>> +
>> +	driver = container->iommu_driver;
>> +	if (likely(driver && driver->ops->pin_pages))
>> +		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
>> +					     npage, prot, phys_pfn);
> 
> The caller is going to need to provide some means for us to callback to
> invalidate pinned pages.
> 
> ret has already been used, so it's zero at this point.  I expect the
> original intention was to let the initialization above fall through
> here so that the caller gets an errno if the driver doesn't support
> pin_pages.  Returning zero without actually doing anything seems like
> an unexpected return value.
> 

yes, changing it to:

driver = container->iommu_driver;
if (likely(driver && driver->ops->pin_pages))
        ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
                                     npage, prot, phys_pfn);
else
        ret = -EINVAL;




>> +static int vfio_pfn_account(struct vfio_iommu *iommu, unsigned long pfn)
>> +{
>> +	struct vfio_pfn *p;
>> +	struct vfio_domain *domain = iommu->local_domain;
>> +	int ret = 1;
>> +
>> +	if (!domain)
>> +		return 1;
>> +
>> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +	p = vfio_find_pfn(domain, pfn);
>> +	if (p)
>> +		ret = 0;
>> +
>> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> +	return ret;
>> +}
> 
> So if the vfio_pfn for a given pfn exists, return 0, else return 1.
> But do we know that the vfio_pfn exists at the point where we actually
> do that accounting?
>

Only the functions below call vfio_pfn_account():
__vfio_pin_pages_remote() -> vfio_pfn_account()
__vfio_unpin_pages_remote() -> vfio_pfn_account()

Consider the case where an mdev device is already assigned to a VM, some
app in the VM pins some pages, and then a pass-through device is
hotplugged. __vfio_pin_pages_remote() is then called from
vfio_iommu_replay() to pin all pages when the iommu-capable domain is
attached to the container. If a vfio_pfn already exists at that point,
the page was pinned through the local_domain while no iommu-capable
domain was present, so accounting was already done for those pages.
Hence 0 is returned here, which means: don't add this page to the
accounting.
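
To make that rule concrete, a minimal sketch (illustrative only; iommu,
npage and pfn_base come from the surrounding pin path, and the pfns are
assumed contiguous as in __vfio_pin_pages_remote()):

/*
 * vfio_pfn_account() returns 1 for a pfn that still has to be charged
 * to locked_vm and 0 for a pfn the local domain already pinned (and
 * therefore already accounted).
 */
long lock_acct = 0;
long i;

for (i = 0; i < npage; i++)
	lock_acct += vfio_pfn_account(iommu, *pfn_base + i);

vfio_lock_acct(current, lock_acct);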


>> +
>>  struct vwork {
>>  	struct mm_struct	*mm;
>>  	long			npage;
>> @@ -150,17 +269,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
>>  	kfree(vwork);
>>  }
>>  
>> -static void vfio_lock_acct(long npage)
>> +static void vfio_lock_acct(struct task_struct *task, long npage)
>>  {
>>  	struct vwork *vwork;
>>  	struct mm_struct *mm;
>>  
>> -	if (!current->mm || !npage)
>> +	if (!task->mm || !npage)
>>  		return; /* process exited or nothing to do */
>>  
>> -	if (down_write_trylock(&current->mm->mmap_sem)) {
>> -		current->mm->locked_vm += npage;
>> -		up_write(&current->mm->mmap_sem);
>> +	if (down_write_trylock(&task->mm->mmap_sem)) {
>> +		task->mm->locked_vm += npage;
>> +		up_write(&task->mm->mmap_sem);
>>  		return;
>>  	}
>>  
>> @@ -172,7 +291,7 @@ static void vfio_lock_acct(long npage)
>>  	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
>>  	if (!vwork)
>>  		return;
>> -	mm = get_task_mm(current);
>> +	mm = get_task_mm(task);
>>  	if (!mm) {
>>  		kfree(vwork);
>>  		return;
>> @@ -228,20 +347,31 @@ static int put_pfn(unsigned long pfn, int prot)
>>  	return 0;
>>  }
> 
> This coversion of vfio_lock_acct() to pass a task_struct and updating
> existing callers to pass current would be a great separate, easily
> review-able patch.
>

Ok. I'll split this in separate commit.


>>  
>> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>> +			 int prot, unsigned long *pfn)
>>  {
>>  	struct page *page[1];
>>  	struct vm_area_struct *vma;
>> +	struct mm_struct *local_mm = (mm ? mm : current->mm);
>>  	int ret = -EFAULT;
>>  
>> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
>> +	if (mm) {
>> +		down_read(&local_mm->mmap_sem);
>> +		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
>> +					!!(prot & IOMMU_WRITE), 0, page, NULL);
>> +		up_read(&local_mm->mmap_sem);
>> +	} else
>> +		ret = get_user_pages_fast(vaddr, 1,
>> +					  !!(prot & IOMMU_WRITE), page);
>> +
>> +	if (ret == 1) {
>>  		*pfn = page_to_pfn(page[0]);
>>  		return 0;
>>  	}
>>  
>> -	down_read(&current->mm->mmap_sem);
>> +	down_read(&local_mm->mmap_sem);
>>  
>> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
>> +	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
>>  
>>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>> @@ -249,7 +379,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>>  			ret = 0;
>>  	}
>>  
>> -	up_read(&current->mm->mmap_sem);
>> +	up_read(&local_mm->mmap_sem);
>>  
>>  	return ret;
>>  }
> 
> This would also be a great separate patch.

Ok.

>  Have you considered
> renaming the mm_struct function arg to "remote_mm" and making the local
> variable simply "mm"?  It seems like it would tie nicely with the
> remote_mm path using get_user_pages_remote() while passing NULL for
> remote_mm uses current->mm and the existing path (and avoid the general
> oddness of passing local_mm to a "remote" function).
> 

Yes, your suggestion looks good. Updating.
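
For reference, a sketch of what the renamed helper could look like (the
VM_PFNMAP fallback from the original function is omitted here):

static int vaddr_get_pfn(struct mm_struct *remote_mm, unsigned long vaddr,
			 int prot, unsigned long *pfn)
{
	struct mm_struct *mm = remote_mm ? remote_mm : current->mm;
	struct page *page[1];
	int ret;

	if (remote_mm) {
		down_read(&mm->mmap_sem);
		ret = get_user_pages_remote(NULL, mm, vaddr, 1,
					    !!(prot & IOMMU_WRITE), 0,
					    page, NULL);
		up_read(&mm->mmap_sem);
	} else {
		ret = get_user_pages_fast(vaddr, 1,
					  !!(prot & IOMMU_WRITE), page);
	}

	if (ret == 1) {
		*pfn = page_to_pfn(page[0]);
		return 0;
	}

	return -EFAULT;
}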


>> @@ -259,33 +389,37 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>>   * first page and all consecutive pages with the same locking.
>>   */
>> -static long vfio_pin_pages(unsigned long vaddr, long npage,
>> -			   int prot, unsigned long *pfn_base)
>> +static long __vfio_pin_pages_remote(struct vfio_iommu *iommu,
>> +				    unsigned long vaddr, long npage,
>> +				    int prot, unsigned long *pfn_base)

...


>> @@ -303,8 +437,10 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>>  			break;
>>  		}
>>  
>> +		lock_acct += vfio_pfn_account(iommu, pfn);
>> +
> 
> I take it that this is the new technique for keeping the accounting
> accurate, we only increment the locked accounting by the amount not
> already pinned in a vfio_pfn.
>

That's correct.


>>  		if (!rsvd && !lock_cap &&
>> -		    current->mm->locked_vm + i + 1 > limit) {
>> +		    current->mm->locked_vm + lock_acct > limit) {
>>  			put_pfn(pfn, prot);
>>  			pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
>>  				__func__, limit << PAGE_SHIFT);
>> @@ -313,23 +449,216 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>>  	}
>>  
>>  	if (!rsvd)
>> -		vfio_lock_acct(i);
>> +		vfio_lock_acct(current, lock_acct);
>>  
>>  	return i;
>>  }
>>  
>> -static long vfio_unpin_pages(unsigned long pfn, long npage,
>> -			     int prot, bool do_accounting)
>> +static long __vfio_unpin_pages_remote(struct vfio_iommu *iommu,
>> +				      unsigned long pfn, long npage, int prot,
>> +				      bool do_accounting)
> 
> Have you noticed that it's kind of confusing that
> __vfio_{un}pin_pages_remote() uses current, which does a
> get_user_pages_fast() while "local" uses a provided task_struct and
> uses get_user_pages_*remote*()?  And also what was effectively local
> (ie. we're pinning for our own use here) is now "remote" and pinning
> for a remote, vendor driver consumer, is now "local".  It's not very
> intuitive.
> 

'local' in local_domain was suggested to describe the domain used for
local page tracking. Earlier suggestions to use 'mdev' or 'noiommu' in
the name were discarded. Maybe we should revisit what the name should
be. Any suggestion?

For the local_domain, the flow to pin pages is:

for local_domain
    |- vfio_pin_pages()
        |- vfio_iommu_type1_pin_pages()
            |- __vfio_pin_page_local()
                |-  vaddr_get_pfn(task->mm)
                    |- get_user_pages_remote()

__vfio_pin_page_local() --> get_user_pages_remote()



>>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  					 struct iommu_group *iommu_group)
>>  {
>>  	struct vfio_iommu *iommu = iommu_data;
>> -	struct vfio_group *group, *g;
>> +	struct vfio_group *group;
>>  	struct vfio_domain *domain, *d;
>>  	struct bus_type *bus = NULL;
>>  	int ret;
>> @@ -746,10 +1136,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  	mutex_lock(&iommu->lock);
>>  
>>  	list_for_each_entry(d, &iommu->domain_list, next) {
>> -		list_for_each_entry(g, &d->group_list, next) {
>> -			if (g->iommu_group != iommu_group)
>> -				continue;
>> +		if (find_iommu_group(d, iommu_group)) {
>> +			mutex_unlock(&iommu->lock);
>> +			return -EINVAL;
>> +		}
>> +	}
> 
> The find_iommu_group() conversion would also be an easy separate patch.
> 

Ok.

>>  
>> +	if (iommu->local_domain) {
>> +		if (find_iommu_group(iommu->local_domain, iommu_group)) {
>>  			mutex_unlock(&iommu->lock);
>>  			return -EINVAL;
>>  		}
>> @@ -769,6 +1163,30 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>  	if (ret)
>>  		goto out_free;
>>  
>> +	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
>> +	    (bus == &mdev_bus_type)) {
>> +		if (!iommu->local_domain) {
>> +			domain->local_addr_space =
>> +				kzalloc(sizeof(*domain->local_addr_space),
>> +						GFP_KERNEL);
>> +			if (!domain->local_addr_space) {
>> +				ret = -ENOMEM;
>> +				goto out_free;
>> +			}
>> +
>> +			domain->local_addr_space->task = current;
>> +			INIT_LIST_HEAD(&domain->group_list);
>> +			domain->local_addr_space->pfn_list = RB_ROOT;
>> +			mutex_init(&domain->local_addr_space->pfn_list_lock);
>> +			iommu->local_domain = domain;
>> +		} else
>> +			kfree(domain);
>> +
>> +		list_add(&group->next, &domain->group_list);
> 
> I think you mean s/domain/iommu->local_domain/ here, we just freed
> domain in the else path.
> 

Yes, corrected.

>> +		mutex_unlock(&iommu->lock);
>> +		return 0;
>> +	}
>> +
>>  	domain->domain = iommu_domain_alloc(bus);
>>  	if (!domain->domain) {
>>  		ret = -EIO;
>> @@ -859,6 +1277,41 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
>>  		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
>>  }
>>  
>> +static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
>> +{
>> +	struct vfio_domain *domain = iommu->local_domain;
>> +	struct vfio_dma *dma, *tdma;
>> +	struct rb_node *n;
>> +	long locked = 0;
>> +
>> +	rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
>> +					     node) {
>> +		vfio_unmap_unpin(iommu, dma);
>> +	}
>> +
>> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +	n = rb_first(&domain->local_addr_space->pfn_list);
>> +
>> +	for (; n; n = rb_next(n))
>> +		locked++;
>> +
>> +	vfio_lock_acct(domain->local_addr_space->task, locked);
>> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> +}
> 
> Couldn't a properly timed mlock by the user allow them to lock more
> memory than they're allowed here?  For instance imagine the vendor
> driver has pinned the entire VM memory and the user has exactly the
> locked memory limit for that VM.  During the gap here between unpinning
> the entire vfio_dma list and re-accounting for the pfn_list, the user
> can mlock up to their limit again an now they've doubled the locked
> memory they're allowed.
> 

As per the original code, vfio_unmap_unpin() calls
__vfio_unpin_pages_remote(.., false) with do_accounting set to false.
Why is that so?

If do_accounting were set to true there, we wouldn't have to do the
re-accounting here.

>> +
>> +static void vfio_local_unpin_all(struct vfio_domain *domain)
>> +{
>> +	struct rb_node *node;
>> +
>> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +	while ((node = rb_first(&domain->local_addr_space->pfn_list)))
>> +		vfio_unpin_pfn(domain,
>> +				rb_entry(node, struct vfio_pfn, node), false);
>> +
>> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> +}
>> +
>>  static void vfio_iommu_type1_detach_group(void *iommu_data,
>>  					  struct iommu_group *iommu_group)
>>  {
>> @@ -868,31 +1321,57 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>>  
>>  	mutex_lock(&iommu->lock);
>>  
>> -	list_for_each_entry(domain, &iommu->domain_list, next) {
>> -		list_for_each_entry(group, &domain->group_list, next) {
>> -			if (group->iommu_group != iommu_group)
>> -				continue;
>> -
>> -			iommu_detach_group(domain->domain, iommu_group);
>> +	if (iommu->local_domain) {
>> +		domain = iommu->local_domain;
>> +		group = find_iommu_group(domain, iommu_group);
>> +		if (group) {
>>  			list_del(&group->next);
>>  			kfree(group);
>> -			/*
>> -			 * Group ownership provides privilege, if the group
>> -			 * list is empty, the domain goes away.  If it's the
>> -			 * last domain, then all the mappings go away too.
>> -			 */
>> +
>>  			if (list_empty(&domain->group_list)) {
>> -				if (list_is_singular(&iommu->domain_list))
>> +				vfio_local_unpin_all(domain);
>> +				if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
>>  					vfio_iommu_unmap_unpin_all(iommu);
>> -				iommu_domain_free(domain->domain);
>> -				list_del(&domain->next);
>>  				kfree(domain);
>> +				iommu->local_domain = NULL;
>> +			}
> 
> 
> I can't quite wrap my head around this, if we have mdev groups attached
> and this iommu group matches an mdev group, remove from list and free
> the group.  If there are now no more groups in the mdev group list,
> then for each vfio_pfn, unpin the pfn, /without/ doing accounting
> udpates 

corrected the code to do accounting here.

> and remove the vfio_pfn, but only if the ref_count is now
> zero.

Yes. The loop in vfio_local_unpin_all() iterates as long as a node
exists in the rb tree:

>> +	while ((node = rb_first(&domain->local_addr_space->pfn_list)))
>> +		vfio_unpin_pfn(domain,
>> +				rb_entry(node, struct vfio_pfn, node), false);
>> +


and vfio_unpin_pfn() only removes the node from the rb tree once the ref
count drops to zero:

static int vfio_unpin_pfn(struct vfio_domain *domain,
                          struct vfio_pfn *vpfn, bool do_accounting)
{
        __vfio_unpin_page_local(domain, vpfn->pfn, vpfn->prot,
                                do_accounting);

        if (atomic_dec_and_test(&vpfn->ref_count))
                vfio_remove_from_pfn_list(domain, vpfn);

        return 1;
}

So, for example, for a vfio_pfn with a ref_count of 2, the first
iteration would:
 - call __vfio_unpin_page_local()
 - atomic_dec(ref_count), so ref_count is now 1, but the node is not
removed from the rb tree.

The next iteration would:
 - call __vfio_unpin_page_local()
 - atomic_dec(ref_count), so ref_count is now 0, and the node is removed
from the rb tree.


>  We free the domain, so if the ref_count was non-zero we've now
> just leaked memory.  I think that means that if a vendor driver pins a
> given page twice, that leak occurs.  Furthermore, if there is not an
> iommu capable domain in the container, we remove all the vfio_dma
> entries as well, ok.  Maybe the only issue is those leaked vfio_pfns.
> 

So if a vendor driver pins a page twice, vfio_unpin_pfn() gets called
twice, and the node is only removed from the rb tree once the ref count
reaches zero. So there is no memory leak.

Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/12] vfio: Add function to get device_api string from vfio_device_info.flags
  2016-10-20 19:34   ` Alex Williamson
@ 2016-10-20 20:29     ` Kirti Wankhede
  2016-10-20 21:05       ` Alex Williamson
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-20 20:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel



On 10/21/2016 1:04 AM, Alex Williamson wrote:
> On Tue, 18 Oct 2016 02:52:10 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Function vfio_device_api_string() returns string based on flag set in
>> vfio_device_info's flag. This should be used by vendor driver to get string
>> based on flag for device_api attribute.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Signed-off-by: Neo Jia <cjia@nvidia.com>
>> Change-Id: I42d29f475f02a7132ce13297fbf2b48f1da10995
>> ---
>>  drivers/vfio/vfio.c  | 15 +++++++++++++++
>>  include/linux/vfio.h |  2 ++
>>  2 files changed, 17 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index 10ef1c5fa762..aec470454a13 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -1917,6 +1917,21 @@ int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int num_irqs,
>>  }
>>  EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
>>  
>> +const char *vfio_device_api_string(u32 flags)
>> +{
>> +	if (flags & VFIO_DEVICE_FLAGS_PCI)
>> +		return "vfio-pci";
>> +
>> +	if (flags & VFIO_DEVICE_FLAGS_PLATFORM)
>> +		return "vfio-platform";
>> +
>> +	if (flags & VFIO_DEVICE_FLAGS_AMBA)
>> +		return "vfio-amba";
>> +
>> +	return "";
>> +}
>> +EXPORT_SYMBOL(vfio_device_api_string);
>> +
>>  /*
>>   * Pin a set of guest PFNs and return their associated host PFNs for local
>>   * domain only.
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index 31d059f1649b..fca2bf23c4f1 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -116,6 +116,8 @@ extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
>>  					      int num_irqs, int max_irq_type,
>>  					      size_t *data_size);
>>  
>> +extern const char *vfio_device_api_string(u32 flags);
>> +
>>  struct pci_dev;
>>  #ifdef CONFIG_EEH
>>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);
> 
> Couldn't this simply be a #define in the uapi header?
> 
> #define VFIO_DEVICE_PCI_API_STRING "vfio-pci"
> 
> I don't really see why we need a lookup function.
> 

String is tightly coupled with the FLAG, right?
Instead user need to take care of making sure to return proper string,
and don't mis-match the string, I think having function is easier.
Vendor driver should decide the type of device they want to expose and
set the flag, using this function vendor driver would return string
which is based on flag they set.

Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/12] vfio: Add function to get device_api string from vfio_device_info.flags
  2016-10-20 20:29     ` Kirti Wankhede
@ 2016-10-20 21:05       ` Alex Williamson
  2016-10-20 21:14         ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2016-10-20 21:05 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Fri, 21 Oct 2016 01:59:55 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/21/2016 1:04 AM, Alex Williamson wrote:
> > On Tue, 18 Oct 2016 02:52:10 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Function vfio_device_api_string() returns string based on flag set in
> >> vfio_device_info's flag. This should be used by vendor driver to get string
> >> based on flag for device_api attribute.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Signed-off-by: Neo Jia <cjia@nvidia.com>
> >> Change-Id: I42d29f475f02a7132ce13297fbf2b48f1da10995
> >> ---
> >>  drivers/vfio/vfio.c  | 15 +++++++++++++++
> >>  include/linux/vfio.h |  2 ++
> >>  2 files changed, 17 insertions(+)
> >>
> >> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >> index 10ef1c5fa762..aec470454a13 100644
> >> --- a/drivers/vfio/vfio.c
> >> +++ b/drivers/vfio/vfio.c
> >> @@ -1917,6 +1917,21 @@ int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int num_irqs,
> >>  }
> >>  EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
> >>  
> >> +const char *vfio_device_api_string(u32 flags)
> >> +{
> >> +	if (flags & VFIO_DEVICE_FLAGS_PCI)
> >> +		return "vfio-pci";
> >> +
> >> +	if (flags & VFIO_DEVICE_FLAGS_PLATFORM)
> >> +		return "vfio-platform";
> >> +
> >> +	if (flags & VFIO_DEVICE_FLAGS_AMBA)
> >> +		return "vfio-amba";
> >> +
> >> +	return "";
> >> +}
> >> +EXPORT_SYMBOL(vfio_device_api_string);
> >> +
> >>  /*
> >>   * Pin a set of guest PFNs and return their associated host PFNs for local
> >>   * domain only.
> >> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >> index 31d059f1649b..fca2bf23c4f1 100644
> >> --- a/include/linux/vfio.h
> >> +++ b/include/linux/vfio.h
> >> @@ -116,6 +116,8 @@ extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
> >>  					      int num_irqs, int max_irq_type,
> >>  					      size_t *data_size);
> >>  
> >> +extern const char *vfio_device_api_string(u32 flags);
> >> +
> >>  struct pci_dev;
> >>  #ifdef CONFIG_EEH
> >>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);  
> > 
> > Couldn't this simply be a #define in the uapi header?
> > 
> > #define VFIO_DEVICE_PCI_API_STRING "vfio-pci"
> > 
> > I don't really see why we need a lookup function.
> >   
> 
> String is tightly coupled with the FLAG, right?
> Instead user need to take care of making sure to return proper string,
> and don't mis-match the string, I think having function is easier.

That's exactly why I proposed putting the #define string in the uapi,
by that I mean the vfio uapi header.  That keeps the tight coupling to
the flag, they're both defined in the same place, plus it gives
userspace a reference so they're not just inventing a string to compare
against.  IOW, the vendor driver simply does an sprintf of
VFIO_DEVICE_PCI_API_STRING and userspace (ie. libvirt) can do a strcmp
with VFIO_DEVICE_PCI_API_STRING from the same header and everybody
arrives at the same result.
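
As a minimal userspace sketch, assuming the string define lands in the
vfio uapi header next to the flag (name as in the example above, error
handling trimmed; the device_api sysfs attribute comes from the mdev
documentation patch):

#include <stdio.h>
#include <string.h>
#include <linux/vfio.h>

/* e.g. .../mdev_supported_types/<type-id>/device_api */
static int type_is_vfio_pci(const char *device_api_path)
{
	char api[64] = "";
	FILE *f = fopen(device_api_path, "r");

	if (!f)
		return 0;
	if (!fgets(api, sizeof(api), f))
		api[0] = '\0';
	fclose(f);
	api[strcspn(api, "\n")] = '\0';

	return strcmp(api, VFIO_DEVICE_PCI_API_STRING) == 0;
}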

> Vendor driver should decide the type of device they want to expose and
> set the flag, using this function vendor driver would return string
> which is based on flag they set.

Being a function adds no intrinsic value and being in a uapi header does
add value to userspace.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/12] vfio: Add function to get device_api string from vfio_device_info.flags
  2016-10-20 21:05       ` Alex Williamson
@ 2016-10-20 21:14         ` Kirti Wankhede
  2016-10-20 21:22           ` Alex Williamson
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-20 21:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel



On 10/21/2016 2:35 AM, Alex Williamson wrote:
> On Fri, 21 Oct 2016 01:59:55 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 10/21/2016 1:04 AM, Alex Williamson wrote:
>>> On Tue, 18 Oct 2016 02:52:10 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> Function vfio_device_api_string() returns string based on flag set in
>>>> vfio_device_info's flag. This should be used by vendor driver to get string
>>>> based on flag for device_api attribute.
>>>>
>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>> Signed-off-by: Neo Jia <cjia@nvidia.com>
>>>> Change-Id: I42d29f475f02a7132ce13297fbf2b48f1da10995
>>>> ---
>>>>  drivers/vfio/vfio.c  | 15 +++++++++++++++
>>>>  include/linux/vfio.h |  2 ++
>>>>  2 files changed, 17 insertions(+)
>>>>
>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>> index 10ef1c5fa762..aec470454a13 100644
>>>> --- a/drivers/vfio/vfio.c
>>>> +++ b/drivers/vfio/vfio.c
>>>> @@ -1917,6 +1917,21 @@ int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int num_irqs,
>>>>  }
>>>>  EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
>>>>  
>>>> +const char *vfio_device_api_string(u32 flags)
>>>> +{
>>>> +	if (flags & VFIO_DEVICE_FLAGS_PCI)
>>>> +		return "vfio-pci";
>>>> +
>>>> +	if (flags & VFIO_DEVICE_FLAGS_PLATFORM)
>>>> +		return "vfio-platform";
>>>> +
>>>> +	if (flags & VFIO_DEVICE_FLAGS_AMBA)
>>>> +		return "vfio-amba";
>>>> +
>>>> +	return "";
>>>> +}
>>>> +EXPORT_SYMBOL(vfio_device_api_string);
>>>> +
>>>>  /*
>>>>   * Pin a set of guest PFNs and return their associated host PFNs for local
>>>>   * domain only.
>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>> index 31d059f1649b..fca2bf23c4f1 100644
>>>> --- a/include/linux/vfio.h
>>>> +++ b/include/linux/vfio.h
>>>> @@ -116,6 +116,8 @@ extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
>>>>  					      int num_irqs, int max_irq_type,
>>>>  					      size_t *data_size);
>>>>  
>>>> +extern const char *vfio_device_api_string(u32 flags);
>>>> +
>>>>  struct pci_dev;
>>>>  #ifdef CONFIG_EEH
>>>>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);  
>>>
>>> Couldn't this simply be a #define in the uapi header?
>>>
>>> #define VFIO_DEVICE_PCI_API_STRING "vfio-pci"
>>>
>>> I don't really see why we need a lookup function.
>>>   
>>
>> String is tightly coupled with the FLAG, right?
>> Instead user need to take care of making sure to return proper string,
>> and don't mis-match the string, I think having function is easier.
> 
> That's exactly why I proposed putting the #define string in the uapi,
> by that I mean the vfio uapi header.  That keeps the tight coupling to
> the flag, they're both defined in the same place, plus it gives
> userspace a reference so they're not just inventing a string to compare
> against.  IOW, the vendor driver simply does an sprintf of
> VFIO_DEVICE_PCI_API_STRING and userspace (ie. libvirt) can do a strcmp
> with VFIO_DEVICE_PCI_API_STRING from the same header and everybody
> arrives at the same result.
> 
>> Vendor driver should decide the type of device they want to expose and
>> set the flag, using this function vendor driver would return string
>> which is based on flag they set.
> 
> Being a function adds no intrinsic value and being in a uapi header does
> add value to userspace.  Thanks,
> 

Ok. The strings should be in uapi, but having function (like below) to
return proper string based on flag would be good to have for vendor driver.

 +const char *vfio_device_api_string(u32 flags)
 +{
 +	if (flags & VFIO_DEVICE_FLAGS_PCI)
 +		return VFIO_DEVICE_API_PCI_STRING;
 +
 +	if (flags & VFIO_DEVICE_FLAGS_PLATFORM)
 +		return VFIO_DEVICE_API_PLATFORM_STRING;
 +
 +	if (flags & VFIO_DEVICE_FLAGS_AMBA)
 +		return VFIO_DEVICE_API_AMBA_STRING;
 +
 +	return "";
 +}
 +EXPORT_SYMBOL(vfio_device_api_string);

Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/12] vfio: Add function to get device_api string from vfio_device_info.flags
  2016-10-20 21:14         ` Kirti Wankhede
@ 2016-10-20 21:22           ` Alex Williamson
  2016-10-21  3:00             ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2016-10-20 21:22 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Fri, 21 Oct 2016 02:44:37 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/21/2016 2:35 AM, Alex Williamson wrote:
> > On Fri, 21 Oct 2016 01:59:55 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 10/21/2016 1:04 AM, Alex Williamson wrote:  
> >>> On Tue, 18 Oct 2016 02:52:10 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> Function vfio_device_api_string() returns string based on flag set in
> >>>> vfio_device_info's flag. This should be used by vendor driver to get string
> >>>> based on flag for device_api attribute.
> >>>>
> >>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>>> Signed-off-by: Neo Jia <cjia@nvidia.com>
> >>>> Change-Id: I42d29f475f02a7132ce13297fbf2b48f1da10995
> >>>> ---
> >>>>  drivers/vfio/vfio.c  | 15 +++++++++++++++
> >>>>  include/linux/vfio.h |  2 ++
> >>>>  2 files changed, 17 insertions(+)
> >>>>
> >>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >>>> index 10ef1c5fa762..aec470454a13 100644
> >>>> --- a/drivers/vfio/vfio.c
> >>>> +++ b/drivers/vfio/vfio.c
> >>>> @@ -1917,6 +1917,21 @@ int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int num_irqs,
> >>>>  }
> >>>>  EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
> >>>>  
> >>>> +const char *vfio_device_api_string(u32 flags)
> >>>> +{
> >>>> +	if (flags & VFIO_DEVICE_FLAGS_PCI)
> >>>> +		return "vfio-pci";
> >>>> +
> >>>> +	if (flags & VFIO_DEVICE_FLAGS_PLATFORM)
> >>>> +		return "vfio-platform";
> >>>> +
> >>>> +	if (flags & VFIO_DEVICE_FLAGS_AMBA)
> >>>> +		return "vfio-amba";
> >>>> +
> >>>> +	return "";
> >>>> +}
> >>>> +EXPORT_SYMBOL(vfio_device_api_string);
> >>>> +
> >>>>  /*
> >>>>   * Pin a set of guest PFNs and return their associated host PFNs for local
> >>>>   * domain only.
> >>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>>> index 31d059f1649b..fca2bf23c4f1 100644
> >>>> --- a/include/linux/vfio.h
> >>>> +++ b/include/linux/vfio.h
> >>>> @@ -116,6 +116,8 @@ extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
> >>>>  					      int num_irqs, int max_irq_type,
> >>>>  					      size_t *data_size);
> >>>>  
> >>>> +extern const char *vfio_device_api_string(u32 flags);
> >>>> +
> >>>>  struct pci_dev;
> >>>>  #ifdef CONFIG_EEH
> >>>>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);    
> >>>
> >>> Couldn't this simply be a #define in the uapi header?
> >>>
> >>> #define VFIO_DEVICE_PCI_API_STRING "vfio-pci"
> >>>
> >>> I don't really see why we need a lookup function.
> >>>     
> >>
> >> String is tightly coupled with the FLAG, right?
> >> Instead user need to take care of making sure to return proper string,
> >> and don't mis-match the string, I think having function is easier.  
> > 
> > That's exactly why I proposed putting the #define string in the uapi,
> > by that I mean the vfio uapi header.  That keeps the tight coupling to
> > the flag, they're both defined in the same place, plus it gives
> > userspace a reference so they're not just inventing a string to compare
> > against.  IOW, the vendor driver simply does an sprintf of
> > VFIO_DEVICE_PCI_API_STRING and userspace (ie. libvirt) can do a strcmp
> > with VFIO_DEVICE_PCI_API_STRING from the same header and everybody
> > arrives at the same result.
> >   
> >> Vendor driver should decide the type of device they want to expose and
> >> set the flag, using this function vendor driver would return string
> >> which is based on flag they set.  
> > 
> > Being a function adds no intrinsic value and being in a uapi header does
> > add value to userspace.  Thanks,
> >   
> 
> Ok. The strings should be in uapi, but having function (like below) to
> return proper string based on flag would be good to have for vendor driver.
> 
>  +const char *vfio_device_api_string(u32 flags)
>  +{
>  +	if (flags & VFIO_DEVICE_FLAGS_PCI)
>  +		return VFIO_DEVICE_API_PCI_STRING;
>  +
>  +	if (flags & VFIO_DEVICE_FLAGS_PLATFORM)
>  +		return VFIO_DEVICE_API_PLATFORM_STRING;
>  +
>  +	if (flags & VFIO_DEVICE_FLAGS_AMBA)
>  +		return VFIO_DEVICE_API_AMBA_STRING;
>  +
>  +	return "";
>  +}
>  +EXPORT_SYMBOL(vfio_device_api_string);

I disagree, it's pointless maintenance overhead.  It's yet another
function that we need to care about for kABI and it offers almost no
value.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 01/12] vfio: Mediated device Core driver
  2016-10-20 17:12     ` Alex Williamson
@ 2016-10-21  2:41       ` Jike Song
  2016-10-27  5:56       ` Jike Song
  1 sibling, 0 replies; 73+ messages in thread
From: Jike Song @ 2016-10-21  2:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi, linux-kernel

On 10/21/2016 01:12 AM, Alex Williamson wrote:
> On Thu, 20 Oct 2016 15:23:53 +0800
> Jike Song <jike.song@intel.com> wrote:
> 
>> On 10/18/2016 05:22 AM, Kirti Wankhede wrote:
>>> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
>>> new file mode 100644
>>> index 000000000000..7db5ec164aeb
>>> --- /dev/null
>>> +++ b/drivers/vfio/mdev/mdev_core.c
>>> @@ -0,0 +1,372 @@
>>> +/*
>>> + * Mediated device Core Driver
>>> + *
>>> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
>>> + *     Author: Neo Jia <cjia@nvidia.com>
>>> + *	       Kirti Wankhede <kwankhede@nvidia.com>
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License version 2 as
>>> + * published by the Free Software Foundation.
>>> + */
>>> +
>>> +#include <linux/module.h>
>>> +#include <linux/device.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/uuid.h>
>>> +#include <linux/sysfs.h>
>>> +#include <linux/mdev.h>
>>> +
>>> +#include "mdev_private.h"
>>> +
>>> +#define DRIVER_VERSION		"0.1"
>>> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
>>> +#define DRIVER_DESC		"Mediated device Core Driver"
>>> +
>>> +static LIST_HEAD(parent_list);
>>> +static DEFINE_MUTEX(parent_list_lock);
>>> +static struct class_compat *mdev_bus_compat_class;
>>> +  
>>
>>> +
>>> +/*
>>> + * mdev_register_device : Register a device
>>> + * @dev: device structure representing parent device.
>>> + * @ops: Parent device operation structure to be registered.
>>> + *
>>> + * Add device to list of registered parent devices.
>>> + * Returns a negative value on error, otherwise 0.
>>> + */
>>> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
>>> +{
>>> +	int ret = 0;
>>> +	struct parent_device *parent;
>>> +
>>> +	/* check for mandatory ops */
>>> +	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
>>> +		return -EINVAL;
>>> +
>>> +	dev = get_device(dev);
>>> +	if (!dev)
>>> +		return -EINVAL;
>>> +
>>> +	mutex_lock(&parent_list_lock);
>>> +
>>> +	/* Check for duplicate */
>>> +	parent = __find_parent_device(dev);
>>> +	if (parent) {
>>> +		ret = -EEXIST;
>>> +		goto add_dev_err;
>>> +	}
>>> +
>>> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
>>> +	if (!parent) {
>>> +		ret = -ENOMEM;
>>> +		goto add_dev_err;
>>> +	}
>>> +
>>> +	kref_init(&parent->ref);
>>> +
>>> +	parent->dev = dev;
>>> +	parent->ops = ops;
>>> +
>>> +	ret = parent_create_sysfs_files(parent);
>>> +	if (ret) {
>>> +		mutex_unlock(&parent_list_lock);
>>> +		mdev_put_parent(parent);
>>> +		return ret;
>>> +	}
>>> +
>>> +	ret = class_compat_create_link(mdev_bus_compat_class, dev, NULL);
>>> +	if (ret)
>>> +		dev_warn(dev, "Failed to create compatibility class link\n");
>>> +
>>> +	list_add(&parent->next, &parent_list);
>>> +	mutex_unlock(&parent_list_lock);
>>> +
>>> +	dev_info(dev, "MDEV: Registered\n");
>>> +	return 0;
>>> +
>>> +add_dev_err:
>>> +	mutex_unlock(&parent_list_lock);
>>> +	put_device(dev);
>>> +	return ret;
>>> +}
>>> +EXPORT_SYMBOL(mdev_register_device);  
>>
>>> +static int __init mdev_init(void)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = mdev_bus_register();
>>> +	if (ret) {
>>> +		pr_err("Failed to register mdev bus\n");
>>> +		return ret;
>>> +	}
>>> +
>>> +	mdev_bus_compat_class = class_compat_register("mdev_bus");
>>> +	if (!mdev_bus_compat_class) {
>>> +		mdev_bus_unregister();
>>> +		return -ENOMEM;
>>> +	}
>>> +
>>> +	/*
>>> +	 * Attempt to load known vfio_mdev.  This gives us a working environment
>>> +	 * without the user needing to explicitly load vfio_mdev driver.
>>> +	 */
>>> +	request_module_nowait("vfio_mdev");
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +static void __exit mdev_exit(void)
>>> +{
>>> +	class_compat_unregister(mdev_bus_compat_class);
>>> +	mdev_bus_unregister();
>>> +}
>>> +
>>> +module_init(mdev_init)
>>> +module_exit(mdev_exit)  
>>
>> Hi Kirti,
>>
>> There is a possible issue: mdev_bus_register is called from mdev_init,
>> a module_init, equal to device_initcall if builtin to vmlinux; however,
>> the vendor driver, say i915.ko for intel case, have to call
>> mdev_register_device from its module_init: at that time, mdev_init
>> is still not called.
>>
>> Not sure if this issue exists with nvidia.ko. Though in most cases we
>> are expecting users select mdev as a standalone module, we still won't
>> break builtin case.
>>
>>
>> Hi Alex, do you have any suggestion here?
> 
> To fully solve the problem of built-in drivers making use of the mdev
> infrastructure we'd need to make mdev itself builtin and possibly a
> subsystem that is initialized prior to device drivers.  Is that really
> necessary?  Even though i915.ko is often loaded as part of an
> initramfs, most systems still build it as a module.  I would expect
> that standard module dependencies will pull in the necessary mdev and
> vfio modules to make this work correctly.  I can't say that I'm
> prepared to make mdev be a subsystem as would be necessary for builtin
> drivers to make use of.  Perhaps if such a driver exists it could
> somehow do late binding with mdev.  i915 should certainly be tested as
> a builtin driver to make sure it doesn't fail with mdev support added.
> The kvm-vfio device (virt/kvm/vfio.c) makes use of symbol tricks to
> avoid hard dependencies between kvm and vfio, perhaps when builtin to
> the kernel, i915 could use something like that.  Thanks,

Fair enough, I'll use symbol_get to avoid the problem. Thanks!

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/12] vfio: Add function to get device_api string from vfio_device_info.flags
  2016-10-20 21:22           ` Alex Williamson
@ 2016-10-21  3:00             ` Kirti Wankhede
  2016-10-21  3:20               ` Alex Williamson
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-21  3:00 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel



On 10/21/2016 2:52 AM, Alex Williamson wrote:
> On Fri, 21 Oct 2016 02:44:37 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
...

>>>>>>  
>>>>>> +extern const char *vfio_device_api_string(u32 flags);
>>>>>> +
>>>>>>  struct pci_dev;
>>>>>>  #ifdef CONFIG_EEH
>>>>>>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);    
>>>>>
>>>>> Couldn't this simply be a #define in the uapi header?
>>>>>
>>>>> #define VFIO_DEVICE_PCI_API_STRING "vfio-pci"
>>>>>
>>>>> I don't really see why we need a lookup function.
>>>>>     
>>>>
>>>> String is tightly coupled with the FLAG, right?
>>>> Instead user need to take care of making sure to return proper string,
>>>> and don't mis-match the string, I think having function is easier.  
>>>
>>> That's exactly why I proposed putting the #define string in the uapi,
>>> by that I mean the vfio uapi header.  That keeps the tight coupling to
>>> the flag, they're both defined in the same place, plus it gives
>>> userspace a reference so they're not just inventing a string to compare
>>> against.  IOW, the vendor driver simply does an sprintf of
>>> VFIO_DEVICE_PCI_API_STRING and userspace (ie. libvirt) can do a strcmp
>>> with VFIO_DEVICE_PCI_API_STRING from the same header and everybody
>>> arrives at the same result.
>>>   
>>>> Vendor driver should decide the type of device they want to expose and
>>>> set the flag, using this function vendor driver would return string
>>>> which is based on flag they set.  
>>>
>>> Being a function adds no intrinsic value and being in a uapi header does
>>> add value to userspace.  Thanks,
>>>   
>>
>> Ok. The strings should be in uapi, but having function (like below) to
>> return proper string based on flag would be good to have for vendor driver.
>>
>>  +const char *vfio_device_api_string(u32 flags)
>>  +{
>>  +	if (flags & VFIO_DEVICE_FLAGS_PCI)
>>  +		return VFIO_DEVICE_API_PCI_STRING;
>>  +
>>  +	if (flags & VFIO_DEVICE_FLAGS_PLATFORM)
>>  +		return VFIO_DEVICE_API_PLATFORM_STRING;
>>  +
>>  +	if (flags & VFIO_DEVICE_FLAGS_AMBA)
>>  +		return VFIO_DEVICE_API_AMBA_STRING;
>>  +
>>  +	return "";
>>  +}
>>  +EXPORT_SYMBOL(vfio_device_api_string);
> 
> I disagree, it's pointless maintenance overhead.  It's yet another
> function that we need to care about for kABI and it offers almost no
> value.  Thanks,
> 

If a vendor driver sets the VFIO_DEVICE_FLAGS_PLATFORM flag but reports
VFIO_DEVICE_API_PCI_STRING, we don't have a way to verify this in the
kernel driver. Is that acceptable?


Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 10/12] vfio: Add function to get device_api string from vfio_device_info.flags
  2016-10-21  3:00             ` Kirti Wankhede
@ 2016-10-21  3:20               ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-21  3:20 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Fri, 21 Oct 2016 08:30:53 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/21/2016 2:52 AM, Alex Williamson wrote:
> > On Fri, 21 Oct 2016 02:44:37 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> ...
> 
> >>>>>>  
> >>>>>> +extern const char *vfio_device_api_string(u32 flags);
> >>>>>> +
> >>>>>>  struct pci_dev;
> >>>>>>  #ifdef CONFIG_EEH
> >>>>>>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);      
> >>>>>
> >>>>> Couldn't this simply be a #define in the uapi header?
> >>>>>
> >>>>> #define VFIO_DEVICE_PCI_API_STRING "vfio-pci"
> >>>>>
> >>>>> I don't really see why we need a lookup function.
> >>>>>       
> >>>>
> >>>> String is tightly coupled with the FLAG, right?
> >>>> Instead user need to take care of making sure to return proper string,
> >>>> and don't mis-match the string, I think having function is easier.    
> >>>
> >>> That's exactly why I proposed putting the #define string in the uapi,
> >>> by that I mean the vfio uapi header.  That keeps the tight coupling to
> >>> the flag, they're both defined in the same place, plus it gives
> >>> userspace a reference so they're not just inventing a string to compare
> >>> against.  IOW, the vendor driver simply does an sprintf of
> >>> VFIO_DEVICE_PCI_API_STRING and userspace (ie. libvirt) can do a strcmp
> >>> with VFIO_DEVICE_PCI_API_STRING from the same header and everybody
> >>> arrives at the same result.
> >>>     
> >>>> Vendor driver should decide the type of device they want to expose and
> >>>> set the flag, using this function vendor driver would return string
> >>>> which is based on flag they set.    
> >>>
> >>> Being a function adds no intrinsic value and being in a uapi header does
> >>> add value to userspace.  Thanks,
> >>>     
> >>
> >> Ok. The strings should be in uapi, but having function (like below) to
> >> return proper string based on flag would be good to have for vendor driver.
> >>
> >>  +const char *vfio_device_api_string(u32 flags)
> >>  +{
> >>  +	if (flags & VFIO_DEVICE_FLAGS_PCI)
> >>  +		return VFIO_DEVICE_API_PCI_STRING;
> >>  +
> >>  +	if (flags & VFIO_DEVICE_FLAGS_PLATFORM)
> >>  +		return VFIO_DEVICE_API_PLATFORM_STRING;
> >>  +
> >>  +	if (flags & VFIO_DEVICE_FLAGS_AMBA)
> >>  +		return VFIO_DEVICE_API_AMBA_STRING;
> >>  +
> >>  +	return "";
> >>  +}
> >>  +EXPORT_SYMBOL(vfio_device_api_string);  
> > 
> > I disagree, it's pointless maintenance overhead.  It's yet another
> > function that we need to care about for kABI and it offers almost no
> > value.  Thanks,
> >   
> 
> If a vendor driver sets the VFIO_DEVICE_FLAGS_PLATFORM flag but reports
> VFIO_DEVICE_API_PCI_STRING, we don't have a way to verify this in the
> kernel driver. Is that acceptable?

a) The function doesn't solve that problem, as seen in the mtty sample
driver there's no guarantee that the vendor driver passes
device_info.flags vs the #define for the flag itself.  So we're already
expecting them to get it right in two separate places.

b) It's not going to take them very long to figure this out if they
care about userspace tools that make use of this field to determine the
device API.  This is a very obvious and simple bug.

c) We can check and correct how open source vendor drivers work.

Thanks,
Alex

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-17 21:22 ` [PATCH v9 04/12] vfio iommu: Add support for mediated devices Kirti Wankhede
  2016-10-19 21:02   ` Alex Williamson
@ 2016-10-21  7:49   ` Jike Song
  2016-10-21 14:36     ` Alex Williamson
  2016-10-27  7:20   ` [Qemu-devel] " Alexey Kardashevskiy
  2 siblings, 1 reply; 73+ messages in thread
From: Jike Song @ 2016-10-21  7:49 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi, linux-kernel

On 10/18/2016 05:22 AM, Kirti Wankhede wrote:
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 2ba19424e4a1..5d67058a611d 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
[snip]
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					 struct iommu_group *iommu_group)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
> -	struct vfio_group *group, *g;
> +	struct vfio_group *group;
>  	struct vfio_domain *domain, *d;
>  	struct bus_type *bus = NULL;
>  	int ret;
> @@ -746,10 +1136,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	mutex_lock(&iommu->lock);
>  
>  	list_for_each_entry(d, &iommu->domain_list, next) {
> -		list_for_each_entry(g, &d->group_list, next) {
> -			if (g->iommu_group != iommu_group)
> -				continue;
> +		if (find_iommu_group(d, iommu_group)) {
> +			mutex_unlock(&iommu->lock);
> +			return -EINVAL;
> +		}
> +	}
>  
> +	if (iommu->local_domain) {
> +		if (find_iommu_group(iommu->local_domain, iommu_group)) {
>  			mutex_unlock(&iommu->lock);
>  			return -EINVAL;
>  		}
> @@ -769,6 +1163,30 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_free;
>  
> +	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
> +	    (bus == &mdev_bus_type)) {

Hi Kirti,

By referring to mdev_bus_type directly you are making vfio_iommu_type1.ko
depend on mdev.ko, but Kconfig doesn't guarantee that dependency. For
example, if CONFIG_VFIO_IOMMU_TYPE1=y and CONFIG_VFIO_MDEV=m, the build
will fail.


--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-21  7:49   ` Jike Song
@ 2016-10-21 14:36     ` Alex Williamson
  2016-10-24 10:35       ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2016-10-21 14:36 UTC (permalink / raw)
  To: Jike Song
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi, linux-kernel

On Fri, 21 Oct 2016 15:49:07 +0800
Jike Song <jike.song@intel.com> wrote:

> On 10/18/2016 05:22 AM, Kirti Wankhede wrote:
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 2ba19424e4a1..5d67058a611d 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c  
> [snip]
> >  static int vfio_iommu_type1_attach_group(void *iommu_data,
> >  					 struct iommu_group *iommu_group)
> >  {
> >  	struct vfio_iommu *iommu = iommu_data;
> > -	struct vfio_group *group, *g;
> > +	struct vfio_group *group;
> >  	struct vfio_domain *domain, *d;
> >  	struct bus_type *bus = NULL;
> >  	int ret;
> > @@ -746,10 +1136,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >  	mutex_lock(&iommu->lock);
> >  
> >  	list_for_each_entry(d, &iommu->domain_list, next) {
> > -		list_for_each_entry(g, &d->group_list, next) {
> > -			if (g->iommu_group != iommu_group)
> > -				continue;
> > +		if (find_iommu_group(d, iommu_group)) {
> > +			mutex_unlock(&iommu->lock);
> > +			return -EINVAL;
> > +		}
> > +	}
> >  
> > +	if (iommu->local_domain) {
> > +		if (find_iommu_group(iommu->local_domain, iommu_group)) {
> >  			mutex_unlock(&iommu->lock);
> >  			return -EINVAL;
> >  		}
> > @@ -769,6 +1163,30 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >  	if (ret)
> >  		goto out_free;
> >  
> > +	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
> > +	    (bus == &mdev_bus_type)) {  
> 
> Hi Kirti,
> 
> By referring to mdev_bus_type directly you are making vfio_iommu_type1.ko
> depend on mdev.ko, but Kconfig doesn't guarantee that dependency. For
> example, if CONFIG_VFIO_IOMMU_TYPE1=y and CONFIG_VFIO_MDEV=m, the build
> will fail.

Good point, Jike.  I don't think we want to make existing vfio modules
dependent on mdev modules.  I wonder if we can look up the mdev_bus_type
symbol w/o triggering the module load.  Thanks,

Alex
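
One possible shape of that lookup, as a sketch against the attach path
in this patch (it assumes mdev_bus_type is exported; symbol_get() only
succeeds if mdev is already loaded, so vfio_iommu_type1 gets no
link-time dependency on mdev.ko):

struct bus_type *mdev_bus;

mdev_bus = symbol_get(mdev_bus_type);
if (mdev_bus) {
	bool is_mdev_bus = (bus == mdev_bus) && !iommu_present(bus);

	symbol_put(mdev_bus_type);

	if (is_mdev_bus) {
		/* set up the page-tracking (local) domain instead of
		 * allocating an iommu_domain for this group
		 */
	}
}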

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-20 20:17     ` Kirti Wankhede
@ 2016-10-24  2:32       ` Alex Williamson
  2016-10-26  7:19         ` Tian, Kevin
  0 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2016-10-24  2:32 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Fri, 21 Oct 2016 01:47:25 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Alex,
> 
> Addressing your comments other than invalidation part.
> 
> On 10/20/2016 2:32 AM, Alex Williamson wrote:
> > On Tue, 18 Oct 2016 02:52:04 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> ...
> >> Tested by assigning below combinations of devices to a single VM:
> >> - GPU pass through only
> >> - vGPU device only
> >> - One GPU pass through and one vGPU device
> >> - Linux VM hot plug and unplug vGPU device while GPU pass through device
> >>   exist
> >> - Linux VM hot plug and unplug GPU pass through device while vGPU device
> >>   exist  
> > 
> > Were you able to do these with the locked memory limit of the user set
> > to the minimum required for existing GPU assignment?
> >   
> 
> No. Is there a way to set the memory limit through libvirt so that it
> matches the system memory assigned to the VM?

Not that I know of, but I also don't know how you're making use of an
mdev device through libvirt yet since they don't have support for the
vfio-pci sysfsdev option.  I would recommend testing with QEMU manually.

> ...
> >> +	container = group->container;
> >> +	if (IS_ERR(container)) {  
> > 
> > I don't see that we ever use an ERR_PTR to set group->container, it
> > should either be NULL or valid and the fact that we added ourselves to
> > container_users should mean that it's valid.  The paranoia test here
> > would be if container is NULL, but IS_ERR() doesn't check NULL.  If we
> > need that paranoia test, maybe we should just:
> > 
> > if (WARN_ON(!container)) {
> > 
> > I'm not fully convinced it's needed though.
> >   
> 
> Ok removing this check.
> 
> >> +		ret = PTR_ERR(container);
> >> +		goto err_pin_pages;
> >> +	}
> >> +
> >> +	down_read(&container->group_lock);
> >> +
> >> +	driver = container->iommu_driver;
> >> +	if (likely(driver && driver->ops->pin_pages))
> >> +		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
> >> +					     npage, prot, phys_pfn);  
> > 
> > The caller is going to need to provide some means for us to callback to
> > invalidate pinned pages.
> > 
> > ret has already been used, so it's zero at this point.  I expect the
> > original intention was to let the initialization above fall through
> > here so that the caller gets an errno if the driver doesn't support
> > pin_pages.  Returning zero without actually doing anything seems like
> > an unexpected return value.
> >   
> 
> yes, changing it to:
> 
> driver = container->iommu_driver;
> if (likely(driver && driver->ops->pin_pages))
>         ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
>                                      npage, prot, phys_pfn);
> else
>         ret = -EINVAL;
> 
> 
> 
> 
> >> +static int vfio_pfn_account(struct vfio_iommu *iommu, unsigned long pfn)
> >> +{
> >> +	struct vfio_pfn *p;
> >> +	struct vfio_domain *domain = iommu->local_domain;
> >> +	int ret = 1;
> >> +
> >> +	if (!domain)
> >> +		return 1;
> >> +
> >> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> >> +
> >> +	p = vfio_find_pfn(domain, pfn);
> >> +	if (p)
> >> +		ret = 0;
> >> +
> >> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> >> +	return ret;
> >> +}  
> > 
> > So if the vfio_pfn for a given pfn exists, return 0, else return 1.
> > But do we know that the vfio_pfn exists at the point where we actually
> > do that accounting?
> >  
> 
> Only the functions below call vfio_pfn_account():
> __vfio_pin_pages_remote() -> vfio_pfn_account()
> __vfio_unpin_pages_remote() -> vfio_pfn_account()
> 
> Consider the case where an mdev device is already assigned to a VM, some
> app in the VM pins some pages, and then a pass-through device is
> hotplugged. __vfio_pin_pages_remote() is then called from
> vfio_iommu_replay() to pin all pages when the iommu-capable domain is
> attached to the container. If a vfio_pfn already exists at that point,
> the page was pinned through the local_domain while no iommu-capable
> domain was present, so accounting was already done for those pages.
> Hence 0 is returned here, which means: don't add this page to the
> accounting.

Right, I see that's the intention, I can't pick any holes in the
concept, but I'll continue to try to look for bugs.

> >> +
> >>  struct vwork {
> >>  	struct mm_struct	*mm;
> >>  	long			npage;
> >> @@ -150,17 +269,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
> >>  	kfree(vwork);
> >>  }
> >>  
> >> -static void vfio_lock_acct(long npage)
> >> +static void vfio_lock_acct(struct task_struct *task, long npage)
> >>  {
> >>  	struct vwork *vwork;
> >>  	struct mm_struct *mm;
> >>  
> >> -	if (!current->mm || !npage)
> >> +	if (!task->mm || !npage)
> >>  		return; /* process exited or nothing to do */
> >>  
> >> -	if (down_write_trylock(&current->mm->mmap_sem)) {
> >> -		current->mm->locked_vm += npage;
> >> -		up_write(&current->mm->mmap_sem);
> >> +	if (down_write_trylock(&task->mm->mmap_sem)) {
> >> +		task->mm->locked_vm += npage;
> >> +		up_write(&task->mm->mmap_sem);
> >>  		return;
> >>  	}
> >>  
> >> @@ -172,7 +291,7 @@ static void vfio_lock_acct(long npage)
> >>  	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> >>  	if (!vwork)
> >>  		return;
> >> -	mm = get_task_mm(current);
> >> +	mm = get_task_mm(task);
> >>  	if (!mm) {
> >>  		kfree(vwork);
> >>  		return;
> >> @@ -228,20 +347,31 @@ static int put_pfn(unsigned long pfn, int prot)
> >>  	return 0;
> >>  }  
> > 
> > This coversion of vfio_lock_acct() to pass a task_struct and updating
> > existing callers to pass current would be a great separate, easily
> > review-able patch.
> >  
> 
> Ok. I'll split this in separate commit.
> 
> 
> >>  
> >> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> >> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> >> +			 int prot, unsigned long *pfn)
> >>  {
> >>  	struct page *page[1];
> >>  	struct vm_area_struct *vma;
> >> +	struct mm_struct *local_mm = (mm ? mm : current->mm);
> >>  	int ret = -EFAULT;
> >>  
> >> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> >> +	if (mm) {
> >> +		down_read(&local_mm->mmap_sem);
> >> +		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
> >> +					!!(prot & IOMMU_WRITE), 0, page, NULL);
> >> +		up_read(&local_mm->mmap_sem);
> >> +	} else
> >> +		ret = get_user_pages_fast(vaddr, 1,
> >> +					  !!(prot & IOMMU_WRITE), page);
> >> +
> >> +	if (ret == 1) {
> >>  		*pfn = page_to_pfn(page[0]);
> >>  		return 0;
> >>  	}
> >>  
> >> -	down_read(&current->mm->mmap_sem);
> >> +	down_read(&local_mm->mmap_sem);
> >>  
> >> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> >> +	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
> >>  
> >>  	if (vma && vma->vm_flags & VM_PFNMAP) {
> >>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> >> @@ -249,7 +379,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> >>  			ret = 0;
> >>  	}
> >>  
> >> -	up_read(&current->mm->mmap_sem);
> >> +	up_read(&local_mm->mmap_sem);
> >>  
> >>  	return ret;
> >>  }  
> > 
> > This would also be a great separate patch.  
> 
> Ok.
> 
> >  Have you considered
> > renaming the mm_struct function arg to "remote_mm" and making the local
> > variable simply "mm"?  It seems like it would tie nicely with the
> > remote_mm path using get_user_pages_remote() while passing NULL for
> > remote_mm uses current->mm and the existing path (and avoid the general
> > oddness of passing local_mm to a "remote" function).
> >   
> 
> Yes, your suggestion looks good. Updating.
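
A quick sketch of what that renaming might look like in vaddr_get_pfn(),
an assumption about the next revision based on the hunk above, not the
final code:

static int vaddr_get_pfn(struct mm_struct *remote_mm, unsigned long vaddr,
			 int prot, unsigned long *pfn)
{
	struct page *page[1];
	struct vm_area_struct *vma;
	/* NULL remote_mm means "pin on behalf of current", as before */
	struct mm_struct *mm = remote_mm ? remote_mm : current->mm;
	int ret = -EFAULT;

	if (remote_mm) {
		down_read(&mm->mmap_sem);
		ret = get_user_pages_remote(NULL, mm, vaddr, 1,
					    !!(prot & IOMMU_WRITE), 0, page,
					    NULL);
		up_read(&mm->mmap_sem);
	} else
		ret = get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
					  page);

	if (ret == 1) {
		*pfn = page_to_pfn(page[0]);
		return 0;
	}

	down_read(&mm->mmap_sem);

	vma = find_vma_intersection(mm, vaddr, vaddr + 1);

	if (vma && vma->vm_flags & VM_PFNMAP) {
		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
		if (is_invalid_reserved_pfn(*pfn))
			ret = 0;
	}

	up_read(&mm->mmap_sem);

	return ret;
}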
> 
> 
> >> @@ -259,33 +389,37 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> >>   * the iommu can only map chunks of consecutive pfns anyway, so get the
> >>   * first page and all consecutive pages with the same locking.
> >>   */
> >> -static long vfio_pin_pages(unsigned long vaddr, long npage,
> >> -			   int prot, unsigned long *pfn_base)
> >> +static long __vfio_pin_pages_remote(struct vfio_iommu *iommu,
> >> +				    unsigned long vaddr, long npage,
> >> +				    int prot, unsigned long *pfn_base)  
> 
> ...
> 
> 
> >> @@ -303,8 +437,10 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
> >>  			break;
> >>  		}
> >>  
> >> +		lock_acct += vfio_pfn_account(iommu, pfn);
> >> +  
> > 
> > I take it that this is the new technique for keeping the accounting
> > accurate, we only increment the locked accounting by the amount not
> > already pinned in a vfio_pfn.
> >  
> 
> That's correct.
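
Condensed, the accounting pattern in the pin loop then looks roughly like
this (a sketch of the hunks above, with the actual page pinning elided):

	long lock_acct = 0;

	for (i = 0; i < npage; i++, vaddr += PAGE_SIZE) {
		/* ... pin the page and obtain pfn (elided) ... */

		/*
		 * Counts 1 only if this pfn is not already tracked (and
		 * therefore already accounted) in the local/mdev pfn_list.
		 */
		lock_acct += vfio_pfn_account(iommu, pfn);

		if (!rsvd && !lock_cap &&
		    current->mm->locked_vm + lock_acct > limit) {
			put_pfn(pfn, prot);
			break;
		}
	}

	if (!rsvd)
		vfio_lock_acct(current, lock_acct);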
> 
> 
> >>  		if (!rsvd && !lock_cap &&
> >> -		    current->mm->locked_vm + i + 1 > limit) {
> >> +		    current->mm->locked_vm + lock_acct > limit) {
> >>  			put_pfn(pfn, prot);
> >>  			pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> >>  				__func__, limit << PAGE_SHIFT);
> >> @@ -313,23 +449,216 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
> >>  	}
> >>  
> >>  	if (!rsvd)
> >> -		vfio_lock_acct(i);
> >> +		vfio_lock_acct(current, lock_acct);
> >>  
> >>  	return i;
> >>  }
> >>  
> >> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> >> -			     int prot, bool do_accounting)
> >> +static long __vfio_unpin_pages_remote(struct vfio_iommu *iommu,
> >> +				      unsigned long pfn, long npage, int prot,
> >> +				      bool do_accounting)  
> > 
> > Have you noticed that it's kind of confusing that
> > __vfio_{un}pin_pages_remote() uses current, which does a
> > get_user_pages_fast() while "local" uses a provided task_struct and
> > uses get_user_pages_*remote*()?  And also what was effectively local
> > (ie. we're pinning for our own use here) is now "remote" and pinning
> > for a remote, vendor driver consumer, is now "local".  It's not very
> > intuitive.
> >   
> 
> 'local' in local_domain was suggested to describe the domain used for
> local page tracking. Earlier suggestions to use 'mdev' or 'noiommu' in
> this name were discarded. Maybe we should revisit what the name should
> be. Any suggestions?
> 
> For local_domain, the flow to pin pages is:
> 
> for local_domain
>     |- vfio_pin_pages()
>         |- vfio_iommu_type1_pin_pages()
>             |- __vfio_pin_page_local()
>                 |-  vaddr_get_pfn(task->mm)
>                     |- get_user_pages_remote()
> 
> __vfio_pin_page_local() --> get_user_pages_remote()


In vfio.c we have the concept of an external user, perhaps that could
be continued here.  An mdev driver would be an external, or remote
pinning.

> >>  static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>  					 struct iommu_group *iommu_group)
> >>  {
> >>  	struct vfio_iommu *iommu = iommu_data;
> >> -	struct vfio_group *group, *g;
> >> +	struct vfio_group *group;
> >>  	struct vfio_domain *domain, *d;
> >>  	struct bus_type *bus = NULL;
> >>  	int ret;
> >> @@ -746,10 +1136,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>  	mutex_lock(&iommu->lock);
> >>  
> >>  	list_for_each_entry(d, &iommu->domain_list, next) {
> >> -		list_for_each_entry(g, &d->group_list, next) {
> >> -			if (g->iommu_group != iommu_group)
> >> -				continue;
> >> +		if (find_iommu_group(d, iommu_group)) {
> >> +			mutex_unlock(&iommu->lock);
> >> +			return -EINVAL;
> >> +		}
> >> +	}  
> > 
> > The find_iommu_group() conversion would also be an easy separate patch.
> >   
> 
> Ok.
> 
> >>  
> >> +	if (iommu->local_domain) {
> >> +		if (find_iommu_group(iommu->local_domain, iommu_group)) {
> >>  			mutex_unlock(&iommu->lock);
> >>  			return -EINVAL;
> >>  		}
> >> @@ -769,6 +1163,30 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
> >>  	if (ret)
> >>  		goto out_free;
> >>  
> >> +	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
> >> +	    (bus == &mdev_bus_type)) {
> >> +		if (!iommu->local_domain) {
> >> +			domain->local_addr_space =
> >> +				kzalloc(sizeof(*domain->local_addr_space),
> >> +						GFP_KERNEL);
> >> +			if (!domain->local_addr_space) {
> >> +				ret = -ENOMEM;
> >> +				goto out_free;
> >> +			}
> >> +
> >> +			domain->local_addr_space->task = current;
> >> +			INIT_LIST_HEAD(&domain->group_list);
> >> +			domain->local_addr_space->pfn_list = RB_ROOT;
> >> +			mutex_init(&domain->local_addr_space->pfn_list_lock);
> >> +			iommu->local_domain = domain;
> >> +		} else
> >> +			kfree(domain);
> >> +
> >> +		list_add(&group->next, &domain->group_list);  
> > 
> > I think you mean s/domain/iommu->local_domain/ here, we just freed
> > domain in the else path.
> >   
> 
> Yes, corrected.
> 
> >> +		mutex_unlock(&iommu->lock);
> >> +		return 0;
> >> +	}
> >> +
> >>  	domain->domain = iommu_domain_alloc(bus);
> >>  	if (!domain->domain) {
> >>  		ret = -EIO;
> >> @@ -859,6 +1277,41 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
> >>  		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
> >>  }
> >>  
> >> +static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
> >> +{
> >> +	struct vfio_domain *domain = iommu->local_domain;
> >> +	struct vfio_dma *dma, *tdma;
> >> +	struct rb_node *n;
> >> +	long locked = 0;
> >> +
> >> +	rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
> >> +					     node) {
> >> +		vfio_unmap_unpin(iommu, dma);
> >> +	}
> >> +
> >> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> >> +
> >> +	n = rb_first(&domain->local_addr_space->pfn_list);
> >> +
> >> +	for (; n; n = rb_next(n))
> >> +		locked++;
> >> +
> >> +	vfio_lock_acct(domain->local_addr_space->task, locked);
> >> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> >> +}  
> > 
> > Couldn't a properly timed mlock by the user allow them to lock more
> > memory than they're allowed here?  For instance imagine the vendor
> > driver has pinned the entire VM memory and the user has exactly the
> > locked memory limit for that VM.  During the gap here between unpinning
> > the entire vfio_dma list and re-accounting for the pfn_list, the user
> > can mlock up to their limit again and now they've doubled the locked
> > memory they're allowed.
> >   
> 
> As per the original code, vfio_unmap_unpin() calls
> __vfio_unpin_pages_remote(.., false) with do_accounting set to false.
> Why is that so?

Because vfio_dma tracks the user granularity of calling MAP_DMA, not
the granularity with which the iommu mapping was actually done.  There
might be multiple non-contiguous chunks to make that mapping and we
don't know how the iommu chose to map a given chunk to support large
page sizes.  If we chose to do accounting on the iommu_unmap()
granularity, we might account for every 4k page separately.  We choose
not to do accounting there so that we can batch the accounting into one
update per range.

> If do_accounting were set to true there, we wouldn't have to do the
> re-accounting here.

If vfio_unmap_unpin() did not do accounting, you could update
accounting once with the difference between what was pinned and what
remains pinned via the mdev and avoid the gap caused by de-accounting
everything and then re-accounting only for the mdev pinnings.
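
A rough sketch of that alternative, assuming a hypothetical
vfio_unmap_unpin_noacct() variant that skips accounting and returns the
number of pages it would otherwise have de-accounted (the bookkeeping for
pages tracked on both paths is glossed over):

static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
{
	struct vfio_domain *domain = iommu->local_domain;
	struct vfio_dma *dma, *tdma;
	struct rb_node *n;
	long unpinned = 0, still_pinned = 0;

	rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
					     node)
		/* hypothetical no-accounting variant of vfio_unmap_unpin() */
		unpinned += vfio_unmap_unpin_noacct(iommu, dma);

	mutex_lock(&domain->local_addr_space->pfn_list_lock);

	for (n = rb_first(&domain->local_addr_space->pfn_list); n;
	     n = rb_next(n))
		still_pinned++;

	/* one adjustment, no window where everything is de-accounted */
	vfio_lock_acct(domain->local_addr_space->task,
		       still_pinned - unpinned);

	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
}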

> >> +
> >> +static void vfio_local_unpin_all(struct vfio_domain *domain)
> >> +{
> >> +	struct rb_node *node;
> >> +
> >> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> >> +	while ((node = rb_first(&domain->local_addr_space->pfn_list)))
> >> +		vfio_unpin_pfn(domain,
> >> +				rb_entry(node, struct vfio_pfn, node), false);
> >> +
> >> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> >> +}
> >> +
> >>  static void vfio_iommu_type1_detach_group(void *iommu_data,
> >>  					  struct iommu_group *iommu_group)
> >>  {
> >> @@ -868,31 +1321,57 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
> >>  
> >>  	mutex_lock(&iommu->lock);
> >>  
> >> -	list_for_each_entry(domain, &iommu->domain_list, next) {
> >> -		list_for_each_entry(group, &domain->group_list, next) {
> >> -			if (group->iommu_group != iommu_group)
> >> -				continue;
> >> -
> >> -			iommu_detach_group(domain->domain, iommu_group);
> >> +	if (iommu->local_domain) {
> >> +		domain = iommu->local_domain;
> >> +		group = find_iommu_group(domain, iommu_group);
> >> +		if (group) {
> >>  			list_del(&group->next);
> >>  			kfree(group);
> >> -			/*
> >> -			 * Group ownership provides privilege, if the group
> >> -			 * list is empty, the domain goes away.  If it's the
> >> -			 * last domain, then all the mappings go away too.
> >> -			 */
> >> +
> >>  			if (list_empty(&domain->group_list)) {
> >> -				if (list_is_singular(&iommu->domain_list))
> >> +				vfio_local_unpin_all(domain);
> >> +				if (!IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu))
> >>  					vfio_iommu_unmap_unpin_all(iommu);
> >> -				iommu_domain_free(domain->domain);
> >> -				list_del(&domain->next);
> >>  				kfree(domain);
> >> +				iommu->local_domain = NULL;
> >> +			}  
> > 
> > 
> > I can't quite wrap my head around this, if we have mdev groups attached
> > and this iommu group matches an mdev group, remove from list and free
> > the group.  If there are now no more groups in the mdev group list,
> > then for each vfio_pfn, unpin the pfn, /without/ doing accounting
> > updates
> 
> corrected the code to do accounting here.
> 
> > and remove the vfio_pfn, but only if the ref_count is now
> > zero.  
> 
> Yes. If you look at the loop in vfio_local_unpin_all(), it iterates as
> long as a node exists in the rb tree:
> 
> >> +	while ((node = rb_first(&domain->local_addr_space->pfn_list)))
> >> +		vfio_unpin_pfn(domain,
> >> +				rb_entry(node, struct vfio_pfn, node), false);
> >> +  
> 
> 
> and vfio_unpin_pfn() only removes the node from the rb tree when the
> ref count drops to zero.
> 
> static int vfio_unpin_pfn(struct vfio_domain *domain,
>                           struct vfio_pfn *vpfn, bool do_accounting)
> {
>         __vfio_unpin_page_local(domain, vpfn->pfn, vpfn->prot,
>                                 do_accounting);
> 
>         if (atomic_dec_and_test(&vpfn->ref_count))
>                 vfio_remove_from_pfn_list(domain, vpfn);
> 
>         return 1;
> }
> 
> So, for example, if a vfio_pfn's ref_count is 2, the first iteration
> would:
>  - call __vfio_unpin_page_local()
>  - atomic_dec(ref_count), so ref_count is now 1, but the node is not
> removed from the rb tree.
> 
> In the next iteration:
>  - call __vfio_unpin_page_local()
>  - atomic_dec(ref_count), so ref_count is now 0, and the node is removed
> from the rb tree.

Ok, I missed that, thanks.

> >  We free the domain, so if the ref_count was non-zero we've now
> > just leaked memory.  I think that means that if a vendor driver pins a
> > given page twice, that leak occurs.  Furthermore, if there is not an
> > iommu capable domain in the container, we remove all the vfio_dma
> > entries as well, ok.  Maybe the only issue is those leaked vfio_pfns.
> >   
> 
> So if a vendor driver pins a page twice, vfio_unpin_pfn() gets called
> twice, and the node is removed from the rb tree only when the ref count
> reaches zero. So there is no memory leak.

Ok

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 00/12] Add Mediated device support
  2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
                   ` (12 preceding siblings ...)
  2016-10-17 21:41 ` [PATCH v9 00/12] Add Mediated device support Alex Williamson
@ 2016-10-24  7:07 ` Jike Song
  2016-12-05 17:44   ` Gerd Hoffmann
  13 siblings, 1 reply; 73+ messages in thread
From: Jike Song @ 2016-10-24  7:07 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, cjia
  Cc: pbonzini, kraxel, qemu-devel, kvm, kevin.tian, bjsdjshi, linux-kernel

On 10/18/2016 05:22 AM, Kirti Wankhede wrote:
> This series adds Mediated device support to Linux host kernel. Purpose
> of this series is to provide a common interface for mediated device
> management that can be used by different devices. This series introduces
> Mdev core module that creates and manages mediated devices, VFIO based
> driver for mediated devices that are created by mdev core module and
> update VFIO type1 IOMMU module to support pinning & unpinning for mediated
> devices.
> 
> What changed in v9?

Hi Alex, Kirti and Neo

Just want to share that we have published a KVMGT implementation
based on this v9 patchset, to:

	https://github.com/01org/gvt-linux/tree/gvt-next-kvmgt

It doesn't utilize the common routines introduced by patches 05+ yet.
The complete Intel vGPU device model is included.


--
Thanks,
Jike

 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-21 14:36     ` Alex Williamson
@ 2016-10-24 10:35       ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-24 10:35 UTC (permalink / raw)
  To: Alex Williamson, Jike Song
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, bjsdjshi,
	linux-kernel



On 10/21/2016 8:06 PM, Alex Williamson wrote:
> On Fri, 21 Oct 2016 15:49:07 +0800
> Jike Song <jike.song@intel.com> wrote:
> 
>> On 10/18/2016 05:22 AM, Kirti Wankhede wrote:
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 2ba19424e4a1..5d67058a611d 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c  
>> [snip]
>>>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>  					 struct iommu_group *iommu_group)
>>>  {
>>>  	struct vfio_iommu *iommu = iommu_data;
>>> -	struct vfio_group *group, *g;
>>> +	struct vfio_group *group;
>>>  	struct vfio_domain *domain, *d;
>>>  	struct bus_type *bus = NULL;
>>>  	int ret;
>>> @@ -746,10 +1136,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>  	mutex_lock(&iommu->lock);
>>>  
>>>  	list_for_each_entry(d, &iommu->domain_list, next) {
>>> -		list_for_each_entry(g, &d->group_list, next) {
>>> -			if (g->iommu_group != iommu_group)
>>> -				continue;
>>> +		if (find_iommu_group(d, iommu_group)) {
>>> +			mutex_unlock(&iommu->lock);
>>> +			return -EINVAL;
>>> +		}
>>> +	}
>>>  
>>> +	if (iommu->local_domain) {
>>> +		if (find_iommu_group(iommu->local_domain, iommu_group)) {
>>>  			mutex_unlock(&iommu->lock);
>>>  			return -EINVAL;
>>>  		}
>>> @@ -769,6 +1163,30 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>>>  	if (ret)
>>>  		goto out_free;
>>>  
>>> +	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
>>> +	    (bus == &mdev_bus_type)) {  
>>
>> Hi Kirti,
>>
>> By referring to mdev_bus_type directly you are making vfio_iommu_type1.ko
>> depend on mdev.ko, but Kconfig doesn't guarantee that dependency. For example,
>> if CONFIG_VFIO_IOMMU_TYPE1=y and CONFIG_VFIO_MDEV=m, the build will fail.
> 
> Good point, Jike.  I don't think we want to make existing vfio modules
> dependent on mdev modules.  I wonder if we can look up the mdev_bus_type
> symbol w/o triggering the module load.  Thanks,
> 

Ok. Modifying the check as below works in the above case:

        mdev_bus = symbol_get(mdev_bus_type);

        if (mdev_bus && (bus == mdev_bus) && !iommu_present(bus) ) {
                symbol_put(mdev_bus_type);
                ...
        }
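
A slightly fuller sketch of that check, assuming the module reference also
has to be dropped when the bus doesn't match so that the
symbol_get()/symbol_put() pair stays balanced:

	struct bus_type *mdev_bus;

	mdev_bus = symbol_get(mdev_bus_type);
	if (mdev_bus) {
		if ((bus == mdev_bus) && !iommu_present(bus)) {
			symbol_put(mdev_bus_type);
			/* set up the mdev/local domain here */
			...
			mutex_unlock(&iommu->lock);
			return 0;
		}
		/* not the mdev bus, or an IOMMU is present: drop the ref */
		symbol_put(mdev_bus_type);
	}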

Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 06/12] vfio_pci: Update vfio_pci to use vfio_info_add_capability()
  2016-10-20 19:24   ` Alex Williamson
@ 2016-10-24 21:22     ` Kirti Wankhede
  2016-10-24 21:37       ` Alex Williamson
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-24 21:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel



On 10/21/2016 12:54 AM, Alex Williamson wrote:
> On Tue, 18 Oct 2016 02:52:06 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Update msix_sparse_mmap_cap() to use vfio_info_add_capability()
>> Update region type capability to use vfio_info_add_capability()
>> Can't split this commit for MSIx and region_type cap since there is a
>> common code which need to be updated for both the cases.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Signed-off-by: Neo Jia <cjia@nvidia.com>
>> Change-Id: I52bb28c7875a6da5a79ddad1843e6088aff58a45
>> ---
>>  drivers/vfio/pci/vfio_pci.c | 72 +++++++++++++++++----------------------------
>>  1 file changed, 27 insertions(+), 45 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index d624a527777f..1ec0565b48ea 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -556,12 +556,12 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
>>  }
>>  
>>  static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
>> +				struct vfio_region_info *info,
>>  				struct vfio_info_cap *caps)
>>  {
>> -	struct vfio_info_cap_header *header;
>>  	struct vfio_region_info_cap_sparse_mmap *sparse;
>>  	size_t end, size;
>> -	int nr_areas = 2, i = 0;
>> +	int nr_areas = 2, i = 0, ret;
>>  
>>  	end = pci_resource_len(vdev->pdev, vdev->msix_bar);
>>  
>> @@ -572,13 +572,10 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
>>  
>>  	size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
>>  
>> -	header = vfio_info_cap_add(caps, size,
>> -				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
>> -	if (IS_ERR(header))
>> -		return PTR_ERR(header);
>> +	sparse = kzalloc(size, GFP_KERNEL);
>> +	if (!sparse)
>> +		return -ENOMEM;
>>  
>> -	sparse = container_of(header,
>> -			      struct vfio_region_info_cap_sparse_mmap, header);
>>  	sparse->nr_areas = nr_areas;
>>  
>>  	if (vdev->msix_offset & PAGE_MASK) {
>> @@ -594,26 +591,11 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
>>  		i++;
>>  	}
>>  
>> -	return 0;
>> -}
>> -
>> -static int region_type_cap(struct vfio_pci_device *vdev,
>> -			   struct vfio_info_cap *caps,
>> -			   unsigned int type, unsigned int subtype)
>> -{
>> -	struct vfio_info_cap_header *header;
>> -	struct vfio_region_info_cap_type *cap;
>> -
>> -	header = vfio_info_cap_add(caps, sizeof(*cap),
>> -				   VFIO_REGION_INFO_CAP_TYPE, 1);
>> -	if (IS_ERR(header))
>> -		return PTR_ERR(header);
>> +	ret = vfio_info_add_capability(info, caps,
>> +				      VFIO_REGION_INFO_CAP_SPARSE_MMAP, sparse);
>> +	kfree(sparse);
>>  
>> -	cap = container_of(header, struct vfio_region_info_cap_type, header);
>> -	cap->type = type;
>> -	cap->subtype = subtype;
>> -
>> -	return 0;
>> +	return ret;
>>  }
>>  
>>  int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
>> @@ -704,7 +686,8 @@ static long vfio_pci_ioctl(void *device_data,
>>  			if (vdev->bar_mmap_supported[info.index]) {
>>  				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
>>  				if (info.index == vdev->msix_bar) {
>> -					ret = msix_sparse_mmap_cap(vdev, &caps);
>> +					ret = msix_sparse_mmap_cap(vdev, &info,
>> +								   &caps);
>>  					if (ret)
>>  						return ret;
>>  				}
>> @@ -752,6 +735,9 @@ static long vfio_pci_ioctl(void *device_data,
>>  
>>  			break;
>>  		default:
>> +		{
>> +			struct vfio_region_info_cap_type cap_type;
>> +
>>  			if (info.index >=
>>  			    VFIO_PCI_NUM_REGIONS + vdev->num_regions)
>>  				return -EINVAL;
>> @@ -762,27 +748,23 @@ static long vfio_pci_ioctl(void *device_data,
>>  			info.size = vdev->region[i].size;
>>  			info.flags = vdev->region[i].flags;
>>  
>> -			ret = region_type_cap(vdev, &caps,
>> -					      vdev->region[i].type,
>> -					      vdev->region[i].subtype);
>> +			cap_type.type = vdev->region[i].type;
>> +			cap_type.subtype = vdev->region[i].subtype;
>> +
>> +			ret = vfio_info_add_capability(&info, &caps,
>> +						      VFIO_REGION_INFO_CAP_TYPE,
>> +						      &cap_type);
>>  			if (ret)
>>  				return ret;
>> +
>> +		}
>>  		}
>>  
>> -		if (caps.size) {
>> -			info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
>> -			if (info.argsz < sizeof(info) + caps.size) {
>> -				info.argsz = sizeof(info) + caps.size;
>> -				info.cap_offset = 0;
>> -			} else {
>> -				vfio_info_cap_shift(&caps, sizeof(info));
>> -				if (copy_to_user((void __user *)arg +
>> -						  sizeof(info), caps.buf,
>> -						  caps.size)) {
>> -					kfree(caps.buf);
>> -					return -EFAULT;
>> -				}
>> -				info.cap_offset = sizeof(info);
> 
> I prefer the case above, I'm fine with breaking out helpers to build a
> buffer containing the capability chain, but I would rather have the
> caller manage placing that back into the return structure.  That also
> allows the helper to be independent of the structure we're operating
> on, it could be a region_info, irq_info, device_info, etc.  It only
> needs to know the layout of the capability type we're trying to add,
> not the info structure itself.  Thanks,
> 

The capability feature is for region_info (VFIO_REGION_INFO_FLAG_CAPS),
so the structure we are operating on couldn't be irq_info or device_info, right?

Kirti.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 05/12] vfio: Introduce common function to add capabilities
  2016-10-20 19:24   ` Alex Williamson
@ 2016-10-24 21:27     ` Kirti Wankhede
  2016-10-24 21:39       ` Alex Williamson
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-24 21:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel



On 10/21/2016 12:54 AM, Alex Williamson wrote:
> On Tue, 18 Oct 2016 02:52:05 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Vendor driver using mediated device framework should use
>> vfio_info_add_capability() to add capabilities.
>> Introduced this function to reduce code duplication in vendor drivers.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Signed-off-by: Neo Jia <cjia@nvidia.com>
>> Change-Id: I6fca329fa2291f37a2c859d0bc97574d9e2ce1a6
>> ---
>>  drivers/vfio/vfio.c  | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/vfio.h |  4 +++
>>  2 files changed, 82 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index a5a210005b65..e96cb3f7a23c 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -1799,6 +1799,84 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
>>  
>> +static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
>> +{
>> +	struct vfio_info_cap_header *header;
>> +	struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
>> +	size_t size;
>> +
>> +	size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
>> +	header = vfio_info_cap_add(caps, size,
>> +				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
>> +	if (IS_ERR(header))
>> +		return PTR_ERR(header);
>> +
>> +	sparse_cap = container_of(header,
>> +			struct vfio_region_info_cap_sparse_mmap, header);
>> +	sparse_cap->nr_areas = sparse->nr_areas;
>> +	memcpy(sparse_cap->areas, sparse->areas,
>> +	       sparse->nr_areas * sizeof(*sparse->areas));
>> +	return 0;
>> +}
>> +
>> +static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
>> +{
>> +	struct vfio_info_cap_header *header;
>> +	struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
>> +
>> +	header = vfio_info_cap_add(caps, sizeof(*cap),
>> +				   VFIO_REGION_INFO_CAP_TYPE, 1);
>> +	if (IS_ERR(header))
>> +		return PTR_ERR(header);
>> +
>> +	type_cap = container_of(header, struct vfio_region_info_cap_type,
>> +				header);
>> +	type_cap->type = cap->type;
>> +	type_cap->subtype = cap->subtype;
>> +	return 0;
>> +}
>> +
>> +int vfio_info_add_capability(struct vfio_region_info *info,
>> +			     struct vfio_info_cap *caps,
>> +			     int cap_type_id,
>> +			     void *cap_type)
>> +{
>> +	int ret;
>> +
>> +	if (!cap_type)
>> +		return 0;
>> +
>> +	switch (cap_type_id) {
>> +	case VFIO_REGION_INFO_CAP_SPARSE_MMAP:
>> +		ret = sparse_mmap_cap(caps, cap_type);
>> +		if (ret)
>> +			return ret;
>> +		break;
>> +
>> +	case VFIO_REGION_INFO_CAP_TYPE:
>> +		ret = region_type_cap(caps, cap_type);
>> +		if (ret)
>> +			return ret;
>> +		break;
>> +	default:
>> +		return -EINVAL;
>> +	}
>> +
>> +	info->flags |= VFIO_REGION_INFO_FLAG_CAPS;
>> +
>> +	if (caps->size) {
>> +		if (info->argsz < sizeof(*info) + caps->size) {
>> +			info->argsz = sizeof(*info) + caps->size;
>> +			info->cap_offset = 0;
>> +		} else {
>> +			vfio_info_cap_shift(caps, sizeof(*info));
>> +			info->cap_offset = sizeof(*info);
> 
> This doesn't work.  We build the capability chain in a buffer and
> vfio_info_cap_add() expects the chain to be zero-based as each
> capability is added.  vfio_info_cap_shift() is meant to be called once
> on that buffer immediately before copying it back to the user buffer to
> adjust the chain offsets to account for the offset within the buffer.
> vfio_info_cap_shift() cannot be called repeatedly on the buffer as we
> do support multiple capabilities in a chain.
> 

From the code I see, we add one type of capability at a time, either
VFIO_REGION_INFO_CAP_SPARSE_MMAP or VFIO_REGION_INFO_CAP_TYPE. They are
not part of the same case in the switch, right?
I did test VFIO_REGION_INFO_CAP_SPARSE_MMAP by mapping some part of
BAR0 and that works.

Kirti.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 06/12] vfio_pci: Update vfio_pci to use vfio_info_add_capability()
  2016-10-24 21:22     ` Kirti Wankhede
@ 2016-10-24 21:37       ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-24 21:37 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Tue, 25 Oct 2016 02:52:39 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/21/2016 12:54 AM, Alex Williamson wrote:
> > On Tue, 18 Oct 2016 02:52:06 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Update msix_sparse_mmap_cap() to use vfio_info_add_capability()
> >> Update region type capability to use vfio_info_add_capability()
> >> Can't split this commit for MSIx and region_type cap since there is a
> >> common code which need to be updated for both the cases.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Signed-off-by: Neo Jia <cjia@nvidia.com>
> >> Change-Id: I52bb28c7875a6da5a79ddad1843e6088aff58a45
> >> ---
> >>  drivers/vfio/pci/vfio_pci.c | 72 +++++++++++++++++----------------------------
> >>  1 file changed, 27 insertions(+), 45 deletions(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index d624a527777f..1ec0565b48ea 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -556,12 +556,12 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
> >>  }
> >>  
> >>  static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
> >> +				struct vfio_region_info *info,
> >>  				struct vfio_info_cap *caps)
> >>  {
> >> -	struct vfio_info_cap_header *header;
> >>  	struct vfio_region_info_cap_sparse_mmap *sparse;
> >>  	size_t end, size;
> >> -	int nr_areas = 2, i = 0;
> >> +	int nr_areas = 2, i = 0, ret;
> >>  
> >>  	end = pci_resource_len(vdev->pdev, vdev->msix_bar);
> >>  
> >> @@ -572,13 +572,10 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
> >>  
> >>  	size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
> >>  
> >> -	header = vfio_info_cap_add(caps, size,
> >> -				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
> >> -	if (IS_ERR(header))
> >> -		return PTR_ERR(header);
> >> +	sparse = kzalloc(size, GFP_KERNEL);
> >> +	if (!sparse)
> >> +		return -ENOMEM;
> >>  
> >> -	sparse = container_of(header,
> >> -			      struct vfio_region_info_cap_sparse_mmap, header);
> >>  	sparse->nr_areas = nr_areas;
> >>  
> >>  	if (vdev->msix_offset & PAGE_MASK) {
> >> @@ -594,26 +591,11 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
> >>  		i++;
> >>  	}
> >>  
> >> -	return 0;
> >> -}
> >> -
> >> -static int region_type_cap(struct vfio_pci_device *vdev,
> >> -			   struct vfio_info_cap *caps,
> >> -			   unsigned int type, unsigned int subtype)
> >> -{
> >> -	struct vfio_info_cap_header *header;
> >> -	struct vfio_region_info_cap_type *cap;
> >> -
> >> -	header = vfio_info_cap_add(caps, sizeof(*cap),
> >> -				   VFIO_REGION_INFO_CAP_TYPE, 1);
> >> -	if (IS_ERR(header))
> >> -		return PTR_ERR(header);
> >> +	ret = vfio_info_add_capability(info, caps,
> >> +				      VFIO_REGION_INFO_CAP_SPARSE_MMAP, sparse);
> >> +	kfree(sparse);
> >>  
> >> -	cap = container_of(header, struct vfio_region_info_cap_type, header);
> >> -	cap->type = type;
> >> -	cap->subtype = subtype;
> >> -
> >> -	return 0;
> >> +	return ret;
> >>  }
> >>  
> >>  int vfio_pci_register_dev_region(struct vfio_pci_device *vdev,
> >> @@ -704,7 +686,8 @@ static long vfio_pci_ioctl(void *device_data,
> >>  			if (vdev->bar_mmap_supported[info.index]) {
> >>  				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
> >>  				if (info.index == vdev->msix_bar) {
> >> -					ret = msix_sparse_mmap_cap(vdev, &caps);
> >> +					ret = msix_sparse_mmap_cap(vdev, &info,
> >> +								   &caps);
> >>  					if (ret)
> >>  						return ret;
> >>  				}
> >> @@ -752,6 +735,9 @@ static long vfio_pci_ioctl(void *device_data,
> >>  
> >>  			break;
> >>  		default:
> >> +		{
> >> +			struct vfio_region_info_cap_type cap_type;
> >> +
> >>  			if (info.index >=
> >>  			    VFIO_PCI_NUM_REGIONS + vdev->num_regions)
> >>  				return -EINVAL;
> >> @@ -762,27 +748,23 @@ static long vfio_pci_ioctl(void *device_data,
> >>  			info.size = vdev->region[i].size;
> >>  			info.flags = vdev->region[i].flags;
> >>  
> >> -			ret = region_type_cap(vdev, &caps,
> >> -					      vdev->region[i].type,
> >> -					      vdev->region[i].subtype);
> >> +			cap_type.type = vdev->region[i].type;
> >> +			cap_type.subtype = vdev->region[i].subtype;
> >> +
> >> +			ret = vfio_info_add_capability(&info, &caps,
> >> +						      VFIO_REGION_INFO_CAP_TYPE,
> >> +						      &cap_type);
> >>  			if (ret)
> >>  				return ret;
> >> +
> >> +		}
> >>  		}
> >>  
> >> -		if (caps.size) {
> >> -			info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
> >> -			if (info.argsz < sizeof(info) + caps.size) {
> >> -				info.argsz = sizeof(info) + caps.size;
> >> -				info.cap_offset = 0;
> >> -			} else {
> >> -				vfio_info_cap_shift(&caps, sizeof(info));
> >> -				if (copy_to_user((void __user *)arg +
> >> -						  sizeof(info), caps.buf,
> >> -						  caps.size)) {
> >> -					kfree(caps.buf);
> >> -					return -EFAULT;
> >> -				}
> >> -				info.cap_offset = sizeof(info);  
> > 
> > I prefer the case above, I'm fine with breaking out helpers to build a
> > buffer containing the capability chain, but I would rather have the
> > caller manage placing that back into the return structure.  That also
> > allows the helper to be independent of the structure we're operating
> > on, it could be a region_info, irq_info, device_info, etc.  It only
> > needs to know the layout of the capability type we're trying to add,
> > not the info structure itself.  Thanks,
> >   
> 
> The capability feature is for region_info (VFIO_REGION_INFO_FLAG_CAPS),
> so the structure we are operating on couldn't be irq_info or device_info, right?

My point was that I prefer the caller managing region_info.flags and
region_info.cap_offset, with the helper only managing creation of the
capability chain within the buffer.  There is a namespace conflict for
the various capabilities though; each foo_info has its own space, which
can conflict, which may make a common helper impractical.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 05/12] vfio: Introduce common function to add capabilities
  2016-10-24 21:27     ` Kirti Wankhede
@ 2016-10-24 21:39       ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-24 21:39 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Tue, 25 Oct 2016 02:57:58 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/21/2016 12:54 AM, Alex Williamson wrote:
> > On Tue, 18 Oct 2016 02:52:05 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> Vendor driver using mediated device framework should use
> >> vfio_info_add_capability() to add capabilities.
> >> Introduced this function to reduce code duplication in vendor drivers.
> >>
> >> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >> Signed-off-by: Neo Jia <cjia@nvidia.com>
> >> Change-Id: I6fca329fa2291f37a2c859d0bc97574d9e2ce1a6
> >> ---
> >>  drivers/vfio/vfio.c  | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  include/linux/vfio.h |  4 +++
> >>  2 files changed, 82 insertions(+)
> >>
> >> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >> index a5a210005b65..e96cb3f7a23c 100644
> >> --- a/drivers/vfio/vfio.c
> >> +++ b/drivers/vfio/vfio.c
> >> @@ -1799,6 +1799,84 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
> >>  }
> >>  EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
> >>  
> >> +static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
> >> +{
> >> +	struct vfio_info_cap_header *header;
> >> +	struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
> >> +	size_t size;
> >> +
> >> +	size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
> >> +	header = vfio_info_cap_add(caps, size,
> >> +				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
> >> +	if (IS_ERR(header))
> >> +		return PTR_ERR(header);
> >> +
> >> +	sparse_cap = container_of(header,
> >> +			struct vfio_region_info_cap_sparse_mmap, header);
> >> +	sparse_cap->nr_areas = sparse->nr_areas;
> >> +	memcpy(sparse_cap->areas, sparse->areas,
> >> +	       sparse->nr_areas * sizeof(*sparse->areas));
> >> +	return 0;
> >> +}
> >> +
> >> +static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
> >> +{
> >> +	struct vfio_info_cap_header *header;
> >> +	struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
> >> +
> >> +	header = vfio_info_cap_add(caps, sizeof(*cap),
> >> +				   VFIO_REGION_INFO_CAP_TYPE, 1);
> >> +	if (IS_ERR(header))
> >> +		return PTR_ERR(header);
> >> +
> >> +	type_cap = container_of(header, struct vfio_region_info_cap_type,
> >> +				header);
> >> +	type_cap->type = cap->type;
> >> +	type_cap->subtype = cap->subtype;
> >> +	return 0;
> >> +}
> >> +
> >> +int vfio_info_add_capability(struct vfio_region_info *info,
> >> +			     struct vfio_info_cap *caps,
> >> +			     int cap_type_id,
> >> +			     void *cap_type)
> >> +{
> >> +	int ret;
> >> +
> >> +	if (!cap_type)
> >> +		return 0;
> >> +
> >> +	switch (cap_type_id) {
> >> +	case VFIO_REGION_INFO_CAP_SPARSE_MMAP:
> >> +		ret = sparse_mmap_cap(caps, cap_type);
> >> +		if (ret)
> >> +			return ret;
> >> +		break;
> >> +
> >> +	case VFIO_REGION_INFO_CAP_TYPE:
> >> +		ret = region_type_cap(caps, cap_type);
> >> +		if (ret)
> >> +			return ret;
> >> +		break;
> >> +	default:
> >> +		return -EINVAL;
> >> +	}
> >> +
> >> +	info->flags |= VFIO_REGION_INFO_FLAG_CAPS;
> >> +
> >> +	if (caps->size) {
> >> +		if (info->argsz < sizeof(*info) + caps->size) {
> >> +			info->argsz = sizeof(*info) + caps->size;
> >> +			info->cap_offset = 0;
> >> +		} else {
> >> +			vfio_info_cap_shift(caps, sizeof(*info));
> >> +			info->cap_offset = sizeof(*info);  
> > 
> > This doesn't work.  We build the capability chain in a buffer and
> > vfio_info_cap_add() expects the chain to be zero-based as each
> > capability is added.  vfio_info_cap_shift() is meant to be called once
> > on that buffer immediately before copying it back to the user buffer to
> > adjust the chain offsets to account for the offset within the buffer.
> > vfio_info_cap_shift() cannot be called repeatedly on the buffer as we
> > do support multiple capabilities in a chain.
> >   
> 
> From the code I see, we add one type of capability at a time, either
> VFIO_REGION_INFO_CAP_SPARSE_MMAP or VFIO_REGION_INFO_CAP_TYPE. They are
> not part of the same case in the switch, right?
> I did test VFIO_REGION_INFO_CAP_SPARSE_MMAP by mapping some part of
> BAR0 and that works.

That simply means that we don't _currently_ have a user that implements
multiple chain entries.  The interface is however designed to support
multiple entries and this breaks that goal.  Thanks,

Alex
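
To illustrate the constraint: both capabilities are appended to one
zero-based chain, and vfio_info_cap_shift() is applied exactly once just
before the copy to the user buffer. A sketch against the existing
vfio_pci ioctl, reusing its locals; this is not code from this series:

	struct vfio_info_cap caps = { .buf = NULL, .size = 0 };

	/* both capabilities land in the same zero-based chain in caps.buf */
	ret = msix_sparse_mmap_cap(vdev, &caps);
	if (!ret)
		ret = region_type_cap(vdev, &caps, vdev->region[i].type,
				      vdev->region[i].subtype);
	if (ret)
		return ret;

	if (caps.size) {
		info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
		if (info.argsz < sizeof(info) + caps.size) {
			info.argsz = sizeof(info) + caps.size;
			info.cap_offset = 0;
		} else {
			/* shift the whole chain once, then copy it out */
			vfio_info_cap_shift(&caps, sizeof(info));
			if (copy_to_user((void __user *)arg + sizeof(info),
					 caps.buf, caps.size)) {
				kfree(caps.buf);
				return -EFAULT;
			}
			info.cap_offset = sizeof(info);
		}
		kfree(caps.buf);
	}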

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 11/12] docs: Add Documentation for Mediated devices
  2016-10-17 21:22 ` [PATCH v9 11/12] docs: Add Documentation for Mediated devices Kirti Wankhede
@ 2016-10-25 16:17   ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-25 16:17 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song,
	bjsdjshi, linux-kernel

On Tue, 18 Oct 2016 02:52:11 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Add file Documentation/vfio-mediated-device.txt that include details of
> mediated device framework.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
> ---
>  Documentation/vfio-mdev/vfio-mediated-device.txt | 289 +++++++++++++++++++++++
>  1 file changed, 289 insertions(+)
>  create mode 100644 Documentation/vfio-mdev/vfio-mediated-device.txt

A missing component of the documentation is that the entire sysfs ABI
needs to be added to Documentation/ABI.  There's a README and plenty of
examples there.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v9 01/12] vfio: Mediated device Core driver
  2016-10-17 21:22 ` [PATCH v9 01/12] vfio: Mediated device Core driver Kirti Wankhede
  2016-10-18 23:16   ` Alex Williamson
  2016-10-20  7:23   ` Jike Song
@ 2016-10-26  6:52   ` Tian, Kevin
  2016-10-26 14:58     ` Kirti Wankhede
  2 siblings, 1 reply; 73+ messages in thread
From: Tian, Kevin @ 2016-10-26  6:52 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Song, Jike, bjsdjshi, linux-kernel

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Tuesday, October 18, 2016 5:22 AM
> 
> Design for Mediated Device Driver:
> Main purpose of this driver is to provide a common interface for mediated
> device management that can be used by different drivers of different
> devices.
> 
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
> 
> Below is the high Level block diagram, with Nvidia, Intel and IBM devices
> as example, since these are the devices which are going to actively use
> this module as of now.
> 
>  +---------------+
>  |               |
>  | +-----------+ |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |  mdev     | |                         |              |
>  | |  bus      | +------------------------>+              |<-> VFIO user
>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>  | |           | |                         |              |
>  | +-----------+ |                         +--------------+
>  |               |
>  |  MDEV CORE    |
>  |   MODULE      |
>  |   mdev.ko     |
>  | +-----------+ |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         |  nvidia.ko   |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | | Physical  | |
>  | |  device   | |  mdev_register_device() +--------------+
>  | | interface | |<------------------------+              |
>  | |           | |                         |  i915.ko     |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | |           | |
>  | |           | |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         | ccw_device.ko|<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | +-----------+ |
>  +---------------+
> 
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
> 
> /**
>   * struct mdev_driver - Mediated device's driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove:called when device removed
>   * @driver:device driver structure
>   *
>   **/
> struct mdev_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
>          struct device_driver    driver;
> };
> 
> Mediated bus driver for mdev device should use this interface to register
> and unregister with core driver respectively:
> 
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
> 
> Medisted bus driver is responsible to add/delete mediated devices to/from

Medisted -> Mediated

> VFIO group when devices are bound and unbound to the driver.
> 
> 2. Physical device driver interface
> This interface provides vendor driver the set APIs to manage physical
> device related work in its driver. APIs are :
> 
> * dev_attr_groups: attributes of the parent device.
> * mdev_attr_groups: attributes of the mediated device.
> * supported_type_groups: attributes to define supported type. This is
> 			 mandatory field.
> * create: to allocate basic resources in driver for a mediated device.

In which driver? It would be clearer to remove 'in driver' here.

> * remove: to free resources in driver when mediated device is destroyed.
> * open: open callback of mediated device
> * release: release callback of mediated device
> * read : read emulation callback.
> * write: write emulation callback.
> * mmap: mmap emulation callback.
> * ioctl: ioctl callback.

You only highlight 'mandatory field' for supported_type_groups. What
about the other fields? Are all of them optional? Please clarify, and also
stay consistent with the later code comment.
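
Judging from the mandatory-ops check later in this patch, the split seems
to be as follows (an inference from the code, not something the
description states):

	/* in mdev_register_device(): only these three are required */
	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
		return -EINVAL;

	/*
	 * Everything else (dev_attr_groups, mdev_attr_groups, open,
	 * release, read, write, mmap, ioctl) appears to be optional.
	 */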

> 
> Drivers should use these interfaces to register and unregister device to
> mdev core driver respectively:
> 
> extern int  mdev_register_device(struct device *dev,
>                                  const struct parent_ops *ops);
> extern void mdev_unregister_device(struct device *dev);
> 
> There are no locks to serialize above callbacks in mdev driver and
> vfio_mdev driver. If required, vendor driver can have locks to serialize
> above APIs in their driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I73a5084574270b14541c529461ea2f03c292d510
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/Makefile            |   1 +
>  drivers/vfio/mdev/Kconfig        |  11 ++
>  drivers/vfio/mdev/Makefile       |   4 +
>  drivers/vfio/mdev/mdev_core.c    | 372
> +++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev_driver.c  | 128 ++++++++++++++
>  drivers/vfio/mdev/mdev_private.h |  41 +++++
>  drivers/vfio/mdev/mdev_sysfs.c   | 296
> +++++++++++++++++++++++++++++++
>  include/linux/mdev.h             | 177 +++++++++++++++++++
>  9 files changed, 1031 insertions(+)
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev_core.c
>  create mode 100644 drivers/vfio/mdev/mdev_driver.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>  create mode 100644 include/linux/mdev.h
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
> 
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>  source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..4a23c13b6be4 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) +=
> vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_VFIO_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..93addace9a67
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,11 @@
> +
> +config VFIO_MDEV
> +    tristate "Mediated device driver framework"
> +    depends on VFIO
> +    default n
> +    help
> +        Provides a framework to virtualize devices which don't have SR_IOV
> +	capability built-in.

This statement is not accurate. A device can support SR-IOV but at the same
time use this mediated technology with the SR-IOV capability disabled.

> +	See Documentation/vfio-mdev/vfio-mediated-device.txt for more details.
> +
> +        If you don't know what do here, say N.
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..31bc04801d94
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,4 @@
> +
> +mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
> +
> +obj-$(CONFIG_VFIO_MDEV) += mdev.o
> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> new file mode 100644
> index 000000000000..7db5ec164aeb
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -0,0 +1,372 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +static LIST_HEAD(parent_list);
> +static DEFINE_MUTEX(parent_list_lock);
> +static struct class_compat *mdev_bus_compat_class;
> +
> +static int _find_mdev_device(struct device *dev, void *data)
> +{
> +	struct mdev_device *mdev;
> +
> +	if (!dev_is_mdev(dev))
> +		return 0;
> +
> +	mdev = to_mdev_device(dev);
> +
> +	if (uuid_le_cmp(mdev->uuid, *(uuid_le *)data) == 0)
> +		return 1;
> +
> +	return 0;
> +}
> +
> +static struct mdev_device *__find_mdev_device(struct parent_device *parent,
> +					      uuid_le uuid)

parent_find_mdev_device?

> +{
> +	struct device *dev;
> +
> +	dev = device_find_child(parent->dev, &uuid, _find_mdev_device);
> +	if (!dev)
> +		return NULL;
> +
> +	put_device(dev);
> +
> +	return to_mdev_device(dev);
> +}
> +
> +/* Should be called holding parent_list_lock */
> +static struct parent_device *__find_parent_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +
> +	list_for_each_entry(parent, &parent_list, next) {
> +		if (parent->dev == dev)
> +			return parent;
> +	}
> +	return NULL;
> +}
> +
> +static void mdev_release_parent(struct kref *kref)
> +{
> +	struct parent_device *parent = container_of(kref, struct parent_device,
> +						    ref);
> +	struct device *dev = parent->dev;
> +
> +	kfree(parent);
> +	put_device(dev);
> +}
> +
> +static
> +inline struct parent_device *mdev_get_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_get(&parent->ref);
> +
> +	return parent;
> +}
> +
> +static inline void mdev_put_parent(struct parent_device *parent)
> +{
> +	if (parent)
> +		kref_put(&parent->ref, mdev_release_parent);
> +}
> +
> +static int mdev_device_create_ops(struct kobject *kobj,
> +				  struct mdev_device *mdev)
> +{
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	ret = parent->ops->create(kobj, mdev);
> +	if (ret)
> +		return ret;
> +
> +	ret = sysfs_create_groups(&mdev->dev.kobj,
> +				  parent->ops->mdev_attr_groups);
> +	if (ret)
> +		parent->ops->remove(mdev);
> +
> +	return ret;
> +}
> +
> +static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
> +{


Can you add a comment here about when force_remove is expected, which
would help others understand immediately instead of walking through the
whole patch set?


> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	/*
> +	 * Vendor driver can return error if VMM or userspace application is
> +	 * using this mdev device.
> +	 */
> +	ret = parent->ops->remove(mdev);

What about passing the force_remove flag to the remove callback, so the
vendor driver can decide whether any forced cleanup is required?
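
Something like this, perhaps (a sketch of the suggestion, assuming the
remove callback signature is extended accordingly):

	/* let the vendor driver know whether this is a forced removal */
	ret = parent->ops->remove(mdev, force_remove);
	if (ret && !force_remove)
		return -EBUSY;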

> +	if (ret && !force_remove)
> +		return -EBUSY;
> +
> +	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
> +	return 0;
> +}
> +
> +static int mdev_device_remove_cb(struct device *dev, void *data)
> +{
> +	return mdev_device_remove(dev, data ? *(bool *)data : true);
> +}
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing parent device.
> + * @ops: Parent device operation structure to be registered.
> + *
> + * Add device to list of registered parent devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
> +{
> +	int ret = 0;
> +	struct parent_device *parent;
> +
> +	/* check for mandatory ops */
> +	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
> +		return -EINVAL;
> +
> +	dev = get_device(dev);
> +	if (!dev)
> +		return -EINVAL;
> +
> +	mutex_lock(&parent_list_lock);
> +
> +	/* Check for duplicate */
> +	parent = __find_parent_device(dev);
> +	if (parent) {
> +		ret = -EEXIST;
> +		goto add_dev_err;
> +	}
> +
> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
> +	if (!parent) {
> +		ret = -ENOMEM;
> +		goto add_dev_err;
> +	}
> +
> +	kref_init(&parent->ref);
> +
> +	parent->dev = dev;
> +	parent->ops = ops;
> +
> +	ret = parent_create_sysfs_files(parent);
> +	if (ret) {
> +		mutex_unlock(&parent_list_lock);
> +		mdev_put_parent(parent);
> +		return ret;
> +	}
> +
> +	ret = class_compat_create_link(mdev_bus_compat_class, dev, NULL);
> +	if (ret)
> +		dev_warn(dev, "Failed to create compatibility class link\n");
> +
> +	list_add(&parent->next, &parent_list);
> +	mutex_unlock(&parent_list_lock);
> +
> +	dev_info(dev, "MDEV: Registered\n");
> +	return 0;
> +
> +add_dev_err:
> +	mutex_unlock(&parent_list_lock);
> +	put_device(dev);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a parent device
> + * @dev: device structure representing parent device.
> + *
> + * Remove device from list of registered parent devices. Give a chance to free
> + * existing mediated devices for given device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +	bool force_remove = true;
> +
> +	mutex_lock(&parent_list_lock);
> +	parent = __find_parent_device(dev);
> +
> +	if (!parent) {
> +		mutex_unlock(&parent_list_lock);
> +		return;
> +	}
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	/*
> +	 * Remove parent from the list and remove "mdev_supported_types"
> +	 * sysfs files so that no new mediated device could be
> +	 * created for this parent
> +	 */
> +	list_del(&parent->next);
> +	parent_remove_sysfs_files(parent);
> +
> +	mutex_unlock(&parent_list_lock);
> +
> +	class_compat_remove_link(mdev_bus_compat_class, dev, NULL);
> +
> +	device_for_each_child(dev, (void *)&force_remove,
> +			      mdev_device_remove_cb);
> +	mdev_put_parent(parent);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	dev_dbg(&mdev->dev, "MDEV: destroying\n");
> +	kfree(mdev);
> +}
> +
> +int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
> +{
> +	int ret;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	struct mdev_type *type = to_mdev_type(kobj);
> +
> +	parent = mdev_get_parent(type->parent);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	/* Check for duplicate */
> +	mdev = __find_mdev_device(parent, uuid);
> +	if (mdev) {
> +		ret = -EEXIST;
> +		goto create_err;
> +	}
> +
> +	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> +	if (!mdev) {
> +		ret = -ENOMEM;
> +		goto create_err;
> +	}
> +
> +	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> +	mdev->parent = parent;
> +	kref_init(&mdev->ref);
> +
> +	mdev->dev.parent  = dev;
> +	mdev->dev.bus     = &mdev_bus_type;
> +	mdev->dev.release = mdev_device_release;
> +	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> +
> +	ret = device_register(&mdev->dev);
> +	if (ret) {
> +		put_device(&mdev->dev);
> +		goto create_err;
> +	}
> +
> +	ret = mdev_device_create_ops(kobj, mdev);
> +	if (ret)
> +		goto create_failed;
> +
> +	ret = mdev_create_sysfs_files(&mdev->dev, type);
> +	if (ret) {
> +		mdev_device_remove_ops(mdev, true);
> +		goto create_failed;
> +	}
> +
> +	mdev->type_kobj = kobj;
> +	dev_dbg(&mdev->dev, "MDEV: created\n");
> +
> +	return ret;
> +
> +create_failed:
> +	device_unregister(&mdev->dev);
> +
> +create_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_remove(struct device *dev, bool force_remove)
> +{
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	struct mdev_type *type;
> +	int ret = 0;
> +
> +	if (!dev_is_mdev(dev))
> +		return 0;
> +
> +	mdev = to_mdev_device(dev);
> +	parent = mdev->parent;
> +	type = to_mdev_type(mdev->type_kobj);
> +
> +	ret = mdev_device_remove_ops(mdev, force_remove);
> +	if (ret)
> +		return ret;
> +
> +	mdev_remove_sysfs_files(dev, type);
> +	device_unregister(dev);
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +static int __init mdev_init(void)
> +{
> +	int ret;
> +
> +	ret = mdev_bus_register();
> +	if (ret) {
> +		pr_err("Failed to register mdev bus\n");
> +		return ret;
> +	}
> +
> +	mdev_bus_compat_class = class_compat_register("mdev_bus");
> +	if (!mdev_bus_compat_class) {
> +		mdev_bus_unregister();
> +		return -ENOMEM;
> +	}
> +
> +	/*
> +	 * Attempt to load known vfio_mdev.  This gives us a working environment
> +	 * without the user needing to explicitly load vfio_mdev driver.
> +	 */
> +	request_module_nowait("vfio_mdev");
> +
> +	return ret;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	class_compat_unregister(mdev_bus_compat_class);
> +	mdev_bus_unregister();
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
> new file mode 100644
> index 000000000000..7768ef87f528
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_driver.c
> @@ -0,0 +1,128 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdev_attach_iommu(struct mdev_device *mdev)
> +{
> +	int ret;
> +	struct iommu_group *group;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	ret = iommu_group_add_device(group, &mdev->dev);
> +	if (ret) {
> +		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return ret;
> +}
> +
> +static void mdev_detach_iommu(struct mdev_device *mdev)
> +{
> +	iommu_group_remove_device(&mdev->dev);
> +	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdev_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	ret = mdev_attach_iommu(mdev);
> +	if (ret) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return ret;
> +	}
> +
> +	if (drv && drv->probe)
> +		ret = drv->probe(dev);
> +
> +	if (ret)
> +		mdev_detach_iommu(mdev);
> +
> +	return ret;
> +}
> +
> +static int mdev_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdev_detach_iommu(mdev);
> +
> +	return 0;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.probe		= mdev_probe,
> +	.remove		= mdev_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/*
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: module owner of driver to be registered
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/*
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..000c93fcfdbd
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,41 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +struct mdev_type {
> +	struct kobject kobj;
> +	struct kobject *devices_kobj;
> +	struct parent_device *parent;
> +	struct list_head next;
> +	struct attribute_group *group;
> +};
> +
> +#define to_mdev_type_attr(_attr)	\
> +	container_of(_attr, struct mdev_type_attribute, attr)
> +#define to_mdev_type(_kobj)		\
> +	container_of(_kobj, struct mdev_type, kobj)
> +
> +int  parent_create_sysfs_files(struct parent_device *parent);
> +void parent_remove_sysfs_files(struct parent_device *parent);
> +
> +int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type);
> +void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
> +
> +int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid);
> +int  mdev_device_remove(struct device *dev, bool force_remove);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> new file mode 100644
> index 000000000000..426e35cf79d0
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
> @@ -0,0 +1,296 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Static functions */
> +
> +static ssize_t mdev_type_attr_show(struct kobject *kobj,
> +				     struct attribute *__attr, char *buf)
> +{
> +	struct mdev_type_attribute *attr = to_mdev_type_attr(__attr);
> +	struct mdev_type *type = to_mdev_type(kobj);
> +	ssize_t ret = -EIO;
> +
> +	if (attr->show)
> +		ret = attr->show(kobj, type->parent->dev, buf);
> +	return ret;
> +}
> +
> +static ssize_t mdev_type_attr_store(struct kobject *kobj,
> +				      struct attribute *__attr,
> +				      const char *buf, size_t count)
> +{
> +	struct mdev_type_attribute *attr = to_mdev_type_attr(__attr);
> +	struct mdev_type *type = to_mdev_type(kobj);
> +	ssize_t ret = -EIO;
> +
> +	if (attr->store)
> +		ret = attr->store(&type->kobj, type->parent->dev, buf, count);
> +	return ret;
> +}
> +
> +static const struct sysfs_ops mdev_type_sysfs_ops = {
> +	.show = mdev_type_attr_show,
> +	.store = mdev_type_attr_store,
> +};
> +
> +static ssize_t create_store(struct kobject *kobj, struct device *dev,
> +			    const char *buf, size_t count)
> +{
> +	char *str;
> +	uuid_le uuid;
> +	int ret;
> +
> +	if (count < UUID_STRING_LEN)
> +		return -EINVAL;
> +
> +	str = kstrndup(buf, count, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	ret = uuid_le_to_bin(str, &uuid);
> +	if (!ret) {
> +
> +		ret = mdev_device_create(kobj, dev, uuid);
> +		if (ret)
> +			pr_err("mdev_create: Failed to create mdev device\n");
> +		else
> +			ret = count;
> +	}
> +
> +	kfree(str);
> +	return ret;
> +}
> +
> +MDEV_TYPE_ATTR_WO(create);
> +
> +static void mdev_type_release(struct kobject *kobj)
> +{
> +	struct mdev_type *type = to_mdev_type(kobj);
> +
> +	pr_debug("Releasing group %s\n", kobj->name);
> +	kfree(type);
> +}
> +
> +static struct kobj_type mdev_type_ktype = {
> +	.sysfs_ops = &mdev_type_sysfs_ops,
> +	.release = mdev_type_release,
> +};
> +
> +struct mdev_type *add_mdev_supported_type(struct parent_device *parent,
> +					  struct attribute_group *group)
> +{
> +	struct mdev_type *type;
> +	int ret;
> +
> +	if (!group->name) {
> +		pr_err("%s: Type name empty!\n", __func__);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	type = kzalloc(sizeof(*type), GFP_KERNEL);
> +	if (!type)
> +		return ERR_PTR(-ENOMEM);
> +
> +	type->kobj.kset = parent->mdev_types_kset;
> +
> +	ret = kobject_init_and_add(&type->kobj, &mdev_type_ktype, NULL,
> +				   "%s-%s", dev_driver_string(parent->dev),
> +				   group->name);
> +	if (ret) {
> +		kfree(type);
> +		return ERR_PTR(ret);
> +	}
> +
> +	ret = sysfs_create_file(&type->kobj, &mdev_type_attr_create.attr);
> +	if (ret)
> +		goto attr_create_failed;
> +
> +	type->devices_kobj = kobject_create_and_add("devices", &type->kobj);
> +	if (!type->devices_kobj) {
> +		ret = -ENOMEM;
> +		goto attr_devices_failed;
> +	}
> +
> +	ret = sysfs_create_files(&type->kobj,
> +				 (const struct attribute **)group->attrs);
> +	if (ret) {
> +		ret = -ENOMEM;
> +		goto attrs_failed;
> +	}
> +
> +	type->group = group;
> +	type->parent = parent;
> +	return type;
> +
> +attrs_failed:
> +	kobject_put(type->devices_kobj);
> +attr_devices_failed:
> +	sysfs_remove_file(&type->kobj, &mdev_type_attr_create.attr);
> +attr_create_failed:
> +	kobject_del(&type->kobj);
> +	kobject_put(&type->kobj);
> +	return ERR_PTR(ret);
> +}
> +
> +static void remove_mdev_supported_type(struct mdev_type *type)
> +{
> +	sysfs_remove_files(&type->kobj,
> +			   (const struct attribute **)type->group->attrs);
> +	kobject_put(type->devices_kobj);
> +	sysfs_remove_file(&type->kobj, &mdev_type_attr_create.attr);
> +	kobject_del(&type->kobj);
> +	kobject_put(&type->kobj);
> +}
> +
> +static int add_mdev_supported_type_groups(struct parent_device *parent)
> +{
> +	int i;
> +
> +	for (i = 0; parent->ops->supported_type_groups[i]; i++) {
> +		struct mdev_type *type;
> +
> +		type = add_mdev_supported_type(parent,
> +					parent->ops->supported_type_groups[i]);
> +		if (IS_ERR(type)) {
> +			struct mdev_type *ltype, *tmp;
> +
> +			list_for_each_entry_safe(ltype, tmp, &parent->type_list,
> +						  next) {
> +				list_del(&ltype->next);
> +				remove_mdev_supported_type(ltype);
> +			}
> +			return PTR_ERR(type);
> +		}
> +		list_add(&type->next, &parent->type_list);
> +	}
> +	return 0;
> +}
> +
> +/* mdev sysfs Functions */
> +
> +void parent_remove_sysfs_files(struct parent_device *parent)
> +{
> +	struct mdev_type *type, *tmp;
> +
> +	list_for_each_entry_safe(type, tmp, &parent->type_list, next) {
> +		list_del(&type->next);
> +		remove_mdev_supported_type(type);
> +	}
> +
> +	sysfs_remove_groups(&parent->dev->kobj, parent->ops->dev_attr_groups);
> +	kset_unregister(parent->mdev_types_kset);
> +}
> +
> +int parent_create_sysfs_files(struct parent_device *parent)
> +{
> +	int ret;
> +
> +	parent->mdev_types_kset = kset_create_and_add("mdev_supported_types",
> +					       NULL, &parent->dev->kobj);
> +
> +	if (!parent->mdev_types_kset)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&parent->type_list);
> +
> +	ret = sysfs_create_groups(&parent->dev->kobj,
> +				  parent->ops->dev_attr_groups);
> +	if (ret)
> +		goto create_err;
> +
> +	ret = add_mdev_supported_type_groups(parent);
> +	if (ret)
> +		sysfs_remove_groups(&parent->dev->kobj,
> +				    parent->ops->dev_attr_groups);
> +	else
> +		return ret;
> +
> +create_err:
> +	kset_unregister(parent->mdev_types_kset);
> +	return ret;
> +}
> +
> +static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	unsigned long val;
> +
> +	if (kstrtoul(buf, 0, &val) < 0)
> +		return -EINVAL;
> +
> +	if (val && device_remove_file_self(dev, attr)) {
> +		int ret;
> +
> +		ret = mdev_device_remove(dev, false);
> +		if (ret) {
> +			device_create_file(dev, attr);
> +			return ret;
> +		}
> +	}
> +
> +	return count;
> +}
> +
> +static DEVICE_ATTR_WO(remove);
> +
> +static const struct attribute *mdev_device_attrs[] = {
> +	&dev_attr_remove.attr,
> +	NULL,
> +};
> +
> +int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type)
> +{
> +	int ret;
> +
> +	ret = sysfs_create_files(&dev->kobj, mdev_device_attrs);
> +	if (ret) {
> +		pr_err("Failed to create remove sysfs entry\n");
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_link(type->devices_kobj, &dev->kobj, dev_name(dev));
> +	if (ret) {
> +		pr_err("Failed to create symlink in types\n");

looks wrong place...

> +		goto device_link_failed;
> +	}
> +
> +	ret = sysfs_create_link(&dev->kobj, &type->kobj, "mdev_type");
> +	if (ret) {
> +		pr_err("Failed to create symlink in device directory\n");

exchange with above.

> +		goto type_link_failed;
> +	}
> +
> +	return ret;
> +
> +type_link_failed:
> +	sysfs_remove_link(type->devices_kobj, dev_name(dev));
> +device_link_failed:
> +	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
> +	return ret;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type)
> +{
> +	sysfs_remove_link(&dev->kobj, "mdev_type");
> +	sysfs_remove_link(type->devices_kobj, dev_name(dev));
> +	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
> +
> +}
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..727209b2a67f
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,177 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +#include <uapi/linux/vfio.h>
> +
> +struct parent_device;
> +
> +/* Mediated device */
> +struct mdev_device {
> +	struct device		dev;
> +	struct parent_device	*parent;
> +	uuid_le			uuid;
> +	void			*driver_data;
> +
> +	/* internal only */
> +	struct kref		ref;
> +	struct list_head	next;
> +	struct kobject		*type_kobj;
> +};
> +
> +
> +/**
> + * struct parent_ops - Structure to be registered for each parent device to
> + * register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Attributes of the parent device.
> + * @mdev_attr_groups:	Attributes of the mediated device.
> + * @supported_type_groups: Attributes to define supported types. It is mandatory
> + *			to provide supported types.
> + * @create:		Called to allocate basic resources in parent device's
> + *			driver for a particular mediated device. It is
> + *			mandatory to provide create ops.
> + *			@kobj: kobject of type for which 'create' is called.
> + *			@mdev: mdev_device structure of the mediated device
> + *			       that is being created
> + *			Returns integer: success (0) or error (< 0)
> + * @remove:		Called to free resources in parent device's driver for
> + *			a mediated device. It is mandatory to provide 'remove'
> + *			ops.
> + *			@mdev: mdev_device device structure which is being
> + *			       destroyed
> + *			Returns integer: success (0) or error (< 0)
> + * @open:		Open mediated device.
> + *			@mdev: mediated device.
> + *			Returns integer: success (0) or error (< 0)
> + * @release:		release mediated device
> + *			@mdev: mediated device.
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number of bytes to read
> + *			@ppos: address.
> + *			Returns number of bytes read on success or error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number of bytes to be written
> + *			@ppos: address.
> + *			Returns number of bytes written on success or error.
> + * @ioctl:		IOCTL callback
> + *			@mdev: mediated device structure
> + *			@cmd: ioctl command
> + *			@arg: arguments to ioctl
> + * @mmap:		mmap callback
> + * A parent device that supports mediated devices should be registered with
> + * the mdev module along with its parent_ops structure.
> + */
> +
> +struct parent_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +	struct attribute_group **supported_type_groups;
> +
> +	int     (*create)(struct kobject *kobj, struct mdev_device *mdev);
> +	int     (*remove)(struct mdev_device *mdev);
> +	int     (*open)(struct mdev_device *mdev);
> +	void    (*release)(struct mdev_device *mdev);
> +	ssize_t (*read)(struct mdev_device *mdev, char __user *buf,
> +			size_t count, loff_t *ppos);
> +	ssize_t (*write)(struct mdev_device *mdev, const char __user *buf,
> +			 size_t count, loff_t *ppos);
> +	ssize_t (*ioctl)(struct mdev_device *mdev, unsigned int cmd,
> +			 unsigned long arg);
> +	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
> +};
> +
> +/* Parent Device */
> +struct parent_device {
> +	struct device		*dev;
> +	const struct parent_ops	*ops;
> +
> +	/* internal */
> +	struct kref		ref;
> +	struct list_head	next;
> +	struct kset *mdev_types_kset;
> +	struct list_head	type_list;
> +};
> +
> +/* interface for exporting mdev supported type attributes */
> +struct mdev_type_attribute {
> +	struct attribute attr;
> +	ssize_t (*show)(struct kobject *kobj, struct device *dev, char *buf);
> +	ssize_t (*store)(struct kobject *kobj, struct device *dev,
> +			 const char *buf, size_t count);
> +};
> +
> +#define MDEV_TYPE_ATTR(_name, _mode, _show, _store)		\
> +struct mdev_type_attribute mdev_type_attr_##_name =		\
> +	__ATTR(_name, _mode, _show, _store)
> +#define MDEV_TYPE_ATTR_RW(_name) \
> +	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_RW(_name)
> +#define MDEV_TYPE_ATTR_RO(_name) \
> +	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_RO(_name)
> +#define MDEV_TYPE_ATTR_WO(_name) \
> +	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_WO(_name)
> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}
> +
> +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
> +{
> +	return mdev->driver_data;
> +}
> +
> +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
> +{
> +	mdev->driver_data = data;
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct parent_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +#endif /* MDEV_H */
> --
> 2.7.0

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v9 02/12] vfio: VFIO based driver for Mediated devices
  2016-10-17 21:22 ` [PATCH v9 02/12] vfio: VFIO based driver for Mediated devices Kirti Wankhede
@ 2016-10-26  6:57   ` Tian, Kevin
  2016-10-26 15:01     ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Tian, Kevin @ 2016-10-26  6:57 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Song, Jike, bjsdjshi, linux-kernel

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Tuesday, October 18, 2016 5:22 AM
> 
> vfio_mdev driver registers with mdev core driver.
> MDEV core driver creates mediated device and calls probe routine of

use same case - either 'mdev core' or 'MDEV core'

> vfio_mdev driver for each device.
> Probe routine of vfio_mdev driver adds mediated device to VFIO core module
> 
> This driver forms a shim layer that passes VFIO device operations through
> to the vendor driver for mediated devices.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
> ---
>  drivers/vfio/mdev/Kconfig     |   7 ++
>  drivers/vfio/mdev/Makefile    |   1 +
>  drivers/vfio/mdev/vfio_mdev.c | 148
> ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 156 insertions(+)
>  create mode 100644 drivers/vfio/mdev/vfio_mdev.c
> 
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> index 93addace9a67..6cef0c4d2ceb 100644
> --- a/drivers/vfio/mdev/Kconfig
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -9,3 +9,10 @@ config VFIO_MDEV
>  	See Documentation/vfio-mdev/vfio-mediated-device.txt for more details.
> 
>          If you don't know what do here, say N.
> +
> +config VFIO_MDEV_DEVICE
> +    tristate "VFIO support for Mediated devices"
> +    depends on VFIO && VFIO_MDEV
> +    default n
> +    help
> +        VFIO based driver for mediated devices.
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> index 31bc04801d94..fa2d5ea466ee 100644
> --- a/drivers/vfio/mdev/Makefile
> +++ b/drivers/vfio/mdev/Makefile
> @@ -2,3 +2,4 @@
>  mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
> 
>  obj-$(CONFIG_VFIO_MDEV) += mdev.o
> +obj-$(CONFIG_VFIO_MDEV_DEVICE) += vfio_mdev.o
> diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
> new file mode 100644
> index 000000000000..b7b47604ce7a
> --- /dev/null
> +++ b/drivers/vfio/mdev/vfio_mdev.c
> @@ -0,0 +1,148 @@
> +/*
> + * VFIO based driver for Mediated device
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/vfio.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "NVIDIA Corporation"
> +#define DRIVER_DESC     "VFIO based driver for Mediated device"
> +
> +static int vfio_mdev_open(void *device_data)
> +{
> +	struct mdev_device *mdev = device_data;
> +	struct parent_device *parent = mdev->parent;
> +	int ret;
> +
> +	if (unlikely(!parent->ops->open))
> +		return -EINVAL;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +
> +	ret = parent->ops->open(mdev);
> +	if (ret)
> +		module_put(THIS_MODULE);
> +
> +	return ret;
> +}
> +
> +static void vfio_mdev_release(void *device_data)
> +{
> +	struct mdev_device *mdev = device_data;
> +	struct parent_device *parent = mdev->parent;
> +
> +	if (parent->ops->release)

likely()

> +		parent->ops->release(mdev);
> +
> +	module_put(THIS_MODULE);
> +}
> +
> +static long vfio_mdev_unlocked_ioctl(void *device_data,
> +				     unsigned int cmd, unsigned long arg)
> +{
> +	struct mdev_device *mdev = device_data;
> +	struct parent_device *parent = mdev->parent;
> +
> +	if (unlikely(!parent->ops->ioctl))
> +		return -EINVAL;
> +
> +	return parent->ops->ioctl(mdev, cmd, arg);
> +}
> +
> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
> +			      size_t count, loff_t *ppos)
> +{
> +	struct mdev_device *mdev = device_data;
> +	struct parent_device *parent = mdev->parent;
> +
> +	if (unlikely(!parent->ops->read))
> +		return -EINVAL;
> +
> +	return parent->ops->read(mdev, buf, count, ppos);
> +}
> +
> +static ssize_t vfio_mdev_write(void *device_data, const char __user *buf,
> +			       size_t count, loff_t *ppos)
> +{
> +	struct mdev_device *mdev = device_data;
> +	struct parent_device *parent = mdev->parent;
> +
> +	if (unlikely(!parent->ops->write))
> +		return -EINVAL;
> +
> +	return parent->ops->write(mdev, buf, count, ppos);
> +}
> +
> +static int vfio_mdev_mmap(void *device_data, struct vm_area_struct *vma)
> +{
> +	struct mdev_device *mdev = device_data;
> +	struct parent_device *parent = mdev->parent;
> +
> +	if (unlikely(!parent->ops->mmap))
> +		return -EINVAL;
> +
> +	return parent->ops->mmap(mdev, vma);
> +}
> +
> +static const struct vfio_device_ops vfio_mdev_dev_ops = {
> +	.name		= "vfio-mdev",
> +	.open		= vfio_mdev_open,
> +	.release	= vfio_mdev_release,
> +	.ioctl		= vfio_mdev_unlocked_ioctl,
> +	.read		= vfio_mdev_read,
> +	.write		= vfio_mdev_write,
> +	.mmap		= vfio_mdev_mmap,
> +};
> +
> +int vfio_mdev_probe(struct device *dev)
> +{
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +
> +	return vfio_add_group_dev(dev, &vfio_mdev_dev_ops, mdev);
> +}
> +
> +void vfio_mdev_remove(struct device *dev)
> +{
> +	vfio_del_group_dev(dev);
> +}
> +
> +struct mdev_driver vfio_mdev_driver = {
> +	.name	= "vfio_mdev",
> +	.probe	= vfio_mdev_probe,
> +	.remove	= vfio_mdev_remove,
> +};
> +
> +static int __init vfio_mdev_init(void)
> +{
> +	return mdev_register_driver(&vfio_mdev_driver, THIS_MODULE);
> +}
> +
> +static void __exit vfio_mdev_exit(void)
> +{
> +	mdev_unregister_driver(&vfio_mdev_driver);
> +}
> +
> +module_init(vfio_mdev_init)
> +module_exit(vfio_mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> --
> 2.7.0

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-24  2:32       ` Alex Williamson
@ 2016-10-26  7:19         ` Tian, Kevin
  2016-10-26 15:06           ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Tian, Kevin @ 2016-10-26  7:19 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Song, Jike, bjsdjshi,
	linux-kernel

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Monday, October 24, 2016 10:32 AM
> 
> > >> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> > >> -			     int prot, bool do_accounting)
> > >> +static long __vfio_unpin_pages_remote(struct vfio_iommu *iommu,
> > >> +				      unsigned long pfn, long npage, int prot,
> > >> +				      bool do_accounting)
> > >
> > > Have you noticed that it's kind of confusing that
> > > __vfio_{un}pin_pages_remote() uses current, which does a
> > > get_user_pages_fast() while "local" uses a provided task_struct and
> > > uses get_user_pages_*remote*()?  And also what was effectively local
> > > (ie. we're pinning for our own use here) is now "remote" and pinning
> > > for a remote, vendor driver consumer, is now "local".  It's not very
> > > intuitive.
> > >

I questioned this confusing naming in v8 too...

> >
> > 'local' in local_domain was suggested to describe the domain for local
> > page tracking. Earlier suggestions to have 'mdev' or 'noimmu' in this
> > name were discarded. May be we should revisit what the name should be.
> > Any suggestion?
> >
> > For local_domain, to pin pages, flow is:
> >
> > for local_domain
> >     |- vfio_pin_pages()
> >         |- vfio_iommu_type1_pin_pages()
> >             |- __vfio_pin_page_local()
> >                 |-  vaddr_get_pfn(task->mm)
> >                     |- get_user_pages_remote()
> >
> > __vfio_pin_page_local() --> get_user_pages_remote()
> 
> 
> In vfio.c we have the concept of an external user, perhaps that could
> be continued here.  An mdev driver would be an external, or remote
> pinning.
> 

I prefer to use remote here. It's aligned with the underlying mm operations.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-19 21:02   ` Alex Williamson
  2016-10-20 20:17     ` Kirti Wankhede
@ 2016-10-26  7:53     ` Tian, Kevin
  2016-10-26 15:16       ` Alex Williamson
  2016-10-26  7:54     ` Tian, Kevin
  2 siblings, 1 reply; 73+ messages in thread
From: Tian, Kevin @ 2016-10-26  7:53 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Song, Jike, bjsdjshi,
	linux-kernel

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, October 20, 2016 5:03 AM
> > @@ -83,6 +92,21 @@ struct vfio_group {
> >  };
> >
> >  /*
> > + * Guest RAM pinning working set or DMA target
> > + */
> > +struct vfio_pfn {
> > +	struct rb_node		node;
> > +	unsigned long		vaddr;		/* virtual addr */
> > +	dma_addr_t		iova;		/* IOVA */
> > +	unsigned long		pfn;		/* Host pfn */
> > +	int			prot;
> > +	atomic_t		ref_count;
> > +};
> 
> Somehow we're going to need to fit an invalidation callback here too.
> How would we handle a case where there are multiple mdev devices, from
> different vendor drivers, that all have the same pfn pinned?  I'm
> already concerned about the per pfn overhead we're introducing here so
> clearly we cannot store an invalidation callback per pinned page, per
> vendor driver.  Perhaps invalidations should be done using a notifier
> chain per vfio_iommu, the vendor drivers are required to register on
> that chain (fail pinning with empty notifier list) user unmapping
> will be broadcast to the notifier chain, the vendor driver will be
> responsible for deciding if each unmap is relevant to them (potentially
> it's for a pinning from another driver).
> 
> I expect we also need to enforce that vendors perform a synchronous
> unmap such that after returning from the notifier list call, the
> vfio_pfn should no longer exist.  If it does we might need to BUG_ON.
> Also be careful to pay attention to the locking of the notifier vs
> unpin callbacks to avoid deadlocks.
> 

What about just requesting the vendor driver to provide a callback in the
parent device ops?

Curious in which scenario the user application (say QEMU here) may
unmap memory pages which are still pinned by the vendor driver... Is it
purely about a corner case which we want to handle elegantly?

If yes, a possibly simpler way is to force destruction of the mdev instead
of asking the vendor driver to take care of each invalidation request in
such a situation, since the mdev device won't be in a usable state anymore
anyway... (sorry if I missed the key problem here.)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-19 21:02   ` Alex Williamson
  2016-10-20 20:17     ` Kirti Wankhede
  2016-10-26  7:53     ` Tian, Kevin
@ 2016-10-26  7:54     ` Tian, Kevin
  2016-10-26 15:19       ` Alex Williamson
  2 siblings, 1 reply; 73+ messages in thread
From: Tian, Kevin @ 2016-10-26  7:54 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Song, Jike, bjsdjshi,
	linux-kernel

> From: Tian, Kevin
> Sent: Wednesday, October 26, 2016 3:54 PM
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, October 20, 2016 5:03 AM
> > > @@ -83,6 +92,21 @@ struct vfio_group {
> > >  };
> > >
> > >  /*
> > > + * Guest RAM pinning working set or DMA target
> > > + */
> > > +struct vfio_pfn {
> > > +	struct rb_node		node;
> > > +	unsigned long		vaddr;		/* virtual addr */
> > > +	dma_addr_t		iova;		/* IOVA */
> > > +	unsigned long		pfn;		/* Host pfn */
> > > +	int			prot;
> > > +	atomic_t		ref_count;
> > > +};
> >
> > Somehow we're going to need to fit an invalidation callback here too.
> > How would we handle a case where there are multiple mdev devices, from
> > different vendor drivers, that all have the same pfn pinned?  I'm
> > already concerned about the per pfn overhead we're introducing here so
> > clearly we cannot store an invalidation callback per pinned page, per
> > vendor driver.  Perhaps invalidations should be done using a notifier
> > chain per vfio_iommu, the vendor drivers are required to register on
> > that chain (fail pinning with empty notifier list) user unmapping
> > will be broadcast to the notifier chain, the vendor driver will be
> > responsible for deciding if each unmap is relevant to them (potentially
> > it's for a pinning from another driver).
> >
> > I expect we also need to enforce that vendors perform a synchronous
> > unmap such that after returning from the notifier list call, the
> > vfio_pfn should no longer exist.  If it does we might need to BUG_ON.
> > Also be careful to pay attention to the locking of the notifier vs
> > unpin callbacks to avoid deadlocks.
> >
> 
> What about just requesting vendor driver to provide a callback in parent
> device ops?
> 
> Curious in which scenario the user application (say Qemu here) may
> unmap memory pages which are still pinned by vendor driver... Is it
> purely about a corner case which we want to handle elegantly?
> 
> If yes, possibly a simpler way is to force destroying mdev instead of
> asking vendor driver to take care of each invalidation request under
> such situation. Since anyway the mdev device won't be in an usable
> state anymore... (sorry if I missed the key problem here.)
> 

or calling reset callback of parent device driver, if we don't want to
break libvirt's expectation by blindly removing mdev device...

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 01/12] vfio: Mediated device Core driver
  2016-10-26  6:52   ` Tian, Kevin
@ 2016-10-26 14:58     ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-26 14:58 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Song, Jike, bjsdjshi, linux-kernel


>> Medisted bus driver is responsible to add/delete mediated devices to/from
> 
> Medisted -> Mediated
>

Thanks for pointing out the typo. Correcting it.


>> VFIO group when devices are bound and unbound to the driver.
>>
>> 2. Physical device driver interface
>> This interface provides vendor driver the set APIs to manage physical
>> device related work in its driver. APIs are :
>>
>> * dev_attr_groups: attributes of the parent device.
>> * mdev_attr_groups: attributes of the mediated device.
>> * supported_type_groups: attributes to define supported type. This is
>> 			 mandatory field.
>> * create: to allocate basic resources in driver for a mediated device.
> 
> in 'which driver'? it should be clear to remove 'in driver' here
> 
>> * remove: to free resources in driver when mediated device is destroyed.
>> * open: open callback of mediated device
>> * release: release callback of mediated device
>> * read : read emulation callback.
>> * write: write emulation callback.
>> * mmap: mmap emulation callback.
>> * ioctl: ioctl callback.
> 
> You only highlight 'mandatory field' for supported_type_groups. What
> about other fields? Are all of them optional? Please clarify and also
> stay consistent to later code comment.
> 

'create' and 'remove' are mandatory; I'm updating the description here. The
rest are not cross-checked in the mdev core driver the way 'create' and
'remove' are, so yes, they are optional. If a vendor driver doesn't want to
support an emulated region, it doesn't need the read/write callbacks.
Similarly, if a vendor driver doesn't want to support an mmap region, it
doesn't need the mmap callback.

Code comments are consistent with this description.
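
As a quick illustration (not part of the patch; the my_* names below are
hypothetical placeholders), a vendor driver that wires up only the mandatory
callbacks would register roughly like this:

#include <linux/module.h>
#include <linux/mdev.h>

static struct attribute *my_type_attrs[] = {
	NULL,			/* 'name', 'device_api', etc. would go here */
};

static struct attribute_group my_type_group = {
	.name  = "type1",	/* hypothetical type name */
	.attrs = my_type_attrs,
};

static struct attribute_group *my_type_groups[] = {
	&my_type_group,
	NULL,
};

static int my_create(struct kobject *kobj, struct mdev_device *mdev)
{
	/* allocate vendor-specific state for this mdev instance */
	return 0;
}

static int my_remove(struct mdev_device *mdev)
{
	/* free whatever my_create() allocated */
	return 0;
}

static const struct parent_ops my_parent_ops = {
	.owner			= THIS_MODULE,
	.supported_type_groups	= my_type_groups,	/* mandatory */
	.create			= my_create,		/* mandatory */
	.remove			= my_remove,		/* mandatory */
	/* open/release/read/write/mmap/ioctl left NULL: optional */
};

/* from the vendor driver's probe path:
 *	ret = mdev_register_device(dev, &my_parent_ops);
 */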

...
>> +
>> +config VFIO_MDEV
>> +    tristate "Mediated device driver framework"
>> +    depends on VFIO
>> +    default n
>> +    help
>> +        Provides a framework to virtualize devices which don't have SR_IOV
>> +	capability built-in.
> 
> This statement is not accurate. A device can support SR-IOV, but in the same
> time using this mediated technology w/ SR-IOV capability disabled.
> 

If SR-IOV is supported, why would a user use this framework? SR-IOV would
give better performance.

...
>> +
>> +static struct mdev_device *__find_mdev_device(struct parent_device *parent,
>> +					      uuid_le uuid)
> 
> parent_find_mdev_device?
> 

This function searches for an mdev device with the given UUID, so I think
it's consistent with what we have below for the parent, __find_parent_device().

...

>> +static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
>> +{
> 
> 
> Can you add some comment here about when force_remove may be expected
> here, which would help others understand immediately instead of walking through
> the whole patch set?
>


mdev_device_remove_ops() gets called from sysfs's 'remove' and when the
parent device is being unregistered from the mdev framework.
- 'force_remove' is set to 'false' when called from sysfs's 'remove', which
means that if the mdev device is active, i.e. in use by a VMM or a userspace
application, the vendor driver can return an error and the device is not
removed.
- 'force_remove' is set to 'true' when called from mdev_unregister_device(),
which indicates that the parent device is being removed from the mdev
framework, so the mdev device is removed forcefully.

> 
>> +	struct parent_device *parent = mdev->parent;
>> +	int ret;
>> +
>> +	/*
>> +	 * Vendor driver can return error if VMM or userspace application is
>> +	 * using this mdev device.
>> +	 */
>> +	ret = parent->ops->remove(mdev);
> 
> what about passing force_remove flag to remove callback, so vendor driver
> can decide whether any force cleanup required?
>

'remove' called from sysfs is asynchronous, so the vendor driver can return
a failure there if it finds that the mdev device is actively being used.

mdev_unregister_device() is called from the vendor driver itself when the
device is being unbound or the driver is being unloaded. In that case the
vendor driver knows it is in its own teardown path.

So I feel there is no need to pass the force_remove flag to the 'remove'
callback.
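
A rough sketch of how a vendor driver's 'remove' could behave under that
model (my_state and its 'opened' flag are hypothetical, not from this patch):

static int my_remove(struct mdev_device *mdev)
{
	struct my_state *state = mdev_get_drvdata(mdev);

	/*
	 * A sysfs-initiated remove may arrive while the device is still
	 * open; refuse it and let userspace retry later.  In the vendor
	 * driver's own teardown path the device has already been quiesced
	 * before mdev_unregister_device() is called, so this won't trigger.
	 */
	if (state && state->opened)
		return -EBUSY;

	kfree(state);
	mdev_set_drvdata(mdev, NULL);
	return 0;
}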



>> +int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type)
>> +{
>> +	int ret;
>> +
>> +	ret = sysfs_create_files(&dev->kobj, mdev_device_attrs);
>> +	if (ret) {
>> +		pr_err("Failed to create remove sysfs entry\n");
>> +		return ret;
>> +	}
>> +
>> +	ret = sysfs_create_link(type->devices_kobj, &dev->kobj, dev_name(dev));
>> +	if (ret) {
>> +		pr_err("Failed to create symlink in types\n");
> 
> looks wrong place...
>

No, this is correct. The above call creates a symlink in the
mdev_supported_types/<type>/devices directory.

>> +		goto device_link_failed;
>> +	}
>> +
>> +	ret = sysfs_create_link(&dev->kobj, &type->kobj, "mdev_type");
>> +	if (ret) {
>> +		pr_err("Failed to create symlink in device directory\n");
> 
> exchange with above.
> 
Again, this is also correct. The above creates the 'mdev_type' symlink in
the mdev device's directory.

Although both are correct, I'm removing these error prints; a failure can be
seen from the return value or from the warnings the sysfs functions emit.
The resulting links are sketched below.
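
For reference, the resulting layout for one mdev device looks roughly like
this (<parent> is the parent device's sysfs directory, <uuid> the mdev name):

  <parent>/mdev_supported_types/<driver>-<type>/devices/<uuid>
      -> symlink to <parent>/<uuid>
  <parent>/<uuid>/mdev_type
      -> symlink to <parent>/mdev_supported_types/<driver>-<type>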

Kirti.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 02/12] vfio: VFIO based driver for Mediated devices
  2016-10-26  6:57   ` Tian, Kevin
@ 2016-10-26 15:01     ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-26 15:01 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Song, Jike, bjsdjshi, linux-kernel


>> +static void vfio_mdev_release(void *device_data)
>> +{
>> +	struct mdev_device *mdev = device_data;
>> +	struct parent_device *parent = mdev->parent;
>> +
>> +	if (parent->ops->release)
> 
> likely()
> 
>> +		parent->ops->release(mdev);
>> +
>> +	module_put(THIS_MODULE);
>> +}
>> +

Thanks for pointing that out. Fixing this in next set of patch.

Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-26  7:19         ` Tian, Kevin
@ 2016-10-26 15:06           ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-26 15:06 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Song, Jike, bjsdjshi,
	linux-kernel



On 10/26/2016 12:49 PM, Tian, Kevin wrote:
>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>> Sent: Monday, October 24, 2016 10:32 AM
>>
>>>>> -static long vfio_unpin_pages(unsigned long pfn, long npage,
>>>>> -			     int prot, bool do_accounting)
>>>>> +static long __vfio_unpin_pages_remote(struct vfio_iommu *iommu,
>>>>> +				      unsigned long pfn, long npage, int prot,
>>>>> +				      bool do_accounting)
>>>>
>>>> Have you noticed that it's kind of confusing that
>>>> __vfio_{un}pin_pages_remote() uses current, which does a
>>>> get_user_pages_fast() while "local" uses a provided task_struct and
>>>> uses get_user_pages_*remote*()?  And also what was effectively local
>>>> (ie. we're pinning for our own use here) is now "remote" and pinning
>>>> for a remote, vendor driver consumer, is now "local".  It's not very
>>>> intuitive.
>>>>
> 
> I questioned this confusing naming in v8 too...
> 

I did try to address your concerns in v8.

>>>
>>> 'local' in local_domain was suggested to describe the domain for local
>>> page tracking. Earlier suggestions to have 'mdev' or 'noimmu' in this
>>> name were discarded. May be we should revisit what the name should be.
>>> Any suggestion?
>>>
>>> For local_domain, to pin pages, flow is:
>>>
>>> for local_domain
>>>     |- vfio_pin_pages()
>>>         |- vfio_iommu_type1_pin_pages()
>>>             |- __vfio_pin_page_local()
>>>                 |-  vaddr_get_pfn(task->mm)
>>>                     |- get_user_pages_remote()
>>>
>>> __vfio_pin_page_local() --> get_user_pages_remote()
>>
>>
>> In vfio.c we have the concept of an external user, perhaps that could
>> be continued here.  An mdev driver would be an external, or remote
>> pinning.
>>
> 
> I prefer to use remote here. It's aligned with underlying mm operations
> 

Using 'remote' in this case is also confusing since it is already used
in this file. I liked Alex's suggestion to use 'external' and I'll have
those changed in the next version of the patch set.

Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-26  7:53     ` Tian, Kevin
@ 2016-10-26 15:16       ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-26 15:16 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Song,
	Jike, bjsdjshi, linux-kernel

On Wed, 26 Oct 2016 07:53:43 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, October 20, 2016 5:03 AM  
> > > @@ -83,6 +92,21 @@ struct vfio_group {
> > >  };
> > >
> > >  /*
> > > + * Guest RAM pinning working set or DMA target
> > > + */
> > > +struct vfio_pfn {
> > > +	struct rb_node		node;
> > > +	unsigned long		vaddr;		/* virtual addr */
> > > +	dma_addr_t		iova;		/* IOVA */
> > > +	unsigned long		pfn;		/* Host pfn */
> > > +	int			prot;
> > > +	atomic_t		ref_count;
> > > +};  
> > 
> > Somehow we're going to need to fit an invalidation callback here too.
> > How would we handle a case where there are multiple mdev devices, from
> > different vendor drivers, that all have the same pfn pinned?  I'm
> > already concerned about the per pfn overhead we're introducing here so
> > clearly we cannot store an invalidation callback per pinned page, per
> > vendor driver.  Perhaps invalidations should be done using a notifier
> > chain per vfio_iommu, the vendor drivers are required to register on
> > that chain (fail pinning with empty notifier list) user unmapping
> > will be broadcast to the notifier chain, the vendor driver will be
> > responsible for deciding if each unmap is relevant to them (potentially
> > it's for a pinning from another driver).
> > 
> > I expect we also need to enforce that vendors perform a synchronous
> > unmap such that after returning from the notifier list call, the
> > vfio_pfn should no longer exist.  If it does we might need to BUG_ON.
> > Also be careful to pay attention to the locking of the notifier vs
> > unpin callbacks to avoid deadlocks.
> >   
> 
> What about just requesting vendor driver to provide a callback in parent 
> device ops?

How does the iommu driver get to the mdev vendor driver callback?  We
can also have pages pinned by multiple vendor drivers, I don't think
we want the additional overhead of a per page list of invalidation
callbacks.
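
To make the notifier-chain idea concrete, here is a very rough sketch of what
the vendor driver side could look like; the vfio_unmap_notify payload and
vfio_iommu_register_notifier() are made-up names for illustration, nothing
like this exists yet:

#include <linux/notifier.h>

/* hypothetical payload broadcast by the type1 backend on DMA unmap */
struct vfio_unmap_notify {
	dma_addr_t	iova;
	size_t		size;
};

static int my_unmap_notifier(struct notifier_block *nb,
			     unsigned long action, void *data)
{
	struct vfio_unmap_notify *unmap = data;

	/*
	 * Synchronously unpin whatever this vendor driver pinned inside
	 * [iova, iova + size); ranges pinned by other vendor drivers are
	 * simply ignored here.
	 */
	my_invalidate_range(unmap->iova, unmap->size);	/* hypothetical */
	return NOTIFY_OK;
}

static struct notifier_block my_unmap_nb = {
	.notifier_call = my_unmap_notifier,
};

/* registration would have to precede any pinning, e.g.:
 *	vfio_iommu_register_notifier(&mdev->dev, &my_unmap_nb);
 */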
 
> Curious in which scenario the user application (say Qemu here) may 
> unmap memory pages which are still pinned by vendor driver... Is it 
> purely about a corner case which we want to handle elegantly? 

The vfio type1 iommu API provides a MAP and UNMAP interface.  The unmap
call is expected to work regardless of how it might inhibit the device
from working.  This is currently true of iommu protected devices today,
a user can unmap pages which might be DMA targets for the device and
the iommu prevents further access to those pages, possibly at the
expense of device operation.  We cannot support an interface where a
user can unmap a set of pages and map in new pages to replace them when
the vendor driver might be caching stale mappings.

In normal VM operation perhaps this is a corner case, but the API is
not defined only for the normal and expected behavior of a VM.
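
For context, the userspace side of that contract is just the existing type1
unmap ioctl, which may legitimately arrive at any time; a minimal sketch
(needs <sys/ioctl.h> and <linux/vfio.h>; container_fd, iova and size are
assumed to already exist):

	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova  = iova,	/* start of a previously mapped range */
		.size  = size,
	};

	/*
	 * After this returns, the device must no longer be able to access
	 * [iova, iova + size), whether the mapping was iommu-backed or
	 * pinned on behalf of an mdev vendor driver.
	 */
	ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);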
 
> If yes, possibly a simpler way is to force destroying mdev instead of 
> asking vendor driver to take care of each invalidation request under
> such situation. Since anyway the mdev device won't be in an usable
> state anymore... (sorry if I missed the key problem here.)

That's a pretty harsh response for an operation which is completely
valid from an API perspective.  What if the VM does an unmap of all
memory around reset?  We cannot guarantee that the guest driver will
have a chance to do cleanup, the guest may have crashed or a
system_reset invoked.  Would you have the mdev destroyed in this case?
How could QEMU, which has no device specific driver to know that vendor
pinnings are present, recover from this?  Thanks,

Alex

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-26  7:54     ` Tian, Kevin
@ 2016-10-26 15:19       ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-26 15:19 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Song,
	Jike, bjsdjshi, linux-kernel

On Wed, 26 Oct 2016 07:54:56 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Tian, Kevin
> > Sent: Wednesday, October 26, 2016 3:54 PM
> >   
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, October 20, 2016 5:03 AM  
> > > > @@ -83,6 +92,21 @@ struct vfio_group {
> > > >  };
> > > >
> > > >  /*
> > > > + * Guest RAM pinning working set or DMA target
> > > > + */
> > > > +struct vfio_pfn {
> > > > +	struct rb_node		node;
> > > > +	unsigned long		vaddr;		/* virtual addr */
> > > > +	dma_addr_t		iova;		/* IOVA */
> > > > +	unsigned long		pfn;		/* Host pfn */
> > > > +	int			prot;
> > > > +	atomic_t		ref_count;
> > > > +};  
> > >
> > > Somehow we're going to need to fit an invalidation callback here too.
> > > How would we handle a case where there are multiple mdev devices, from
> > > different vendor drivers, that all have the same pfn pinned?  I'm
> > > already concerned about the per pfn overhead we're introducing here so
> > > clearly we cannot store an invalidation callback per pinned page, per
> > > vendor driver.  Perhaps invalidations should be done using a notifier
> > > chain per vfio_iommu, the vendor drivers are required to register on
> > > that chain (fail pinning with empty notifier list) user unmapping
> > > will be broadcast to the notifier chain, the vendor driver will be
> > > responsible for deciding if each unmap is relevant to them (potentially
> > > it's for a pinning from another driver).
> > >
> > > I expect we also need to enforce that vendors perform a synchronous
> > > unmap such that after returning from the notifier list call, the
> > > vfio_pfn should no longer exist.  If it does we might need to BUG_ON.
> > > Also be careful to pay attention to the locking of the notifier vs
> > > unpin callbacks to avoid deadlocks.
> > >  
> > 
> > What about just requesting vendor driver to provide a callback in parent
> > device ops?
> > 
> > Curious in which scenario the user application (say Qemu here) may
> > unmap memory pages which are still pinned by vendor driver... Is it
> > purely about a corner case which we want to handle elegantly?
> > 
> > If yes, possibly a simpler way is to force destroying mdev instead of
> > asking vendor driver to take care of each invalidation request under
> > such situation. Since anyway the mdev device won't be in an usable
> > state anymore... (sorry if I missed the key problem here.)
> >   
> 
> or calling reset callback of parent device driver, if we don't want to
> break libvirt's expectation by blindly removing mdev device...

I think we're going off into the weeds here.  mdev devices need to
honor the existing API, therefore an unmap should result in preventing
the device from further access to the unmapped pages, nothing more,
nothing less.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 01/12] vfio: Mediated device Core driver
  2016-10-20 17:12     ` Alex Williamson
  2016-10-21  2:41       ` Jike Song
@ 2016-10-27  5:56       ` Jike Song
  1 sibling, 0 replies; 73+ messages in thread
From: Jike Song @ 2016-10-27  5:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi, linux-kernel

On 10/21/2016 01:12 AM, Alex Williamson wrote:
> On Thu, 20 Oct 2016 15:23:53 +0800
> Jike Song <jike.song@intel.com> wrote:
> 
>> On 10/18/2016 05:22 AM, Kirti Wankhede wrote:
>>> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
>>> new file mode 100644
>>> index 000000000000..7db5ec164aeb
>>> --- /dev/null
>>> +++ b/drivers/vfio/mdev/mdev_core.c
>>> @@ -0,0 +1,372 @@
>>> +/*
>>> + * Mediated device Core Driver
>>> + *
>>> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
>>> + *     Author: Neo Jia <cjia@nvidia.com>
>>> + *	       Kirti Wankhede <kwankhede@nvidia.com>
>>> + *
>>> + * This program is free software; you can redistribute it and/or modify
>>> + * it under the terms of the GNU General Public License version 2 as
>>> + * published by the Free Software Foundation.
>>> + */
>>> +
>>> +#include <linux/module.h>
>>> +#include <linux/device.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/uuid.h>
>>> +#include <linux/sysfs.h>
>>> +#include <linux/mdev.h>
>>> +
>>> +#include "mdev_private.h"
>>> +
>>> +#define DRIVER_VERSION		"0.1"
>>> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
>>> +#define DRIVER_DESC		"Mediated device Core Driver"
>>> +
>>> +static LIST_HEAD(parent_list);
>>> +static DEFINE_MUTEX(parent_list_lock);
>>> +static struct class_compat *mdev_bus_compat_class;
>>> +  
>>
>>> +
>>> +/*
>>> + * mdev_register_device : Register a device
>>> + * @dev: device structure representing parent device.
>>> + * @ops: Parent device operation structure to be registered.
>>> + *
>>> + * Add device to list of registered parent devices.
>>> + * Returns a negative value on error, otherwise 0.
>>> + */
>>> +int mdev_register_device(struct device *dev, const struct parent_ops *ops)
>>> +{
>>> +	int ret = 0;
>>> +	struct parent_device *parent;
>>> +
>>> +	/* check for mandatory ops */
>>> +	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
>>> +		return -EINVAL;
>>> +
>>> +	dev = get_device(dev);
>>> +	if (!dev)
>>> +		return -EINVAL;
>>> +
>>> +	mutex_lock(&parent_list_lock);
>>> +
>>> +	/* Check for duplicate */
>>> +	parent = __find_parent_device(dev);
>>> +	if (parent) {
>>> +		ret = -EEXIST;
>>> +		goto add_dev_err;
>>> +	}
>>> +
>>> +	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
>>> +	if (!parent) {
>>> +		ret = -ENOMEM;
>>> +		goto add_dev_err;
>>> +	}
>>> +
>>> +	kref_init(&parent->ref);
>>> +
>>> +	parent->dev = dev;
>>> +	parent->ops = ops;
>>> +
>>> +	ret = parent_create_sysfs_files(parent);
>>> +	if (ret) {
>>> +		mutex_unlock(&parent_list_lock);
>>> +		mdev_put_parent(parent);
>>> +		return ret;
>>> +	}
>>> +
>>> +	ret = class_compat_create_link(mdev_bus_compat_class, dev, NULL);
>>> +	if (ret)
>>> +		dev_warn(dev, "Failed to create compatibility class link\n");
>>> +
>>> +	list_add(&parent->next, &parent_list);
>>> +	mutex_unlock(&parent_list_lock);
>>> +
>>> +	dev_info(dev, "MDEV: Registered\n");
>>> +	return 0;
>>> +
>>> +add_dev_err:
>>> +	mutex_unlock(&parent_list_lock);
>>> +	put_device(dev);
>>> +	return ret;
>>> +}
>>> +EXPORT_SYMBOL(mdev_register_device);  
>>
>>> +static int __init mdev_init(void)
>>> +{
>>> +	int ret;
>>> +
>>> +	ret = mdev_bus_register();
>>> +	if (ret) {
>>> +		pr_err("Failed to register mdev bus\n");
>>> +		return ret;
>>> +	}
>>> +
>>> +	mdev_bus_compat_class = class_compat_register("mdev_bus");
>>> +	if (!mdev_bus_compat_class) {
>>> +		mdev_bus_unregister();
>>> +		return -ENOMEM;
>>> +	}
>>> +
>>> +	/*
>>> +	 * Attempt to load known vfio_mdev.  This gives us a working environment
>>> +	 * without the user needing to explicitly load vfio_mdev driver.
>>> +	 */
>>> +	request_module_nowait("vfio_mdev");
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +static void __exit mdev_exit(void)
>>> +{
>>> +	class_compat_unregister(mdev_bus_compat_class);
>>> +	mdev_bus_unregister();
>>> +}
>>> +
>>> +module_init(mdev_init)
>>> +module_exit(mdev_exit)  
>>
>> Hi Kirti,
>>
>> There is a possible issue: mdev_bus_register is called from mdev_init,
>> a module_init, equal to device_initcall if builtin to vmlinux; however,
>> the vendor driver, say i915.ko for intel case, have to call
>> mdev_register_device from its module_init: at that time, mdev_init
>> is still not called.
>>
>> Not sure if this issue exists with nvidia.ko. Though in most cases we
>> are expecting users select mdev as a standalone module, we still won't
>> break builtin case.
>>
>>
>> Hi Alex, do you have any suggestion here?
> 
> To fully solve the problem of built-in drivers making use of the mdev
> infrastructure we'd need to make mdev itself builtin and possibly a
> subsystem that is initialized prior to device drivers.  Is that really
> necessary?  Even though i915.ko is often loaded as part of an
> initramfs, most systems still build it as a module.  I would expect
> that standard module dependencies will pull in the necessary mdev and
> vfio modules to make this work correctly.  I can't say that I'm
> prepared to make mdev be a subsystem as would be necessary for builtin
> drivers to make use of.

Hi Alex,

I'm sorry to say that my previous understanding was not fully correct. The
current combination of mdev and i915 is prone to panicking the system when
both are built into vmlinux.

mdev_init:

	mdev_bus_register();
	mdev_bus_compat_class = class_compat_register("mdev_bus");
	request_module_nowait("vfio_mdev");

mdev_register_device:

	class_compat_create_link(mdev_bus_compat_class, dev, NULL);


If both mdev and i915 are builtin, the class_compat_create_link() call
will simply panic the system. People with such a .config will be annoyed,
for example, when they try to bisect.

I'm not arguing that mdev should be a subsystem here, but there could still
be a way out: registering the compat class from mdev_register_device() if it
is not yet registered would help. It may not be the typical place to register
a class, but it fixes the problem with the least impact.
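
Something along these lines, as an untested sketch against mdev_core.c:

	/* in mdev_register_device(), under parent_list_lock and before the
	 * class_compat_create_link() call: register the compat class lazily
	 * so a builtin caller that runs before mdev_init() still works
	 */
	if (!mdev_bus_compat_class) {
		mdev_bus_compat_class = class_compat_register("mdev_bus");
		if (!mdev_bus_compat_class) {
			ret = -ENOMEM;
			goto add_dev_err;
		}
	}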

Looking forward to your opinion :)

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-17 21:22 ` [PATCH v9 04/12] vfio iommu: Add support for mediated devices Kirti Wankhede
  2016-10-19 21:02   ` Alex Williamson
  2016-10-21  7:49   ` Jike Song
@ 2016-10-27  7:20   ` Alexey Kardashevskiy
  2016-10-27 12:31     ` Kirti Wankhede
  2 siblings, 1 reply; 73+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-27  7:20 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, linux-kernel, kevin.tian, qemu-devel, bjsdjshi

On 18/10/16 08:22, Kirti Wankhede wrote:
> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
> Mediated device only uses IOMMU APIs, the underlying hardware can be
> managed by an IOMMU domain.
> 
> Aim of this change is:
> - To use most of the code of TYPE1 IOMMU driver for mediated devices
> - To support direct assigned device and mediated device in single module
> 
> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
> IOMMU module that supports pining and unpinning pages for mdev devices
> should provide these functions.
> Added APIs for pining and unpining pages to VFIO module. These calls back
> into backend iommu module to actually pin and unpin pages.
> 
> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
> backend module. More details:
> - When iommu_group of mediated devices is attached, task structure is
>   cached which is used later to pin pages and page accounting.


For SPAPR TCE IOMMU driver, I ended up caching mm_struct with
atomic_inc(&container->mm->mm_count) (patches are on the way) instead of
using @current or task as the process might be gone while VFIO container is
still alive and @mm might be needed to do proper cleanup; this might not be
an issue with this patchset now but still you seem to only use @mm from
task_struct.
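
For reference, the pattern is roughly this (a sketch against a
hypothetical container structure, not the actual SPAPR patches):

	/* at attach time, in the owning process context */
	container->mm = current->mm;
	atomic_inc(&container->mm->mm_count);	/* pin the mm_struct itself */

	/* at cleanup time, possibly long after the task has exited */
	if (container->mm) {
		/* undo mappings / locked_vm accounting against container->mm */
		mmdrop(container->mm);
		container->mm = NULL;
	}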



> - It keeps track of pinned pages for mediated domain. This data is used to
>   verify unpinning request and to unpin remaining pages while detaching, if
>   there are any.
> - Used existing mechanism for page accounting. If iommu capable domain
>   exist in the container then all pages are already pinned and accounted.
>   Accouting for mdev device is only done if there is no iommu capable
>   domain in the container.
> - Page accouting is updated on hot plug and unplug mdev device and pass
>   through device.
> 
> Tested by assigning below combinations of devices to a single VM:
> - GPU pass through only
> - vGPU device only
> - One GPU pass through and one vGPU device
> - Linux VM hot plug and unplug vGPU device while GPU pass through device
>   exist
> - Linux VM hot plug and unplug GPU pass through device while vGPU device
>   exist
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a


-- 
Alexey

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-27  7:20   ` [Qemu-devel] " Alexey Kardashevskiy
@ 2016-10-27 12:31     ` Kirti Wankhede
  2016-10-27 14:30       ` Alex Williamson
  2016-10-28  2:18       ` Alexey Kardashevskiy
  0 siblings, 2 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-27 12:31 UTC (permalink / raw)
  To: Alexey Kardashevskiy, alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, linux-kernel, kevin.tian, qemu-devel, bjsdjshi



On 10/27/2016 12:50 PM, Alexey Kardashevskiy wrote:
> On 18/10/16 08:22, Kirti Wankhede wrote:
>> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
>> Mediated device only uses IOMMU APIs, the underlying hardware can be
>> managed by an IOMMU domain.
>>
>> Aim of this change is:
>> - To use most of the code of TYPE1 IOMMU driver for mediated devices
>> - To support direct assigned device and mediated device in single module
>>
>> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
>> IOMMU module that supports pining and unpinning pages for mdev devices
>> should provide these functions.
>> Added APIs for pining and unpining pages to VFIO module. These calls back
>> into backend iommu module to actually pin and unpin pages.
>>
>> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
>> backend module. More details:
>> - When iommu_group of mediated devices is attached, task structure is
>>   cached which is used later to pin pages and page accounting.
> 
> 
> For SPAPR TCE IOMMU driver, I ended up caching mm_struct with
> atomic_inc(&container->mm->mm_count) (patches are on the way) instead of
> using @current or task as the process might be gone while VFIO container is
> still alive and @mm might be needed to do proper cleanup; this might not be
> an issue with this patchset now but still you seem to only use @mm from
> task_struct.
> 

Consider the example of a QEMU process which creates a VFIO container;
QEMU in its teardown path would release the container. How could the
container be alive when the process is gone?

Kirti

> 
> 
>> - It keeps track of pinned pages for mediated domain. This data is used to
>>   verify unpinning request and to unpin remaining pages while detaching, if
>>   there are any.
>> - Used existing mechanism for page accounting. If iommu capable domain
>>   exist in the container then all pages are already pinned and accounted.
>>   Accouting for mdev device is only done if there is no iommu capable
>>   domain in the container.
>> - Page accouting is updated on hot plug and unplug mdev device and pass
>>   through device.
>>
>> Tested by assigning below combinations of devices to a single VM:
>> - GPU pass through only
>> - vGPU device only
>> - One GPU pass through and one vGPU device
>> - Linux VM hot plug and unplug vGPU device while GPU pass through device
>>   exist
>> - Linux VM hot plug and unplug GPU pass through device while vGPU device
>>   exist
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Signed-off-by: Neo Jia <cjia@nvidia.com>
>> Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
> 
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-27 12:31     ` Kirti Wankhede
@ 2016-10-27 14:30       ` Alex Williamson
  2016-10-27 15:59         ` Kirti Wankhede
  2016-10-28  2:18       ` Alexey Kardashevskiy
  1 sibling, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2016-10-27 14:30 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Alexey Kardashevskiy, pbonzini, kraxel, cjia, jike.song, kvm,
	linux-kernel, kevin.tian, qemu-devel, bjsdjshi

On Thu, 27 Oct 2016 18:01:51 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/27/2016 12:50 PM, Alexey Kardashevskiy wrote:
> > On 18/10/16 08:22, Kirti Wankhede wrote:  
> >> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
> >> Mediated device only uses IOMMU APIs, the underlying hardware can be
> >> managed by an IOMMU domain.
> >>
> >> Aim of this change is:
> >> - To use most of the code of TYPE1 IOMMU driver for mediated devices
> >> - To support direct assigned device and mediated device in single module
> >>
> >> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
> >> IOMMU module that supports pining and unpinning pages for mdev devices
> >> should provide these functions.
> >> Added APIs for pining and unpining pages to VFIO module. These calls back
> >> into backend iommu module to actually pin and unpin pages.
> >>
> >> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
> >> backend module. More details:
> >> - When iommu_group of mediated devices is attached, task structure is
> >>   cached which is used later to pin pages and page accounting.  
> > 
> > 
> > For SPAPR TCE IOMMU driver, I ended up caching mm_struct with
> > atomic_inc(&container->mm->mm_count) (patches are on the way) instead of
> > using @current or task as the process might be gone while VFIO container is
> > still alive and @mm might be needed to do proper cleanup; this might not be
> > an issue with this patchset now but still you seem to only use @mm from
> > task_struct.
> >   
> 
> Consider the example of QEMU process which creates VFIO container, QEMU
> in its teardown path would release the container. How could container be
> alive when process is gone?

If QEMU is sent a SIGKILL, does the process still exist?  We must be
able to perform cleanup regardless of the state, or existence, of the
task that created it.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-27 14:30       ` Alex Williamson
@ 2016-10-27 15:59         ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-27 15:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Alexey Kardashevskiy, pbonzini, kraxel, cjia, jike.song, kvm,
	linux-kernel, kevin.tian, qemu-devel, bjsdjshi



On 10/27/2016 8:00 PM, Alex Williamson wrote:
> On Thu, 27 Oct 2016 18:01:51 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 10/27/2016 12:50 PM, Alexey Kardashevskiy wrote:
>>> On 18/10/16 08:22, Kirti Wankhede wrote:  
>>>> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
>>>> Mediated device only uses IOMMU APIs, the underlying hardware can be
>>>> managed by an IOMMU domain.
>>>>
>>>> Aim of this change is:
>>>> - To use most of the code of TYPE1 IOMMU driver for mediated devices
>>>> - To support direct assigned device and mediated device in single module
>>>>
>>>> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
>>>> IOMMU module that supports pining and unpinning pages for mdev devices
>>>> should provide these functions.
>>>> Added APIs for pining and unpining pages to VFIO module. These calls back
>>>> into backend iommu module to actually pin and unpin pages.
>>>>
>>>> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
>>>> backend module. More details:
>>>> - When iommu_group of mediated devices is attached, task structure is
>>>>   cached which is used later to pin pages and page accounting.  
>>>
>>>
>>> For SPAPR TCE IOMMU driver, I ended up caching mm_struct with
>>> atomic_inc(&container->mm->mm_count) (patches are on the way) instead of
>>> using @current or task as the process might be gone while VFIO container is
>>> still alive and @mm might be needed to do proper cleanup; this might not be
>>> an issue with this patchset now but still you seem to only use @mm from
>>> task_struct.
>>>   
>>
>> Consider the example of QEMU process which creates VFIO container, QEMU
>> in its teardown path would release the container. How could container be
>> alive when process is gone?
> 
> If QEMU is sent a SIGKILL, does the process still exist?  We must be
> able to perform cleanup regardless of the state, or existence, of the
> task that created it.  Thanks,
> 

The kernel closes all open file descriptors when a process is
terminated, so .release() from struct vfio_iommu_driver_ops gets called
on SIGKILL or SIGTERM, and the release() function does all the cleanup.

Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-27 12:31     ` Kirti Wankhede
  2016-10-27 14:30       ` Alex Williamson
@ 2016-10-28  2:18       ` Alexey Kardashevskiy
  2016-11-01 14:01         ` Kirti Wankhede
  1 sibling, 1 reply; 73+ messages in thread
From: Alexey Kardashevskiy @ 2016-10-28  2:18 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, linux-kernel, kevin.tian, qemu-devel, bjsdjshi

On 27/10/16 23:31, Kirti Wankhede wrote:
> 
> 
> On 10/27/2016 12:50 PM, Alexey Kardashevskiy wrote:
>> On 18/10/16 08:22, Kirti Wankhede wrote:
>>> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
>>> Mediated device only uses IOMMU APIs, the underlying hardware can be
>>> managed by an IOMMU domain.
>>>
>>> Aim of this change is:
>>> - To use most of the code of TYPE1 IOMMU driver for mediated devices
>>> - To support direct assigned device and mediated device in single module
>>>
>>> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
>>> IOMMU module that supports pining and unpinning pages for mdev devices
>>> should provide these functions.
>>> Added APIs for pining and unpining pages to VFIO module. These calls back
>>> into backend iommu module to actually pin and unpin pages.
>>>
>>> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
>>> backend module. More details:
>>> - When iommu_group of mediated devices is attached, task structure is
>>>   cached which is used later to pin pages and page accounting.
>>
>>
>> For SPAPR TCE IOMMU driver, I ended up caching mm_struct with
>> atomic_inc(&container->mm->mm_count) (patches are on the way) instead of
>> using @current or task as the process might be gone while VFIO container is
>> still alive and @mm might be needed to do proper cleanup; this might not be
>> an issue with this patchset now but still you seem to only use @mm from
>> task_struct.
>>
> 
> Consider the example of QEMU process which creates VFIO container, QEMU
> in its teardown path would release the container. How could container be
> alive when process is gone?

do_exit() in kernel/exit.c calls exit_mm() (which sets tsk->mm to NULL)
first, and then releases open files by calling exit_files(). So the
container's release() does not have current->mm.
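
Paraphrasing the relevant ordering (simplified excerpt, not the literal
code):

	/* kernel/exit.c: do_exit() */
	exit_mm(tsk);		/* tsk->mm = NULL; mmput() the address space */
	...
	exit_files(tsk);	/* drop the open fds, which eventually drives
				 * the container's release() -- current->mm
				 * is already NULL by then */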



> 
> Kirti
> 
>>
>>
>>> - It keeps track of pinned pages for mediated domain. This data is used to
>>>   verify unpinning request and to unpin remaining pages while detaching, if
>>>   there are any.
>>> - Used existing mechanism for page accounting. If iommu capable domain
>>>   exist in the container then all pages are already pinned and accounted.
>>>   Accouting for mdev device is only done if there is no iommu capable
>>>   domain in the container.
>>> - Page accouting is updated on hot plug and unplug mdev device and pass
>>>   through device.
>>>
>>> Tested by assigning below combinations of devices to a single VM:
>>> - GPU pass through only
>>> - vGPU device only
>>> - One GPU pass through and one vGPU device
>>> - Linux VM hot plug and unplug vGPU device while GPU pass through device
>>>   exist
>>> - Linux VM hot plug and unplug GPU pass through device while vGPU device
>>>   exist
>>>
>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>> Signed-off-by: Neo Jia <cjia@nvidia.com>
>>> Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
>>
>>


-- 
Alexey

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-10-28  2:18       ` Alexey Kardashevskiy
@ 2016-11-01 14:01         ` Kirti Wankhede
  2016-11-02  1:24           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-11-01 14:01 UTC (permalink / raw)
  To: Alexey Kardashevskiy, alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, linux-kernel, kevin.tian, qemu-devel, bjsdjshi



On 10/28/2016 7:48 AM, Alexey Kardashevskiy wrote:
> On 27/10/16 23:31, Kirti Wankhede wrote:
>>
>>
>> On 10/27/2016 12:50 PM, Alexey Kardashevskiy wrote:
>>> On 18/10/16 08:22, Kirti Wankhede wrote:
>>>> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
>>>> Mediated device only uses IOMMU APIs, the underlying hardware can be
>>>> managed by an IOMMU domain.
>>>>
>>>> Aim of this change is:
>>>> - To use most of the code of TYPE1 IOMMU driver for mediated devices
>>>> - To support direct assigned device and mediated device in single module
>>>>
>>>> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
>>>> IOMMU module that supports pining and unpinning pages for mdev devices
>>>> should provide these functions.
>>>> Added APIs for pining and unpining pages to VFIO module. These calls back
>>>> into backend iommu module to actually pin and unpin pages.
>>>>
>>>> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
>>>> backend module. More details:
>>>> - When iommu_group of mediated devices is attached, task structure is
>>>>   cached which is used later to pin pages and page accounting.
>>>
>>>
>>> For SPAPR TCE IOMMU driver, I ended up caching mm_struct with
>>> atomic_inc(&container->mm->mm_count) (patches are on the way) instead of
>>> using @current or task as the process might be gone while VFIO container is
>>> still alive and @mm might be needed to do proper cleanup; this might not be
>>> an issue with this patchset now but still you seem to only use @mm from
>>> task_struct.
>>>
>>
>> Consider the example of QEMU process which creates VFIO container, QEMU
>> in its teardown path would release the container. How could container be
>> alive when process is gone?
> 
> do_exit() in kernel/exit.c calls exit_mm() (which sets NULL to tsk->mm)
> first, and then releases open files by calling  exit_files(). So
> container's release() does not have current->mm.
> 

Incrementing the usage count (get_task_struct()) while saving the task
structure, and decrementing it (put_task_struct()) from release(),
should work here. Updating the patch.
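
A minimal sketch of that idea, with a hypothetical field to hold the
cached task:

	/* when the iommu_group is attached, in the caller's context */
	domain->task = current;
	get_task_struct(current);	/* keep the task_struct allocation alive */

	/* in the teardown / release path */
	if (domain->task) {
		/* pid_alive()/get_task_mm() can be used here if mm is needed */
		put_task_struct(domain->task);
		domain->task = NULL;
	}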

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-11-01 14:01         ` Kirti Wankhede
@ 2016-11-02  1:24           ` Alexey Kardashevskiy
  2016-11-02  3:29             ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Alexey Kardashevskiy @ 2016-11-02  1:24 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, linux-kernel, kevin.tian, qemu-devel, bjsdjshi

On 02/11/16 01:01, Kirti Wankhede wrote:
> 
> 
> On 10/28/2016 7:48 AM, Alexey Kardashevskiy wrote:
>> On 27/10/16 23:31, Kirti Wankhede wrote:
>>>
>>>
>>> On 10/27/2016 12:50 PM, Alexey Kardashevskiy wrote:
>>>> On 18/10/16 08:22, Kirti Wankhede wrote:
>>>>> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
>>>>> Mediated device only uses IOMMU APIs, the underlying hardware can be
>>>>> managed by an IOMMU domain.
>>>>>
>>>>> Aim of this change is:
>>>>> - To use most of the code of TYPE1 IOMMU driver for mediated devices
>>>>> - To support direct assigned device and mediated device in single module
>>>>>
>>>>> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
>>>>> IOMMU module that supports pining and unpinning pages for mdev devices
>>>>> should provide these functions.
>>>>> Added APIs for pining and unpining pages to VFIO module. These calls back
>>>>> into backend iommu module to actually pin and unpin pages.
>>>>>
>>>>> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
>>>>> backend module. More details:
>>>>> - When iommu_group of mediated devices is attached, task structure is
>>>>>   cached which is used later to pin pages and page accounting.
>>>>
>>>>
>>>> For SPAPR TCE IOMMU driver, I ended up caching mm_struct with
>>>> atomic_inc(&container->mm->mm_count) (patches are on the way) instead of
>>>> using @current or task as the process might be gone while VFIO container is
>>>> still alive and @mm might be needed to do proper cleanup; this might not be
>>>> an issue with this patchset now but still you seem to only use @mm from
>>>> task_struct.
>>>>
>>>
>>> Consider the example of QEMU process which creates VFIO container, QEMU
>>> in its teardown path would release the container. How could container be
>>> alive when process is gone?
>>
>> do_exit() in kernel/exit.c calls exit_mm() (which sets NULL to tsk->mm)
>> first, and then releases open files by calling  exit_files(). So
>> container's release() does not have current->mm.
>>
> 
> Incrementing usage count (get_task_struct()) while saving task structure
> and decementing it (put_task_struct()) from release() should  work here.
> Updating the patch.

I cannot see how the task->usage counter prevents do_exit() from performing
the exit, can you?



-- 
Alexey

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-11-02  1:24           ` Alexey Kardashevskiy
@ 2016-11-02  3:29             ` Kirti Wankhede
  2016-11-02  4:09               ` Alexey Kardashevskiy
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-11-02  3:29 UTC (permalink / raw)
  To: Alexey Kardashevskiy, alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, linux-kernel, kevin.tian, qemu-devel, bjsdjshi



On 11/2/2016 6:54 AM, Alexey Kardashevskiy wrote:
> On 02/11/16 01:01, Kirti Wankhede wrote:
>>
>>
>> On 10/28/2016 7:48 AM, Alexey Kardashevskiy wrote:
>>> On 27/10/16 23:31, Kirti Wankhede wrote:
>>>>
>>>>
>>>> On 10/27/2016 12:50 PM, Alexey Kardashevskiy wrote:
>>>>> On 18/10/16 08:22, Kirti Wankhede wrote:
>>>>>> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
>>>>>> Mediated device only uses IOMMU APIs, the underlying hardware can be
>>>>>> managed by an IOMMU domain.
>>>>>>
>>>>>> Aim of this change is:
>>>>>> - To use most of the code of TYPE1 IOMMU driver for mediated devices
>>>>>> - To support direct assigned device and mediated device in single module
>>>>>>
>>>>>> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
>>>>>> IOMMU module that supports pining and unpinning pages for mdev devices
>>>>>> should provide these functions.
>>>>>> Added APIs for pining and unpining pages to VFIO module. These calls back
>>>>>> into backend iommu module to actually pin and unpin pages.
>>>>>>
>>>>>> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
>>>>>> backend module. More details:
>>>>>> - When iommu_group of mediated devices is attached, task structure is
>>>>>>   cached which is used later to pin pages and page accounting.
>>>>>
>>>>>
>>>>> For SPAPR TCE IOMMU driver, I ended up caching mm_struct with
>>>>> atomic_inc(&container->mm->mm_count) (patches are on the way) instead of
>>>>> using @current or task as the process might be gone while VFIO container is
>>>>> still alive and @mm might be needed to do proper cleanup; this might not be
>>>>> an issue with this patchset now but still you seem to only use @mm from
>>>>> task_struct.
>>>>>
>>>>
>>>> Consider the example of QEMU process which creates VFIO container, QEMU
>>>> in its teardown path would release the container. How could container be
>>>> alive when process is gone?
>>>
>>> do_exit() in kernel/exit.c calls exit_mm() (which sets NULL to tsk->mm)
>>> first, and then releases open files by calling  exit_files(). So
>>> container's release() does not have current->mm.
>>>
>>
>> Incrementing usage count (get_task_struct()) while saving task structure
>> and decementing it (put_task_struct()) from release() should  work here.
>> Updating the patch.
> 
> I cannot see how the task->usage counter prevents do_exit() from performing
> the exit, can you?
> 

It will not prevent do_exit() from running, but it will make sure that
we don't have a stale pointer to the task structure. Then we can check
whether the task is alive and get the mm pointer in the teardown path
as below:

{
        struct task_struct *task = domain->external_addr_space->task;
        struct mm_struct *mm = NULL;

        put_pfn(pfn, prot);

        if (pid_alive(task))
                mm = get_task_mm(task);

        if (mm) {
                if (do_accounting)
                        vfio_lock_acct(task, -1);

                mmput(mm);
        }
}

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-11-02  3:29             ` Kirti Wankhede
@ 2016-11-02  4:09               ` Alexey Kardashevskiy
  2016-11-02 12:21                 ` Jike Song
  0 siblings, 1 reply; 73+ messages in thread
From: Alexey Kardashevskiy @ 2016-11-02  4:09 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, linux-kernel, kevin.tian, qemu-devel, bjsdjshi

On 02/11/16 14:29, Kirti Wankhede wrote:
> 
> 
> On 11/2/2016 6:54 AM, Alexey Kardashevskiy wrote:
>> On 02/11/16 01:01, Kirti Wankhede wrote:
>>>
>>>
>>> On 10/28/2016 7:48 AM, Alexey Kardashevskiy wrote:
>>>> On 27/10/16 23:31, Kirti Wankhede wrote:
>>>>>
>>>>>
>>>>> On 10/27/2016 12:50 PM, Alexey Kardashevskiy wrote:
>>>>>> On 18/10/16 08:22, Kirti Wankhede wrote:
>>>>>>> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
>>>>>>> Mediated device only uses IOMMU APIs, the underlying hardware can be
>>>>>>> managed by an IOMMU domain.
>>>>>>>
>>>>>>> Aim of this change is:
>>>>>>> - To use most of the code of TYPE1 IOMMU driver for mediated devices
>>>>>>> - To support direct assigned device and mediated device in single module
>>>>>>>
>>>>>>> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
>>>>>>> IOMMU module that supports pining and unpinning pages for mdev devices
>>>>>>> should provide these functions.
>>>>>>> Added APIs for pining and unpining pages to VFIO module. These calls back
>>>>>>> into backend iommu module to actually pin and unpin pages.
>>>>>>>
>>>>>>> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
>>>>>>> backend module. More details:
>>>>>>> - When iommu_group of mediated devices is attached, task structure is
>>>>>>>   cached which is used later to pin pages and page accounting.
>>>>>>
>>>>>>
>>>>>> For SPAPR TCE IOMMU driver, I ended up caching mm_struct with
>>>>>> atomic_inc(&container->mm->mm_count) (patches are on the way) instead of
>>>>>> using @current or task as the process might be gone while VFIO container is
>>>>>> still alive and @mm might be needed to do proper cleanup; this might not be
>>>>>> an issue with this patchset now but still you seem to only use @mm from
>>>>>> task_struct.
>>>>>>
>>>>>
>>>>> Consider the example of QEMU process which creates VFIO container, QEMU
>>>>> in its teardown path would release the container. How could container be
>>>>> alive when process is gone?
>>>>
>>>> do_exit() in kernel/exit.c calls exit_mm() (which sets NULL to tsk->mm)
>>>> first, and then releases open files by calling  exit_files(). So
>>>> container's release() does not have current->mm.
>>>>
>>>
>>> Incrementing usage count (get_task_struct()) while saving task structure
>>> and decementing it (put_task_struct()) from release() should  work here.
>>> Updating the patch.
>>
>> I cannot see how the task->usage counter prevents do_exit() from performing
>> the exit, can you?
>>
> 
> It will not prevent exit from do_exit(), but that will make sure that we
> don't have stale pointer of task structure. Then we can check whether
> the task is alive and get mm pointer in teardown path as below:


Or you could just reference and use @mm as KVM and others do. Or is
there anything else you need from @current besides @mm?


> 
> {
>         struct task_struct *task = domain->external_addr_space->task;
>         struct mm_struct *mm = NULL;
> 
>         put_pfn(pfn, prot);
> 
>         if (pid_alive(task))
>                 mm = get_task_mm(task);
> 
>         if (mm) {
>                 if (do_accounting)
>                         vfio_lock_acct(task, -1);
> 
>                 mmput(mm);
>         }
> }



-- 
Alexey

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-11-02  4:09               ` Alexey Kardashevskiy
@ 2016-11-02 12:21                 ` Jike Song
  2016-11-02 12:41                   ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Jike Song @ 2016-11-02 12:21 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia, kvm,
	linux-kernel, kevin.tian, qemu-devel, bjsdjshi

On 11/02/2016 12:09 PM, Alexey Kardashevskiy wrote:
> On 02/11/16 14:29, Kirti Wankhede wrote:
>>
>>
>> On 11/2/2016 6:54 AM, Alexey Kardashevskiy wrote:
>>> On 02/11/16 01:01, Kirti Wankhede wrote:
>>>>
>>>>
>>>> On 10/28/2016 7:48 AM, Alexey Kardashevskiy wrote:
>>>>> On 27/10/16 23:31, Kirti Wankhede wrote:
>>>>>>
>>>>>>
>>>>>> On 10/27/2016 12:50 PM, Alexey Kardashevskiy wrote:
>>>>>>> On 18/10/16 08:22, Kirti Wankhede wrote:
>>>>>>>> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
>>>>>>>> Mediated device only uses IOMMU APIs, the underlying hardware can be
>>>>>>>> managed by an IOMMU domain.
>>>>>>>>
>>>>>>>> Aim of this change is:
>>>>>>>> - To use most of the code of TYPE1 IOMMU driver for mediated devices
>>>>>>>> - To support direct assigned device and mediated device in single module
>>>>>>>>
>>>>>>>> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
>>>>>>>> IOMMU module that supports pining and unpinning pages for mdev devices
>>>>>>>> should provide these functions.
>>>>>>>> Added APIs for pining and unpining pages to VFIO module. These calls back
>>>>>>>> into backend iommu module to actually pin and unpin pages.
>>>>>>>>
>>>>>>>> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
>>>>>>>> backend module. More details:
>>>>>>>> - When iommu_group of mediated devices is attached, task structure is
>>>>>>>>   cached which is used later to pin pages and page accounting.
>>>>>>>
>>>>>>>
>>>>>>> For SPAPR TCE IOMMU driver, I ended up caching mm_struct with
>>>>>>> atomic_inc(&container->mm->mm_count) (patches are on the way) instead of
>>>>>>> using @current or task as the process might be gone while VFIO container is
>>>>>>> still alive and @mm might be needed to do proper cleanup; this might not be
>>>>>>> an issue with this patchset now but still you seem to only use @mm from
>>>>>>> task_struct.
>>>>>>>
>>>>>>
>>>>>> Consider the example of QEMU process which creates VFIO container, QEMU
>>>>>> in its teardown path would release the container. How could container be
>>>>>> alive when process is gone?
>>>>>
>>>>> do_exit() in kernel/exit.c calls exit_mm() (which sets NULL to tsk->mm)
>>>>> first, and then releases open files by calling  exit_files(). So
>>>>> container's release() does not have current->mm.
>>>>>
>>>>
>>>> Incrementing usage count (get_task_struct()) while saving task structure
>>>> and decementing it (put_task_struct()) from release() should  work here.
>>>> Updating the patch.
>>>
>>> I cannot see how the task->usage counter prevents do_exit() from performing
>>> the exit, can you?
>>>
>>
>> It will not prevent exit from do_exit(), but that will make sure that we
>> don't have stale pointer of task structure. Then we can check whether
>> the task is alive and get mm pointer in teardown path as below:
> 
> 
> Or you could just reference and use @mm as KVM and others do. Or there is
> anything else you need from @current than just @mm?
> 

I agree. If @mm is the only thing needed, there is really no reason to
refer to the @task :-)

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-11-02 12:21                 ` Jike Song
@ 2016-11-02 12:41                   ` Kirti Wankhede
  2016-11-02 13:00                     ` Jike Song
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-11-02 12:41 UTC (permalink / raw)
  To: Jike Song, Alexey Kardashevskiy
  Cc: alex.williamson, pbonzini, kraxel, cjia, kvm, linux-kernel,
	kevin.tian, qemu-devel, bjsdjshi



On 11/2/2016 5:51 PM, Jike Song wrote:
> On 11/02/2016 12:09 PM, Alexey Kardashevskiy wrote:
>> On 02/11/16 14:29, Kirti Wankhede wrote:
>>>
>>>
>>> On 11/2/2016 6:54 AM, Alexey Kardashevskiy wrote:
>>>> On 02/11/16 01:01, Kirti Wankhede wrote:
>>>>>
>>>>>
>>>>> On 10/28/2016 7:48 AM, Alexey Kardashevskiy wrote:
>>>>>> On 27/10/16 23:31, Kirti Wankhede wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 10/27/2016 12:50 PM, Alexey Kardashevskiy wrote:
>>>>>>>> On 18/10/16 08:22, Kirti Wankhede wrote:
>>>>>>>>> VFIO IOMMU drivers are designed for the devices which are IOMMU capable.
>>>>>>>>> Mediated device only uses IOMMU APIs, the underlying hardware can be
>>>>>>>>> managed by an IOMMU domain.
>>>>>>>>>
>>>>>>>>> Aim of this change is:
>>>>>>>>> - To use most of the code of TYPE1 IOMMU driver for mediated devices
>>>>>>>>> - To support direct assigned device and mediated device in single module
>>>>>>>>>
>>>>>>>>> Added two new callback functions to struct vfio_iommu_driver_ops. Backend
>>>>>>>>> IOMMU module that supports pining and unpinning pages for mdev devices
>>>>>>>>> should provide these functions.
>>>>>>>>> Added APIs for pining and unpining pages to VFIO module. These calls back
>>>>>>>>> into backend iommu module to actually pin and unpin pages.
>>>>>>>>>
>>>>>>>>> This change adds pin and unpin support for mediated device to TYPE1 IOMMU
>>>>>>>>> backend module. More details:
>>>>>>>>> - When iommu_group of mediated devices is attached, task structure is
>>>>>>>>>   cached which is used later to pin pages and page accounting.
>>>>>>>>
>>>>>>>>
>>>>>>>> For SPAPR TCE IOMMU driver, I ended up caching mm_struct with
>>>>>>>> atomic_inc(&container->mm->mm_count) (patches are on the way) instead of
>>>>>>>> using @current or task as the process might be gone while VFIO container is
>>>>>>>> still alive and @mm might be needed to do proper cleanup; this might not be
>>>>>>>> an issue with this patchset now but still you seem to only use @mm from
>>>>>>>> task_struct.
>>>>>>>>
>>>>>>>
>>>>>>> Consider the example of QEMU process which creates VFIO container, QEMU
>>>>>>> in its teardown path would release the container. How could container be
>>>>>>> alive when process is gone?
>>>>>>
>>>>>> do_exit() in kernel/exit.c calls exit_mm() (which sets NULL to tsk->mm)
>>>>>> first, and then releases open files by calling  exit_files(). So
>>>>>> container's release() does not have current->mm.
>>>>>>
>>>>>
>>>>> Incrementing usage count (get_task_struct()) while saving task structure
>>>>> and decementing it (put_task_struct()) from release() should  work here.
>>>>> Updating the patch.
>>>>
>>>> I cannot see how the task->usage counter prevents do_exit() from performing
>>>> the exit, can you?
>>>>
>>>
>>> It will not prevent exit from do_exit(), but that will make sure that we
>>> don't have stale pointer of task structure. Then we can check whether
>>> the task is alive and get mm pointer in teardown path as below:
>>
>>
>> Or you could just reference and use @mm as KVM and others do. Or there is
>> anything else you need from @current than just @mm?
>>
> 
> I agree. If @mm is the only thing needed, there is really no reason to
> refer to the @task :-)
> 

In vfio_lock_acct(), which does the page accounting, if mm->mmap_sem is
already held then the accounting is deferred to a work item; the task
structure is used there to get the mm, and the work is queued only if
the mm exists:
	mm = get_task_mm(task);

That is where this module needs the task structure.
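
To make that concrete, a rough sketch of the deferred-accounting
pattern being described (names and structure are illustrative, not the
exact code in this series):

	struct vwork {
		struct mm_struct	*mm;
		long			npage;
		struct work_struct	work;
	};

	/* runs later from a workqueue when mmap_sem could not be taken */
	static void vfio_lock_acct_bg(struct work_struct *work)
	{
		struct vwork *vwork = container_of(work, struct vwork, work);
		struct mm_struct *mm = vwork->mm;

		down_write(&mm->mmap_sem);
		mm->locked_vm += vwork->npage;
		up_write(&mm->mmap_sem);
		mmput(mm);
		kfree(vwork);
	}

	static void vfio_lock_acct(struct task_struct *task, long npage)
	{
		struct vwork *vwork;
		struct mm_struct *mm = get_task_mm(task);

		if (!mm)
			return;		/* task has exited, nothing to account */

		if (down_write_trylock(&mm->mmap_sem)) {
			mm->locked_vm += npage;
			up_write(&mm->mmap_sem);
			mmput(mm);
			return;
		}

		/* mmap_sem is contended: defer the accounting, keep the mm ref */
		vwork = kmalloc(sizeof(*vwork), GFP_KERNEL);
		if (!vwork) {
			mmput(mm);
			return;
		}
		vwork->mm = mm;
		vwork->npage = npage;
		INIT_WORK(&vwork->work, vfio_lock_acct_bg);
		schedule_work(&vwork->work);
	}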

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-11-02 12:41                   ` Kirti Wankhede
@ 2016-11-02 13:00                     ` Jike Song
  2016-11-02 13:18                       ` Kirti Wankhede
  0 siblings, 1 reply; 73+ messages in thread
From: Jike Song @ 2016-11-02 13:00 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Alexey Kardashevskiy, alex.williamson, pbonzini, kraxel, cjia,
	kvm, linux-kernel, kevin.tian, qemu-devel, bjsdjshi

On 11/02/2016 08:41 PM, Kirti Wankhede wrote:
> On 11/2/2016 5:51 PM, Jike Song wrote:
>> On 11/02/2016 12:09 PM, Alexey Kardashevskiy wrote:
>>> Or you could just reference and use @mm as KVM and others do. Or there is
>>> anything else you need from @current than just @mm?
>>>
>>
>> I agree. If @mm is the only thing needed, there is really no reason to
>> refer to the @task :-)
>>
> 
> In vfio_lock_acct(), that is for page accounting, if mm->mmap_sem is
> already held then page accounting is deferred, where task structure is
> used to get mm and work is deferred only if mm exist:
> 	mm = get_task_mm(task);
> 
> That is where this module need task structure.

Kirti,

By calling get_task_mm() you hold a ref on @mm and can save it in the
iommu; whenever you want to do something like vfio_lock_acct(), use
that mm (as you said, if mmap_sem is not accessible then defer the
work, but @mm is still all the information you need), and put it after
the usage.
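
i.e., roughly (a sketch; the iommu->mm field is hypothetical):

	/* once, at attach time, from the owning process */
	iommu->mm = get_task_mm(current);	/* reference via mm_users */

	/* wherever accounting is needed, regardless of calling context */
	if (iommu->mm) {
		down_write(&iommu->mm->mmap_sem);
		iommu->mm->locked_vm += npage;
		up_write(&iommu->mm->mmap_sem);
	}

	/* once, in release() */
	if (iommu->mm)
		mmput(iommu->mm);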

I still can't see any reason why the @task has to be saved; it's
always the @mm that is actually used. Did I miss anything?

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-11-02 13:00                     ` Jike Song
@ 2016-11-02 13:18                       ` Kirti Wankhede
  2016-11-02 13:35                         ` Jike Song
  2016-11-03  4:29                         ` Alexey Kardashevskiy
  0 siblings, 2 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-11-02 13:18 UTC (permalink / raw)
  To: Jike Song
  Cc: Alexey Kardashevskiy, alex.williamson, pbonzini, kraxel, cjia,
	kvm, linux-kernel, kevin.tian, qemu-devel, bjsdjshi



On 11/2/2016 6:30 PM, Jike Song wrote:
> On 11/02/2016 08:41 PM, Kirti Wankhede wrote:
>> On 11/2/2016 5:51 PM, Jike Song wrote:
>>> On 11/02/2016 12:09 PM, Alexey Kardashevskiy wrote:
>>>> Or you could just reference and use @mm as KVM and others do. Or there is
>>>> anything else you need from @current than just @mm?
>>>>
>>>
>>> I agree. If @mm is the only thing needed, there is really no reason to
>>> refer to the @task :-)
>>>
>>
>> In vfio_lock_acct(), that is for page accounting, if mm->mmap_sem is
>> already held then page accounting is deferred, where task structure is
>> used to get mm and work is deferred only if mm exist:
>> 	mm = get_task_mm(task);
>>
>> That is where this module need task structure.
> 
> Kirti,
> 
> By calling get_task_mm you hold a ref on @mm and save it in iommu,
> whenever you want to do something like vfio_lock_acct(), use that mm
> (as you said, if mmap_sem not accessible then defer it to a work, but
> still @mm is the whole information), and put it after the usage.
> 
> I still can't see any reason that the @task have to be saved. It's
> always the @mm all the time. Did I miss anything?
> 

If the process is terminated by SIGKILL, as Alexey mentioned earlier in
this mail thread, exit_mm() is called first and then all files are
closed. In exit_mm(), task->mm is set to NULL. So from the teardown
path we should call get_task_mm(task) to get the current state instead
of using a stale pointer.

Thanks,
Kirti.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-11-02 13:18                       ` Kirti Wankhede
@ 2016-11-02 13:35                         ` Jike Song
  2016-11-03  4:29                         ` Alexey Kardashevskiy
  1 sibling, 0 replies; 73+ messages in thread
From: Jike Song @ 2016-11-02 13:35 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Alexey Kardashevskiy, alex.williamson, pbonzini, kraxel, cjia,
	kvm, linux-kernel, kevin.tian, qemu-devel, bjsdjshi

On 11/02/2016 09:18 PM, Kirti Wankhede wrote:
> On 11/2/2016 6:30 PM, Jike Song wrote:
>> On 11/02/2016 08:41 PM, Kirti Wankhede wrote:
>>> On 11/2/2016 5:51 PM, Jike Song wrote:
>>>> On 11/02/2016 12:09 PM, Alexey Kardashevskiy wrote:
>>>>> Or you could just reference and use @mm as KVM and others do. Or there is
>>>>> anything else you need from @current than just @mm?
>>>>>
>>>>
>>>> I agree. If @mm is the only thing needed, there is really no reason to
>>>> refer to the @task :-)
>>>>
>>>
>>> In vfio_lock_acct(), that is for page accounting, if mm->mmap_sem is
>>> already held then page accounting is deferred, where task structure is
>>> used to get mm and work is deferred only if mm exist:
>>> 	mm = get_task_mm(task);
>>>
>>> That is where this module need task structure.
>>
>> Kirti,
>>
>> By calling get_task_mm you hold a ref on @mm and save it in iommu,
>> whenever you want to do something like vfio_lock_acct(), use that mm
>> (as you said, if mmap_sem not accessible then defer it to a work, but
>> still @mm is the whole information), and put it after the usage.
>>
>> I still can't see any reason that the @task have to be saved. It's
>> always the @mm all the time. Did I miss anything?
>>
> 
> If the process is terminated by SIGKILL, as Alexey mentioned in this
> mail thread earlier exit_mm() is called first and then all files are
> closed. From exit_mm(), task->mm is set to NULL. So from teardown path,
> we should call get_task_mm(task) to get current status intsead of using
> stale pointer.

You have taken a ref on task->mm and stored it somewhere; after that,
at some point, task->mm is set to NULL -- what exactly is the problem
here? The saved mm is perfectly okay to use, per my understanding ...

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v9 04/12] vfio iommu: Add support for mediated devices
  2016-11-02 13:18                       ` Kirti Wankhede
  2016-11-02 13:35                         ` Jike Song
@ 2016-11-03  4:29                         ` Alexey Kardashevskiy
  1 sibling, 0 replies; 73+ messages in thread
From: Alexey Kardashevskiy @ 2016-11-03  4:29 UTC (permalink / raw)
  To: Kirti Wankhede, Jike Song
  Cc: alex.williamson, pbonzini, kraxel, cjia, kvm, linux-kernel,
	kevin.tian, qemu-devel, bjsdjshi



On 03/11/16 00:18, Kirti Wankhede wrote:
> 
> 
> On 11/2/2016 6:30 PM, Jike Song wrote:
>> On 11/02/2016 08:41 PM, Kirti Wankhede wrote:
>>> On 11/2/2016 5:51 PM, Jike Song wrote:
>>>> On 11/02/2016 12:09 PM, Alexey Kardashevskiy wrote:
>>>>> Or you could just reference and use @mm as KVM and others do. Or there is
>>>>> anything else you need from @current than just @mm?
>>>>>
>>>>
>>>> I agree. If @mm is the only thing needed, there is really no reason to
>>>> refer to the @task :-)
>>>>
>>>
>>> In vfio_lock_acct(), that is for page accounting, if mm->mmap_sem is
>>> already held then page accounting is deferred, where task structure is
>>> used to get mm and work is deferred only if mm exist:
>>> 	mm = get_task_mm(task);

get_task_mm() increments mm_users, which is basically the number of
userspace users holding a reference to the mm. As in this case it is
not a userspace user, mm_count needs to be incremented imho.
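
In other words (a sketch of the two reference counts, assuming a cached
mm pointer):

	/* mm_users: userspace users of the address space; the mappings
	 * stay alive.  get_task_mm() returns NULL once the task has
	 * passed exit_mm(). */
	mm = get_task_mm(task);
	...
	mmput(mm);

	/* mm_count: references to the mm_struct itself; the structure
	 * stays valid for cleanup even after exit_mm(), though the
	 * mappings may already be gone. */
	atomic_inc(&mm->mm_count);
	...
	mmdrop(mm);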


>>>
>>> That is where this module need task structure.
>>
>> Kirti,
>>
>> By calling get_task_mm you hold a ref on @mm and save it in iommu,
>> whenever you want to do something like vfio_lock_acct(), use that mm
>> (as you said, if mmap_sem not accessible then defer it to a work, but
>> still @mm is the whole information), and put it after the usage.
>>
>> I still can't see any reason that the @task have to be saved. It's
>> always the @mm all the time. Did I miss anything?
>>
> 
> If the process is terminated by SIGKILL, as Alexey mentioned in this
> mail thread earlier exit_mm() is called first and then all files are
> closed. From exit_mm(), task->mm is set to NULL. So from teardown path,
> we should call get_task_mm(task)

... which will return NULL, no?

> to get current status intsead of using
> stale pointer.

If you increment either mm_users or mm_count at the exact place where
you want to cache the task pointer, why would the mm pointer become
stale before you do mmdrop() or mmput()?


-- 
Alexey



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 00/12] Add Mediated device support
  2016-10-24  7:07 ` Jike Song
@ 2016-12-05 17:44   ` Gerd Hoffmann
  2016-12-06  2:24     ` Jike Song
  0 siblings, 1 reply; 73+ messages in thread
From: Gerd Hoffmann @ 2016-12-05 17:44 UTC (permalink / raw)
  To: Jike Song
  Cc: Kirti Wankhede, alex.williamson, cjia, pbonzini, qemu-devel, kvm,
	kevin.tian, bjsdjshi, linux-kernel

  Hi,

> Just want to share that we have published a KVMGT implementation
> based on this v9 patchset, to:
> 
> 	https://github.com/01org/gvt-linux/tree/gvt-next-kvmgt
> 
> It doesn't utilize common routines introduced by 05+ patches yet.
> The complete intel vGPU device-model is contained.

Tried to use this implementation.  Used the
topic/gvt-next-kvmgt-mdev-2016-11-18 branch which looked like the most
recent one.  Setup:

  * Everything compiled as modules.
  * iommu turned off for the igd (intel_iommu=on,igfx_off).
  * Blacklisted i915 so dracut initrd doesn't load it
    (rd.driver.blacklist=i915)
  * tweaked module config so kvmgt is loaded before i915,
    also enable gvt:

      # cat /etc/modprobe.d/kraxel-gvt.conf 
      options i915 enable_gvt=1
      softdep i915 pre: kvmgt

Everything seems to load fine.  Sysfs files are there, and I can create
vgpus.

Trying to assign a vgpu this way:

  -device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:02.0/<uuid>

fails though and gives this message in the kernel log:

  [  402.560350] [drm:intel_vgpu_open [kvmgt]] *ERROR* gvt: KVM is
required to use Intel vGPU

Trying the same with the mtty sample device works, and I can see the
PCI serial device in the guest.

Any clues what is going wrong?

Does this version have any support for exporting the guest display as a
dma-buf, so qemu can show it?  Or is this a headless vgpu?

thanks,
  Gerd

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 00/12] Add Mediated device support
  2016-12-05 17:44   ` Gerd Hoffmann
@ 2016-12-06  2:24     ` Jike Song
  2016-12-07 14:40       ` Gerd Hoffmann
  0 siblings, 1 reply; 73+ messages in thread
From: Jike Song @ 2016-12-06  2:24 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Kirti Wankhede, alex.williamson, cjia, pbonzini, qemu-devel, kvm,
	kevin.tian, bjsdjshi, linux-kernel

On 12/06/2016 01:44 AM, Gerd Hoffmann wrote:
>   Hi,
> 
>> Just want to share that we have published a KVMGT implementation
>> based on this v9 patchset, to:
>>
>> 	https://github.com/01org/gvt-linux/tree/gvt-next-kvmgt
>>
>> It doesn't utilize common routines introduced by 05+ patches yet.
>> The complete intel vGPU device-model is contained.
> 
> Tried to use this implementation.  Used the
> topic/gvt-next-kvmgt-mdev-2016-11-18 branch which looked like the most
> recent one.  Setup:
> 

Hi Gerd,

We haven't yet updated the published branch to the newest kvmgt code,
partly because we are preparing the 'final' version to be upstreamed.

I will update a topic/gvt-next-kvmgt-2016-12-06 branch today, sorry for
the inconvenience :)

>   * Everything compiled as modules.
>   * iommu turned off for the igd (intel_iommu=on,igfx_off).
>   * Blacklisted i915 so dracut initrd doesn't load it
>     (rd.driver.blacklist=i915)
>   * tweaked module config so kvmgt is loaded before i915,
>     also enable gvt:
> 
>       # cat /etc/modprobe.d/kraxel-gvt.conf 
>       options i915 enable_gvt=1
>       softdep i915 pre: kvmgt
> 
> Everything seems to load fine.  Sysfs files are there, and I can create
> vgpus.
> 

Yes, everything looks good so far.

> Trying to assign a vgpu this way:
> 
>   -device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:02.0/<uuid>
> 
> fails though and gives this message in the kernel log:
> 
>   [  402.560350] [drm:intel_vgpu_open [kvmgt]] *ERROR* gvt: KVM is
> required to use Intel vGPU
> 
> Trying the same with a mtty sample device works and I can see the pci
> serial device in the guest.
> 
> Any clues what is going wrong?

The code for getting the kvm instance is missing in that branch; it
will be included in the new one.

> Has this version any support for exporting the guest display as dma-buf,
> so qemu can show it?  Or is this a headless vgpu?

No, this version doesn't have dma-buf support yet; we were using x11vnc
in the guest to test it internally. I'll include you in the igvt-g-dev
mailing list for further discussion :)

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v9 00/12] Add Mediated device support
  2016-12-06  2:24     ` Jike Song
@ 2016-12-07 14:40       ` Gerd Hoffmann
  0 siblings, 0 replies; 73+ messages in thread
From: Gerd Hoffmann @ 2016-12-07 14:40 UTC (permalink / raw)
  To: Jike Song
  Cc: Kirti Wankhede, alex.williamson, cjia, pbonzini, qemu-devel, kvm,
	kevin.tian, bjsdjshi, linux-kernel

  Hi,

> Will update a topic/gvt-next-kvmgt-2016-12-06 today, sorry for the
> inconvenience :)

Thanks, that brings us one step forward.

Linux guest can see the device (in lspci).  Trying to load i915.ko leads
to kernel oopses on both guest and host though.  So I guess the guest
driver can't handle the device.

I've tried RHEL-7.3, kernel 3.10.0-514.el7.x86_64.  The drivers/gpu/drm/
subsystem is roughly at upstream kernel v4.6 level.  Is that too old?
IIRC some vgpu guest patches have been merged in v4.2 timeframe, so I
kind-of expected this to work ...

I'll go try the kvmgt branch in the guest next.

What is the upstream merge status, btw?  As far as I know the bulk of
the gvt code is scheduled for 4.10.  What about the gvt/mdev plumbing?
Will that land in 4.10 too, or is that planned for 4.11?

thanks,
  Gerd

^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2016-12-07 14:40 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-17 21:22 [PATCH v9 00/12] Add Mediated device support Kirti Wankhede
2016-10-17 21:22 ` [PATCH v9 01/12] vfio: Mediated device Core driver Kirti Wankhede
2016-10-18 23:16   ` Alex Williamson
2016-10-19 19:16     ` Kirti Wankhede
2016-10-19 22:20       ` Alex Williamson
2016-10-20  7:23   ` Jike Song
2016-10-20 17:12     ` Alex Williamson
2016-10-21  2:41       ` Jike Song
2016-10-27  5:56       ` Jike Song
2016-10-26  6:52   ` Tian, Kevin
2016-10-26 14:58     ` Kirti Wankhede
2016-10-17 21:22 ` [PATCH v9 02/12] vfio: VFIO based driver for Mediated devices Kirti Wankhede
2016-10-26  6:57   ` Tian, Kevin
2016-10-26 15:01     ` Kirti Wankhede
2016-10-17 21:22 ` [PATCH v9 03/12] vfio: Rearrange functions to get vfio_group from dev Kirti Wankhede
2016-10-19 17:26   ` Alex Williamson
2016-10-17 21:22 ` [PATCH v9 04/12] vfio iommu: Add support for mediated devices Kirti Wankhede
2016-10-19 21:02   ` Alex Williamson
2016-10-20 20:17     ` Kirti Wankhede
2016-10-24  2:32       ` Alex Williamson
2016-10-26  7:19         ` Tian, Kevin
2016-10-26 15:06           ` Kirti Wankhede
2016-10-26  7:53     ` Tian, Kevin
2016-10-26 15:16       ` Alex Williamson
2016-10-26  7:54     ` Tian, Kevin
2016-10-26 15:19       ` Alex Williamson
2016-10-21  7:49   ` Jike Song
2016-10-21 14:36     ` Alex Williamson
2016-10-24 10:35       ` Kirti Wankhede
2016-10-27  7:20   ` [Qemu-devel] " Alexey Kardashevskiy
2016-10-27 12:31     ` Kirti Wankhede
2016-10-27 14:30       ` Alex Williamson
2016-10-27 15:59         ` Kirti Wankhede
2016-10-28  2:18       ` Alexey Kardashevskiy
2016-11-01 14:01         ` Kirti Wankhede
2016-11-02  1:24           ` Alexey Kardashevskiy
2016-11-02  3:29             ` Kirti Wankhede
2016-11-02  4:09               ` Alexey Kardashevskiy
2016-11-02 12:21                 ` Jike Song
2016-11-02 12:41                   ` Kirti Wankhede
2016-11-02 13:00                     ` Jike Song
2016-11-02 13:18                       ` Kirti Wankhede
2016-11-02 13:35                         ` Jike Song
2016-11-03  4:29                         ` Alexey Kardashevskiy
2016-10-17 21:22 ` [PATCH v9 05/12] vfio: Introduce common function to add capabilities Kirti Wankhede
2016-10-20 19:24   ` Alex Williamson
2016-10-24 21:27     ` Kirti Wankhede
2016-10-24 21:39       ` Alex Williamson
2016-10-17 21:22 ` [PATCH v9 06/12] vfio_pci: Update vfio_pci to use vfio_info_add_capability() Kirti Wankhede
2016-10-20 19:24   ` Alex Williamson
2016-10-24 21:22     ` Kirti Wankhede
2016-10-24 21:37       ` Alex Williamson
2016-10-17 21:22 ` [PATCH v9 07/12] vfio: Introduce vfio_set_irqs_validate_and_prepare() Kirti Wankhede
2016-10-17 21:22 ` [PATCH v9 08/12] vfio_pci: Updated to use vfio_set_irqs_validate_and_prepare() Kirti Wankhede
2016-10-17 21:22 ` [PATCH v9 09/12] vfio_platform: " Kirti Wankhede
2016-10-17 21:22 ` [PATCH v9 10/12] vfio: Add function to get device_api string from vfio_device_info.flags Kirti Wankhede
2016-10-20 19:34   ` Alex Williamson
2016-10-20 20:29     ` Kirti Wankhede
2016-10-20 21:05       ` Alex Williamson
2016-10-20 21:14         ` Kirti Wankhede
2016-10-20 21:22           ` Alex Williamson
2016-10-21  3:00             ` Kirti Wankhede
2016-10-21  3:20               ` Alex Williamson
2016-10-17 21:22 ` [PATCH v9 11/12] docs: Add Documentation for Mediated devices Kirti Wankhede
2016-10-25 16:17   ` Alex Williamson
2016-10-17 21:22 ` [PATCH v9 12/12] docs: Sample driver to demonstrate how to use Mediated device framework Kirti Wankhede
     [not found]   ` <20161018025411.GA22572@bjsdjshi@linux.vnet.ibm.com>
2016-10-18 17:17     ` Alex Williamson
2016-10-19 19:19       ` Kirti Wankhede
2016-10-17 21:41 ` [PATCH v9 00/12] Add Mediated device support Alex Williamson
2016-10-24  7:07 ` Jike Song
2016-12-05 17:44   ` Gerd Hoffmann
2016-12-06  2:24     ` Jike Song
2016-12-07 14:40       ` Gerd Hoffmann
