* [PATCH v8 0/6] Add Mediated device support
@ 2016-10-10 20:28 ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-10 20:28 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

This series adds Mediated device support to the Linux host kernel. The
purpose of this series is to provide a common interface for mediated device
management that can be used by different devices. This series introduces the
mdev core module, which creates and manages mediated devices; a VFIO-based
driver for the mediated devices created by the mdev core module; and an
update to the VFIO type1 IOMMU module to support pinning and unpinning for
mediated devices.

This change uses uuid_le_to_bin() to parse the UUID string and convert it to
binary. It requires the following commits from the Linux master branch:
* commit bc9dc9d5eec908806f1b15c9ec2253d44dcf7835 :
        lib/uuid.c: use correct offset in uuid parser
* commit 2b1b0d66704a8cafe83be7114ec4c15ab3a314ad :
        lib/uuid.c: introduce a few more generic helpers

The following commits from the Linux master branch are also required for the
mmap region fault handler, which uses remap_pfn_range() to set up EPT
properly.
* commit add6a0cd1c5ba51b201e1361b05a5df817083618 :
        KVM: MMU: try to fix up page faults before giving up
* commit 92176a8ede577d0ff78ab3298e06701f67ad5f51 :
        KVM: MMU: prepare to support mapping of VM_IO and VM_PFNMAP frames

What changed in v8?
mdev-core:
- Removed the start/stop (online/offline) interfaces.
- Added open() and close() interfaces, which the vendor driver should use to
  commit resources for mdev devices.
- Removed the supported_config callback function. Introduced the
  'mdev_supported_types' sysfs interface as discussed earlier. It is
  mandatory for the vendor driver to provide the supported types.
- Removed the 'mdev_create' and 'mdev_destroy' sysfs files from the device's
  directory. Added a 'create' file in each supported type group, which the
  vendor driver defines, and a 'remove' file in the mdev device directory to
  destroy the mdev device.

vfio_mdev:
- Added an ioctl() callback; all ioctls should be handled in the vendor
  driver.
- Added common functions for the SET_IRQS and GET_REGION_INFO ioctls to
  reduce code duplication in vendor drivers.
- This forms a shim layer that passes VFIO device operations through to the
  vendor driver for mediated devices.

vfio_iommu_type1:
- Handled the case where all devices attached to the normal IOMMU API domain
  go away while mdev devices still exist in the domain, updating page
  accounting for the local domain.
- Similarly, if a device is attached to the normal IOMMU API domain,
  mappings are established and page accounting is updated accordingly.
- Tested hot-plug and hot-unplug of a vGPU and a GPU pass-through device
  with a Linux VM.

Documentation:
- Updated vfio-mediated-device.txt to the current interface.
- Added a sample driver that simulates a serial port over a PCI card for a
  VM. This driver serves as an example of how to use the mediated device
  framework.
- Moved the updated document and the example driver to the 'vfio-mdev'
  directory in Documentation.


Kirti Wankhede (6):
  vfio: Mediated device Core driver
  vfio: VFIO based driver for Mediated devices
  vfio iommu: Add support for mediated devices
  docs: Add Documentation for Mediated devices
  Add simple sample driver for mediated device framework
  Add common functions for SET_IRQS and GET_REGION_INFO ioctls

 Documentation/vfio-mdev/Makefile                 |   14 +
 Documentation/vfio-mdev/mtty.c                   | 1353 ++++++++++++++++++++++
 Documentation/vfio-mdev/vfio-mediated-device.txt |  282 +++++
 drivers/vfio/Kconfig                             |    1 +
 drivers/vfio/Makefile                            |    1 +
 drivers/vfio/mdev/Kconfig                        |   18 +
 drivers/vfio/mdev/Makefile                       |    6 +
 drivers/vfio/mdev/mdev_core.c                    |  363 ++++++
 drivers/vfio/mdev/mdev_driver.c                  |  131 +++
 drivers/vfio/mdev/mdev_private.h                 |   41 +
 drivers/vfio/mdev/mdev_sysfs.c                   |  295 +++++
 drivers/vfio/mdev/vfio_mdev.c                    |  171 +++
 drivers/vfio/pci/vfio_pci.c                      |  103 +-
 drivers/vfio/pci/vfio_pci_private.h              |    6 +-
 drivers/vfio/vfio.c                              |  233 ++++
 drivers/vfio/vfio_iommu_type1.c                  |  685 +++++++++--
 include/linux/mdev.h                             |  178 +++
 include/linux/vfio.h                             |   20 +-
 18 files changed, 3743 insertions(+), 158 deletions(-)
 create mode 100644 Documentation/vfio-mdev/Makefile
 create mode 100644 Documentation/vfio-mdev/mtty.c
 create mode 100644 Documentation/vfio-mdev/vfio-mediated-device.txt
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 drivers/vfio/mdev/vfio_mdev.c
 create mode 100644 include/linux/mdev.h

-- 
2.7.0



* [PATCH v8 1/6] vfio: Mediated device Core driver
  2016-10-10 20:28 ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-10 20:28   ` Kirti Wankhede
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-10 20:28 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

Design for Mediated Device Driver:
The main purpose of this driver is to provide a common interface for
mediated device management that can be used by different drivers of
different devices.

This module provides a generic interface to create a device, add it to the
mediated bus, add the device to an IOMMU group, and then add it to a VFIO
group.

Below is the high-level block diagram, with NVIDIA, Intel, and IBM devices
as examples, since these are the devices that are going to actively use
this module as of now.

 +---------------+
 |               |
 | +-----------+ |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |  mdev     | |                         |              |
 | |  bus      | +------------------------>+              |<-> VFIO user
 | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
 | |           | |                         |              |
 | +-----------+ |                         +--------------+
 |               |
 |  MDEV CORE    |
 |   MODULE      |
 |   mdev.ko     |
 | +-----------+ |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         |  nvidia.ko   |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--------------+
 | | interface | |<------------------------+              |
 | |           | |                         |  i915.ko     |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | |           | |
 | |           | |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | +-----------+ |
 +---------------+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when new device created
  * @remove: called when device removed
  * @driver: device driver structure
  *
  **/
struct mdev_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         struct device_driver    driver;
};

A mediated bus driver for mdev devices should use these interfaces to
register with and unregister from the core driver, respectively:

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

The mediated bus driver is responsible for adding mediated devices to and
deleting them from the VFIO group when devices are bound to and unbound
from the driver.

2. Physical device driver interface
This interface provides the vendor driver a set of APIs to manage physical
device related work in its driver. The APIs are:

* dev_attr_groups: attributes of the parent device.
* mdev_attr_groups: attributes of the mediated device.
* supported_type_groups: attributes to define the supported types. This is
			 a mandatory field.
* create: to allocate basic resources in driver for a mediated device.
* remove: to free resources in driver when mediated device is destroyed.
* open: open callback of mediated device
* release: release callback of mediated device
* read: read emulation callback.
* write: write emulation callback.
* mmap: mmap emulation callback.
* ioctl: ioctl callback.

Drivers should use these interfaces to register a device with and
unregister it from the mdev core driver, respectively:

extern int  mdev_register_device(struct device *dev,
                                 const struct parent_ops *ops);
extern void mdev_unregister_device(struct device *dev);

There are no locks in the mdev and vfio_mdev drivers to serialize the above
callbacks. If required, the vendor driver can use locks to serialize these
APIs in its own driver.
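To make the registration shape concrete, a hypothetical vendor driver might
look like the sketch below. All names here are invented, the sketch is not
buildable stand-alone, and the create/remove signatures follow their call
sites in mdev_core.c in this patch:

```c
/* Hypothetical vendor-driver skeleton; my_type_groups and pdev are assumed
 * to be defined elsewhere in the vendor driver. */
static int my_mdev_create(struct kobject *kobj, struct mdev_device *mdev)
{
	/* allocate per-mdev resources, sized by the type named in kobj */
	return 0;
}

static int my_mdev_remove(struct mdev_device *mdev)
{
	/* return -EBUSY while the mdev is in use, else free resources */
	return 0;
}

static const struct parent_ops my_parent_ops = {
	.supported_type_groups	= my_type_groups,	/* mandatory */
	.create			= my_mdev_create,	/* mandatory */
	.remove			= my_mdev_remove,	/* mandatory */
	/* .open/.release/.read/.write/.mmap/.ioctl as needed */
};

static int __init my_vendor_init(void)
{
	/* pdev is the vendor driver's physical (parent) device */
	return mdev_register_device(&pdev->dev, &my_parent_ops);
}

static void __exit my_vendor_exit(void)
{
	mdev_unregister_device(&pdev->dev);
}
```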

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I73a5084574270b14541c529461ea2f03c292d510
---
 drivers/vfio/Kconfig             |   1 +
 drivers/vfio/Makefile            |   1 +
 drivers/vfio/mdev/Kconfig        |  12 ++
 drivers/vfio/mdev/Makefile       |   5 +
 drivers/vfio/mdev/mdev_core.c    | 363 +++++++++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_driver.c  | 131 ++++++++++++++
 drivers/vfio/mdev/mdev_private.h |  41 +++++
 drivers/vfio/mdev/mdev_sysfs.c   | 295 +++++++++++++++++++++++++++++++
 include/linux/mdev.h             | 178 +++++++++++++++++++
 9 files changed, 1027 insertions(+)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 include/linux/mdev.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index da6e2ce77495..23eced02aaf6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
 
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
+source "drivers/vfio/mdev/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f63fea..4a23c13b6be4 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
+obj-$(CONFIG_VFIO_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
new file mode 100644
index 000000000000..23d5b9d08a5c
--- /dev/null
+++ b/drivers/vfio/mdev/Kconfig
@@ -0,0 +1,12 @@
+
+config VFIO_MDEV
+    tristate "Mediated device driver framework"
+    depends on VFIO
+    default n
+    help
+        Provides a framework to virtualize devices.
+        See Documentation/vfio-mdev/vfio-mediated-device.txt for more details.
+
+        If you don't know what to do here, say N.
+
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
new file mode 100644
index 000000000000..56a75e689582
--- /dev/null
+++ b/drivers/vfio/mdev/Makefile
@@ -0,0 +1,5 @@
+
+mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
+
+obj-$(CONFIG_VFIO_MDEV) += mdev.o
+
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
new file mode 100644
index 000000000000..019c196e62d5
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -0,0 +1,363 @@
+/*
+ * Mediated device Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION		"0.1"
+#define DRIVER_AUTHOR		"NVIDIA Corporation"
+#define DRIVER_DESC		"Mediated device Core Driver"
+
+static LIST_HEAD(parent_list);
+static DEFINE_MUTEX(parent_list_lock);
+
+static int _find_mdev_device(struct device *dev, void *data)
+{
+	struct mdev_device *mdev;
+
+	if (!dev_is_mdev(dev))
+		return 0;
+
+	mdev = to_mdev_device(dev);
+
+	if (uuid_le_cmp(mdev->uuid, *(uuid_le *)data) == 0)
+		return 1;
+
+	return 0;
+}
+
+static struct mdev_device *__find_mdev_device(struct parent_device *parent,
+					      uuid_le uuid)
+{
+	struct device *dev;
+
+	dev = device_find_child(parent->dev, &uuid, _find_mdev_device);
+	if (!dev)
+		return NULL;
+
+	put_device(dev);
+
+	return to_mdev_device(dev);
+}
+
+/* Should be called holding parent_list_lock */
+static struct parent_device *__find_parent_device(struct device *dev)
+{
+	struct parent_device *parent;
+
+	list_for_each_entry(parent, &parent_list, next) {
+		if (parent->dev == dev)
+			return parent;
+	}
+	return NULL;
+}
+
+static void mdev_release_parent(struct kref *kref)
+{
+	struct parent_device *parent = container_of(kref, struct parent_device,
+						    ref);
+	struct device *dev = parent->dev;
+
+	kfree(parent);
+	put_device(dev);
+}
+
+static
+inline struct parent_device *mdev_get_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_get(&parent->ref);
+
+	return parent;
+}
+
+static inline void mdev_put_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_put(&parent->ref, mdev_release_parent);
+}
+
+static int mdev_device_create_ops(struct kobject *kobj,
+				  struct mdev_device *mdev)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret;
+
+	ret = parent->ops->create(kobj, mdev);
+	if (ret)
+		return ret;
+
+	ret = sysfs_create_groups(&mdev->dev.kobj,
+				  parent->ops->mdev_attr_groups);
+	if (ret)
+		parent->ops->remove(mdev);
+
+	return ret;
+}
+
+static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret;
+
+	/*
+	 * Vendor driver can return error if VMM or userspace application is
+	 * using this mdev device.
+	 */
+	ret = parent->ops->remove(mdev);
+	if (ret && !force_remove)
+		return -EBUSY;
+
+	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
+	return 0;
+}
+
+/*
+ * mdev_register_device : Register a device
+ * @dev: device structure representing parent device.
+ * @ops: Parent device operation structure to be registered.
+ *
+ * Add device to list of registered parent devices.
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_device(struct device *dev, const struct parent_ops *ops)
+{
+	int ret = 0;
+	struct parent_device *parent;
+
+	/* check for mandatory ops */
+	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
+		return -EINVAL;
+
+	dev = get_device(dev);
+	if (!dev)
+		return -EINVAL;
+
+	mutex_lock(&parent_list_lock);
+
+	/* Check for duplicate */
+	parent = __find_parent_device(dev);
+	if (parent) {
+		ret = -EEXIST;
+		goto add_dev_err;
+	}
+
+	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
+	if (!parent) {
+		ret = -ENOMEM;
+		goto add_dev_err;
+	}
+
+	kref_init(&parent->ref);
+
+	parent->dev = dev;
+	parent->ops = ops;
+
+	ret = parent_create_sysfs_files(parent);
+	if (ret) {
+		mutex_unlock(&parent_list_lock);
+		mdev_put_parent(parent);
+		return ret;
+	}
+
+	list_add(&parent->next, &parent_list);
+	mutex_unlock(&parent_list_lock);
+
+	dev_info(dev, "MDEV: Registered\n");
+	return 0;
+
+add_dev_err:
+	mutex_unlock(&parent_list_lock);
+	put_device(dev);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_register_device);
+
+/*
+ * mdev_unregister_device : Unregister a parent device
+ * @dev: device structure representing parent device.
+ *
+ * Remove device from list of registered parent devices. Give a chance to free
+ * existing mediated devices for given device.
+ */
+
+void mdev_unregister_device(struct device *dev)
+{
+	struct parent_device *parent;
+	bool force_remove = true;
+
+	mutex_lock(&parent_list_lock);
+	parent = __find_parent_device(dev);
+
+	if (!parent) {
+		mutex_unlock(&parent_list_lock);
+		return;
+	}
+	dev_info(dev, "MDEV: Unregistering\n");
+
+	/*
+	 * Remove parent from the list and remove "mdev_supported_types"
+	 * sysfs files so that no new mediated device could be
+	 * created for this parent
+	 */
+	list_del(&parent->next);
+	parent_remove_sysfs_files(parent);
+
+	mutex_unlock(&parent_list_lock);
+
+	device_for_each_child(dev, (void *)&force_remove, mdev_device_remove);
+	mdev_put_parent(parent);
+}
+EXPORT_SYMBOL(mdev_unregister_device);
+
+/*
+ * Functions required for mdev_sysfs
+ */
+static void mdev_device_release(struct device *dev)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	dev_dbg(&mdev->dev, "MDEV: destroying\n");
+	kfree(mdev);
+}
+
+int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
+{
+	int ret;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	struct mdev_type *type = to_mdev_type(kobj);
+
+	parent = mdev_get_parent(type->parent);
+	if (!parent)
+		return -EINVAL;
+
+	/* Check for duplicate */
+	mdev = __find_mdev_device(parent, uuid);
+	if (mdev) {
+		ret = -EEXIST;
+		goto create_err;
+	}
+
+	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
+	if (!mdev) {
+		ret = -ENOMEM;
+		goto create_err;
+	}
+
+	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
+	mdev->parent = parent;
+	kref_init(&mdev->ref);
+
+	mdev->dev.parent  = dev;
+	mdev->dev.bus     = &mdev_bus_type;
+	mdev->dev.release = mdev_device_release;
+	dev_set_name(&mdev->dev, "%pUl", uuid.b);
+
+	ret = device_register(&mdev->dev);
+	if (ret) {
+		put_device(&mdev->dev);
+		goto create_err;
+	}
+
+	ret = mdev_device_create_ops(kobj, mdev);
+	if (ret)
+		goto create_failed;
+
+	ret = mdev_create_sysfs_files(&mdev->dev, type);
+	if (ret) {
+		mdev_device_remove_ops(mdev, true);
+		goto create_failed;
+	}
+
+	mdev->type_kobj = kobj;
+	dev_dbg(&mdev->dev, "MDEV: created\n");
+
+	return ret;
+
+create_failed:
+	device_unregister(&mdev->dev);
+
+create_err:
+	mdev_put_parent(parent);
+	return ret;
+}
+
+int mdev_device_remove(struct device *dev, void *data)
+{
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	struct mdev_type *type;
+	bool force_remove = true;
+	int ret = 0;
+
+	if (!dev_is_mdev(dev))
+		return 0;
+
+	mdev = to_mdev_device(dev);
+	parent = mdev->parent;
+	type = to_mdev_type(mdev->type_kobj);
+
+	if (data)
+		force_remove = *(bool *)data;
+
+	ret = mdev_device_remove_ops(mdev, force_remove);
+	if (ret)
+		return ret;
+
+	mdev_remove_sysfs_files(dev, type);
+	device_unregister(dev);
+	mdev_put_parent(parent);
+	return ret;
+}
+
+static int __init mdev_init(void)
+{
+	int ret;
+
+	ret = mdev_bus_register();
+	if (ret) {
+		pr_err("Failed to register mdev bus\n");
+		return ret;
+	}
+
+	/*
+	 * Attempt to load known vfio_mdev.  This gives us a working environment
+	 * without the user needing to explicitly load vfio_mdev driver.
+	 */
+	request_module_nowait("vfio_mdev");
+
+	return ret;
+}
+
+static void __exit mdev_exit(void)
+{
+	mdev_bus_unregister();
+}
+
+module_init(mdev_init)
+module_exit(mdev_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
new file mode 100644
index 000000000000..8afc2d8e5c04
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_driver.c
@@ -0,0 +1,131 @@
+/*
+ * MDEV driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+static int mdev_attach_iommu(struct mdev_device *mdev)
+{
+	int ret;
+	struct iommu_group *group;
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
+		return PTR_ERR(group);
+	}
+
+	ret = iommu_group_add_device(group, &mdev->dev);
+	if (ret) {
+		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
+		goto attach_fail;
+	}
+
+	mdev->group = group;
+
+	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
+				 iommu_group_id(group));
+attach_fail:
+	iommu_group_put(group);
+	return ret;
+}
+
+static void mdev_detach_iommu(struct mdev_device *mdev)
+{
+	iommu_group_remove_device(&mdev->dev);
+	mdev->group = NULL;
+	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
+}
+
+static int mdev_probe(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	ret = mdev_attach_iommu(mdev);
+	if (ret) {
+		dev_err(dev, "Failed to attach IOMMU\n");
+		return ret;
+	}
+
+	if (drv && drv->probe)
+		ret = drv->probe(dev);
+
+	if (ret)
+		mdev_detach_iommu(mdev);
+
+	return ret;
+}
+
+static int mdev_remove(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	if (drv && drv->remove)
+		drv->remove(dev);
+
+	mdev_detach_iommu(mdev);
+
+	return 0;
+}
+
+struct bus_type mdev_bus_type = {
+	.name		= "mdev",
+	.probe		= mdev_probe,
+	.remove		= mdev_remove,
+};
+EXPORT_SYMBOL_GPL(mdev_bus_type);
+
+/*
+ * mdev_register_driver - register a new MDEV driver
+ * @drv: the driver to register
+ * @owner: module owner of driver to be registered
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &mdev_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_register_driver);
+
+/*
+ * mdev_unregister_driver - unregister MDEV driver
+ * @drv: the driver to unregister
+ *
+ */
+void mdev_unregister_driver(struct mdev_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_unregister_driver);
+
+int mdev_bus_register(void)
+{
+	return bus_register(&mdev_bus_type);
+}
+
+void mdev_bus_unregister(void)
+{
+	bus_unregister(&mdev_bus_type);
+}
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
new file mode 100644
index 000000000000..8f6cbffda9bd
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -0,0 +1,41 @@
+/*
+ * Mediated device internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_PRIVATE_H
+#define MDEV_PRIVATE_H
+
+int  mdev_bus_register(void);
+void mdev_bus_unregister(void);
+
+struct mdev_type {
+	struct kobject kobj;
+	struct kobject *devices_kobj;
+	struct parent_device *parent;
+	struct list_head next;
+	struct attribute_group *group;
+};
+
+#define to_mdev_type_attr(_attr)	\
+	container_of(_attr, struct mdev_type_attribute, attr)
+#define to_mdev_type(_kobj)		\
+	container_of(_kobj, struct mdev_type, kobj)
+
+int  parent_create_sysfs_files(struct parent_device *parent);
+void parent_remove_sysfs_files(struct parent_device *parent);
+
+int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type);
+void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
+
+int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid);
+int  mdev_device_remove(struct device *dev, void *data);
+
+#endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
new file mode 100644
index 000000000000..228698f46234
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -0,0 +1,295 @@
+/*
+ * File attributes for Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+/* Static functions */
+
+static ssize_t mdev_type_attr_show(struct kobject *kobj,
+				     struct attribute *__attr, char *buf)
+{
+	struct mdev_type_attribute *attr = to_mdev_type_attr(__attr);
+	struct mdev_type *type = to_mdev_type(kobj);
+	ssize_t ret = -EIO;
+
+	if (attr->show)
+		ret = attr->show(kobj, type->parent->dev, buf);
+	return ret;
+}
+
+static ssize_t mdev_type_attr_store(struct kobject *kobj,
+				      struct attribute *__attr,
+				      const char *buf, size_t count)
+{
+	struct mdev_type_attribute *attr = to_mdev_type_attr(__attr);
+	struct mdev_type *type = to_mdev_type(kobj);
+	ssize_t ret = -EIO;
+
+	if (attr->store)
+		ret = attr->store(&type->kobj, type->parent->dev, buf, count);
+	return ret;
+}
+
+static const struct sysfs_ops mdev_type_sysfs_ops = {
+	.show = mdev_type_attr_show,
+	.store = mdev_type_attr_store,
+};
+
+static ssize_t create_store(struct kobject *kobj, struct device *dev,
+			    const char *buf, size_t count)
+{
+	char *str;
+	uuid_le uuid;
+	int ret;
+
+	str = kstrndup(buf, count, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	ret = uuid_le_to_bin(str, &uuid);
+	if (!ret) {
+		ret = mdev_device_create(kobj, dev, uuid);
+		if (ret)
+			pr_err("mdev_create: Failed to create mdev device\n");
+		else
+			ret = count;
+	}
+
+	kfree(str);
+	return ret;
+}
+
+MDEV_TYPE_ATTR_WO(create);
+
+static void mdev_type_release(struct kobject *kobj)
+{
+	struct mdev_type *type = to_mdev_type(kobj);
+
+	pr_debug("Releasing group %s\n", kobj->name);
+	kfree(type);
+}
+
+static struct kobj_type mdev_type_ktype = {
+	.sysfs_ops = &mdev_type_sysfs_ops,
+	.release = mdev_type_release,
+};
+
+struct mdev_type *add_mdev_supported_type(struct parent_device *parent,
+					  struct attribute_group *group)
+{
+	struct mdev_type *type;
+	int ret;
+
+	if (!group->name) {
+		pr_err("%s: Type name empty!\n", __func__);
+		return ERR_PTR(-EINVAL);
+	}
+
+	type = kzalloc(sizeof(*type), GFP_KERNEL);
+	if (!type)
+		return ERR_PTR(-ENOMEM);
+
+	type->kobj.kset = parent->mdev_types_kset;
+
+	ret = kobject_init_and_add(&type->kobj, &mdev_type_ktype, NULL,
+				   "%s", group->name);
+	if (ret) {
+		kfree(type);
+		return ERR_PTR(ret);
+	}
+
+	ret = sysfs_create_file(&type->kobj, &mdev_type_attr_create.attr);
+	if (ret)
+		goto attr_create_failed;
+
+	type->devices_kobj = kobject_create_and_add("devices", &type->kobj);
+	if (!type->devices_kobj) {
+		ret = -ENOMEM;
+		goto attr_devices_failed;
+	}
+
+	ret = sysfs_create_files(&type->kobj,
+				 (const struct attribute **)group->attrs);
+	if (ret)
+		goto attrs_failed;
+
+	type->group = group;
+	type->parent = parent;
+	return type;
+
+attrs_failed:
+	kobject_put(type->devices_kobj);
+attr_devices_failed:
+	sysfs_remove_file(&type->kobj, &mdev_type_attr_create.attr);
+attr_create_failed:
+	kobject_del(&type->kobj);
+	kobject_put(&type->kobj);
+	return ERR_PTR(ret);
+}
+
+static void remove_mdev_supported_type(struct mdev_type *type)
+{
+	sysfs_remove_files(&type->kobj,
+			   (const struct attribute **)type->group->attrs);
+	kobject_put(type->devices_kobj);
+	sysfs_remove_file(&type->kobj, &mdev_type_attr_create.attr);
+	kobject_del(&type->kobj);
+	kobject_put(&type->kobj);
+}
+
+static int add_mdev_supported_type_groups(struct parent_device *parent)
+{
+	int i;
+
+	for (i = 0; parent->ops->supported_type_groups[i]; i++) {
+		struct mdev_type *type;
+
+		type = add_mdev_supported_type(parent,
+					parent->ops->supported_type_groups[i]);
+		if (IS_ERR(type)) {
+			struct mdev_type *ltype, *tmp;
+
+			list_for_each_entry_safe(ltype, tmp, &parent->type_list,
+						  next) {
+				list_del(&ltype->next);
+				remove_mdev_supported_type(ltype);
+			}
+			return PTR_ERR(type);
+		}
+		list_add(&type->next, &parent->type_list);
+	}
+	return 0;
+}
+
+/* mdev sysfs Functions */
+
+void parent_remove_sysfs_files(struct parent_device *parent)
+{
+	struct mdev_type *type, *tmp;
+
+	list_for_each_entry_safe(type, tmp, &parent->type_list, next) {
+		list_del(&type->next);
+		remove_mdev_supported_type(type);
+	}
+
+	sysfs_remove_groups(&parent->dev->kobj, parent->ops->dev_attr_groups);
+	kset_unregister(parent->mdev_types_kset);
+}
+
+int parent_create_sysfs_files(struct parent_device *parent)
+{
+	int ret;
+
+	parent->mdev_types_kset = kset_create_and_add("mdev_supported_types",
+					       NULL, &parent->dev->kobj);
+
+	if (!parent->mdev_types_kset)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&parent->type_list);
+
+	ret = sysfs_create_groups(&parent->dev->kobj,
+				  parent->ops->dev_attr_groups);
+	if (ret)
+		goto create_err;
+
+	ret = add_mdev_supported_type_groups(parent);
+	if (ret) {
+		sysfs_remove_groups(&parent->dev->kobj,
+				    parent->ops->dev_attr_groups);
+		goto create_err;
+	}
+
+	return 0;
+
+create_err:
+	kset_unregister(parent->mdev_types_kset);
+	return ret;
+}
+
+static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
+			    const char *buf, size_t count)
+{
+	unsigned long val;
+
+	if (kstrtoul(buf, 0, &val) < 0)
+		return -EINVAL;
+
+	if (val && device_remove_file_self(dev, attr)) {
+		int ret;
+		bool force = false;
+
+		ret = mdev_device_remove(dev, (void *)&force);
+		if (ret) {
+			device_create_file(dev, attr);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+static DEVICE_ATTR_WO(remove);
+
+static const struct attribute *mdev_device_attrs[] = {
+	&dev_attr_remove.attr,
+	NULL,
+};
+
+int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type)
+{
+	int ret;
+
+	ret = sysfs_create_files(&dev->kobj, mdev_device_attrs);
+	if (ret) {
+		pr_err("Failed to create remove sysfs entry\n");
+		return ret;
+	}
+
+	ret = sysfs_create_link(type->devices_kobj, &dev->kobj, dev_name(dev));
+	if (ret) {
+		pr_err("Failed to create symlink in types\n");
+		goto device_link_failed;
+	}
+
+	ret = sysfs_create_link(&dev->kobj, &type->kobj,
+				kobject_name(&type->kobj));
+	if (ret) {
+		pr_err("Failed to create symlink in device directory\n");
+		goto type_link_failed;
+	}
+
+	return ret;
+
+type_link_failed:
+	sysfs_remove_link(type->devices_kobj, dev_name(dev));
+device_link_failed:
+	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
+	return ret;
+}
+
+void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type)
+{
+	sysfs_remove_link(&dev->kobj, kobject_name(&type->kobj));
+	sysfs_remove_link(type->devices_kobj, dev_name(dev));
+	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
+}
+
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
new file mode 100644
index 000000000000..93c177609efe
--- /dev/null
+++ b/include/linux/mdev.h
@@ -0,0 +1,178 @@
+/*
+ * Mediated device definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_H
+#define MDEV_H
+
+#include <uapi/linux/vfio.h>
+
+struct parent_device;
+
+/* Mediated device */
+struct mdev_device {
+	struct device		dev;
+	struct parent_device	*parent;
+	struct iommu_group	*group;
+	uuid_le			uuid;
+	void			*driver_data;
+
+	/* internal only */
+	struct kref		ref;
+	struct list_head	next;
+	struct kobject		*type_kobj;
+};
+
+
+/**
+ * struct parent_ops - Structure to be registered for each parent device to
+ * register the device to mdev module.
+ *
+ * @owner:		The module owner.
+ * @dev_attr_groups:	Attributes of the parent device.
+ * @mdev_attr_groups:	Attributes of the mediated device.
+ * @supported_type_groups: Attributes to define supported types. It is mandatory
+ *			to provide supported types.
+ * @create:		Called to allocate basic resources in parent device's
+ *			driver for a particular mediated device. It is
+ *			mandatory to provide create ops.
+ *			@kobj: kobject of type for which 'create' is called.
+ *			@mdev: mdev_device structure of the mediated
+ *			       device that is being created
+ *			Returns integer: success (0) or error (< 0)
+ * @remove:		Called to free resources in parent device's driver for
+ *			a mediated device. It is mandatory to provide 'remove'
+ *			ops.
+ *			@mdev: mdev_device device structure which is being
+ *			       destroyed
+ *			Returns integer: success (0) or error (< 0)
+ * @open:		Open mediated device.
+ *			@mdev: mediated device.
+ *			Returns integer: success (0) or error (< 0)
+ * @release:		release mediated device
+ *			@mdev: mediated device.
+ * @read:		Read emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: read buffer
+ *			@count: number of bytes to read
+ *			@pos: address.
+ *			Returns number of bytes read on success, or an error.
+ * @write:		Write emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: write buffer
+ *			@count: number of bytes to be written
+ *			@pos: address.
+ *			Returns number of bytes written on success, or an error.
+ * @ioctl:		IOCTL callback
+ *			@mdev: mediated device structure
+ *			@cmd: ioctl command
+ *			@arg: arguments to ioctl
+ * @mmap:		mmap callback
+ * A parent device that supports mediated devices should be registered with the
+ * mdev module along with a parent_ops structure.
+ */
+
+struct parent_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **mdev_attr_groups;
+	struct attribute_group **supported_type_groups;
+
+	int     (*create)(struct kobject *kobj, struct mdev_device *mdev);
+	int     (*remove)(struct mdev_device *mdev);
+	int     (*open)(struct mdev_device *mdev);
+	void    (*release)(struct mdev_device *mdev);
+	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
+			loff_t pos);
+	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
+			 loff_t pos);
+	ssize_t (*ioctl)(struct mdev_device *mdev, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
+};
+
+/* Parent Device */
+struct parent_device {
+	struct device		*dev;
+	const struct parent_ops	*ops;
+
+	/* internal */
+	struct kref		ref;
+	struct list_head	next;
+	struct kset *mdev_types_kset;
+	struct list_head	type_list;
+};
+
+/* interface for exporting mdev supported type attributes */
+struct mdev_type_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct kobject *kobj, struct device *dev, char *buf);
+	ssize_t (*store)(struct kobject *kobj, struct device *dev,
+			 const char *buf, size_t count);
+};
+
+#define MDEV_TYPE_ATTR(_name, _mode, _show, _store)		\
+struct mdev_type_attribute mdev_type_attr_##_name =		\
+	__ATTR(_name, _mode, _show, _store)
+#define MDEV_TYPE_ATTR_RW(_name) \
+	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_RW(_name)
+#define MDEV_TYPE_ATTR_RO(_name) \
+	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_RO(_name)
+#define MDEV_TYPE_ATTR_WO(_name) \
+	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_WO(_name)
+
+/**
+ * struct mdev_driver - Mediated device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @driver: device driver structure
+ *
+ **/
+struct mdev_driver {
+	const char *name;
+	int  (*probe)(struct device *dev);
+	void (*remove)(struct device *dev);
+	struct device_driver driver;
+};
+
+static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
+}
+
+static inline struct mdev_device *to_mdev_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
+}
+
+static inline void *mdev_get_drvdata(struct mdev_device *mdev)
+{
+	return mdev->driver_data;
+}
+
+static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
+{
+	mdev->driver_data = data;
+}
+
+extern struct bus_type mdev_bus_type;
+
+#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
+
+extern int  mdev_register_device(struct device *dev,
+				 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+#endif /* MDEV_H */
-- 
2.7.0



* [Qemu-devel] [PATCH v8 1/6] vfio: Mediated device Core driver
@ 2016-10-10 20:28   ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-10 20:28 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

Design for Mediated Device Driver:
The main purpose of this driver is to provide a common interface for
mediated device management that can be used by different drivers of
different devices.

This module provides a generic interface to create the device, add it to
mediated bus, add device to IOMMU group and then add it to vfio group.

Below is the high-level block diagram, with NVIDIA, Intel and IBM devices
as examples, since these are the devices that are going to actively use
this module initially.

 +---------------+
 |               |
 | +-----------+ |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |  mdev     | |                         |              |
 | |  bus      | +------------------------>+              |<-> VFIO user
 | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
 | |           | |                         |              |
 | +-----------+ |                         +--------------+
 |               |
 |  MDEV CORE    |
 |   MODULE      |
 |   mdev.ko     |
 | +-----------+ |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         |  nvidia.ko   |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--------------+
 | | interface | |<------------------------+              |
 | |           | |                         |  i915.ko     |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | |           | |
 | |           | |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | +-----------+ |
 +---------------+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when new device created
  * @remove: called when device removed
  * @driver: device driver structure
  *
  **/
struct mdev_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         struct device_driver    driver;
};

A mediated bus driver for mdev devices should use these interfaces to
register with and unregister from the core driver:

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

The mediated bus driver is responsible for adding/deleting mediated devices
to/from the VFIO group when devices are bound to and unbound from the driver.
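For illustration, a minimal mediated bus driver built on this interface
could look like the sketch below. The sample_* names are hypothetical and
not part of this series; vfio_mdev.ko is the actual user of this interface:

```c
/*
 * Hypothetical sketch of a mediated bus driver registering with the
 * mdev core. The sample_* names are illustrative only.
 */
#include <linux/module.h>
#include <linux/device.h>
#include <linux/mdev.h>

static int sample_mdev_probe(struct device *dev)
{
	/* Bind the mediated device; vfio_mdev would add it to a VFIO group */
	dev_info(dev, "sample: mdev device bound\n");
	return 0;
}

static void sample_mdev_remove(struct device *dev)
{
	/* Undo whatever probe() set up for this mediated device */
	dev_info(dev, "sample: mdev device unbound\n");
}

static struct mdev_driver sample_mdev_driver = {
	.name	= "sample_mdev",
	.probe	= sample_mdev_probe,
	.remove	= sample_mdev_remove,
};

static int __init sample_init(void)
{
	return mdev_register_driver(&sample_mdev_driver, THIS_MODULE);
}

static void __exit sample_exit(void)
{
	mdev_unregister_driver(&sample_mdev_driver);
}

module_init(sample_init);
module_exit(sample_exit);
MODULE_LICENSE("GPL");
```

The core fills in driver.name, driver.bus and driver.owner before calling
driver_register(), so the bus driver only supplies its name and callbacks.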

2. Physical device driver interface
This interface provides the vendor driver a set of APIs to manage
physical-device-related work in its driver. The APIs are:

* dev_attr_groups: attributes of the parent device.
* mdev_attr_groups: attributes of the mediated device.
* supported_type_groups: attributes to define supported types. This is a
			 mandatory field.
* create: to allocate basic resources in driver for a mediated device.
* remove: to free resources in driver when mediated device is destroyed.
* open: open callback of mediated device
* release: release callback of mediated device
* read: read emulation callback.
* write: write emulation callback.
* mmap: mmap emulation callback.
* ioctl: ioctl callback.

Drivers should use these interfaces to register a device with and
unregister it from the mdev core driver:

extern int  mdev_register_device(struct device *dev,
                                 const struct parent_ops *ops);
extern void mdev_unregister_device(struct device *dev);

There are no locks to serialize the above callbacks in the mdev and
vfio_mdev drivers. If required, the vendor driver can use locks to
serialize these callbacks in its own driver.
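To show how a vendor driver would consume this interface, here is a hedged
sketch: the my_* names and the "type1" type group are invented for
illustration. It populates the mandatory supported_type_groups, create and
remove fields of parent_ops and registers the physical device with the
mdev core:

```c
/*
 * Hypothetical sketch of a vendor (parent) driver. The my_* names and
 * the "type1" group name are illustrative, not part of this series.
 */
#include <linux/module.h>
#include <linux/device.h>
#include <linux/mdev.h>

static int my_create(struct kobject *kobj, struct mdev_device *mdev)
{
	/* Allocate vendor-specific resources for this mediated device */
	return 0;
}

static int my_remove(struct mdev_device *mdev)
{
	/* Free vendor-specific resources; may return -EBUSY if in use */
	return 0;
}

/* Per-type attribute shown under mdev_supported_types/<type>/ */
static ssize_t name_show(struct kobject *kobj, struct device *dev, char *buf)
{
	return sprintf(buf, "%s\n", kobject_name(kobj));
}
static MDEV_TYPE_ATTR_RO(name);

static struct attribute *my_type_attrs[] = {
	&mdev_type_attr_name.attr,
	NULL,
};

static struct attribute_group my_type_group = {
	.name  = "type1",
	.attrs = my_type_attrs,
};

static struct attribute_group *my_type_groups[] = {
	&my_type_group,
	NULL,
};

static const struct parent_ops my_parent_ops = {
	.owner			= THIS_MODULE,
	.supported_type_groups	= my_type_groups,	/* mandatory */
	.create			= my_create,		/* mandatory */
	.remove			= my_remove,		/* mandatory */
};

/* Called from the vendor driver's probe path for the physical device */
int my_register_parent(struct device *physical_dev)
{
	return mdev_register_device(physical_dev, &my_parent_ops);
}
```

With this registered, each group in supported_type_groups appears as a
directory under the parent device's mdev_supported_types in sysfs, and
writing a UUID to that directory's 'create' file invokes my_create().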

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I73a5084574270b14541c529461ea2f03c292d510
---
 drivers/vfio/Kconfig             |   1 +
 drivers/vfio/Makefile            |   1 +
 drivers/vfio/mdev/Kconfig        |  12 ++
 drivers/vfio/mdev/Makefile       |   5 +
 drivers/vfio/mdev/mdev_core.c    | 363 +++++++++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_driver.c  | 131 ++++++++++++++
 drivers/vfio/mdev/mdev_private.h |  41 +++++
 drivers/vfio/mdev/mdev_sysfs.c   | 295 +++++++++++++++++++++++++++++++
 include/linux/mdev.h             | 178 +++++++++++++++++++
 9 files changed, 1027 insertions(+)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 include/linux/mdev.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index da6e2ce77495..23eced02aaf6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
 
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
+source "drivers/vfio/mdev/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f63fea..4a23c13b6be4 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
+obj-$(CONFIG_VFIO_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
new file mode 100644
index 000000000000..23d5b9d08a5c
--- /dev/null
+++ b/drivers/vfio/mdev/Kconfig
@@ -0,0 +1,12 @@
+
+config VFIO_MDEV
+    tristate "Mediated device driver framework"
+    depends on VFIO
+    default n
+    help
+        Provides a framework to virtualize devices.
+        See Documentation/vfio-mdev/vfio-mediated-device.txt for more details.
+
+        If you don't know what to do here, say N.
+
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
new file mode 100644
index 000000000000..56a75e689582
--- /dev/null
+++ b/drivers/vfio/mdev/Makefile
@@ -0,0 +1,5 @@
+
+mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
+
+obj-$(CONFIG_VFIO_MDEV) += mdev.o
+
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
new file mode 100644
index 000000000000..019c196e62d5
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -0,0 +1,363 @@
+/*
+ * Mediated device Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION		"0.1"
+#define DRIVER_AUTHOR		"NVIDIA Corporation"
+#define DRIVER_DESC		"Mediated device Core Driver"
+
+static LIST_HEAD(parent_list);
+static DEFINE_MUTEX(parent_list_lock);
+
+static int _find_mdev_device(struct device *dev, void *data)
+{
+	struct mdev_device *mdev;
+
+	if (!dev_is_mdev(dev))
+		return 0;
+
+	mdev = to_mdev_device(dev);
+
+	if (uuid_le_cmp(mdev->uuid, *(uuid_le *)data) == 0)
+		return 1;
+
+	return 0;
+}
+
+static struct mdev_device *__find_mdev_device(struct parent_device *parent,
+					      uuid_le uuid)
+{
+	struct device *dev;
+
+	dev = device_find_child(parent->dev, &uuid, _find_mdev_device);
+	if (!dev)
+		return NULL;
+
+	put_device(dev);
+
+	return to_mdev_device(dev);
+}
+
+/* Should be called holding parent_list_lock */
+static struct parent_device *__find_parent_device(struct device *dev)
+{
+	struct parent_device *parent;
+
+	list_for_each_entry(parent, &parent_list, next) {
+		if (parent->dev == dev)
+			return parent;
+	}
+	return NULL;
+}
+
+static void mdev_release_parent(struct kref *kref)
+{
+	struct parent_device *parent = container_of(kref, struct parent_device,
+						    ref);
+	struct device *dev = parent->dev;
+
+	kfree(parent);
+	put_device(dev);
+}
+
+static
+inline struct parent_device *mdev_get_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_get(&parent->ref);
+
+	return parent;
+}
+
+static inline void mdev_put_parent(struct parent_device *parent)
+{
+	if (parent)
+		kref_put(&parent->ref, mdev_release_parent);
+}
+
+static int mdev_device_create_ops(struct kobject *kobj,
+				  struct mdev_device *mdev)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret;
+
+	ret = parent->ops->create(kobj, mdev);
+	if (ret)
+		return ret;
+
+	ret = sysfs_create_groups(&mdev->dev.kobj,
+				  parent->ops->mdev_attr_groups);
+	if (ret)
+		parent->ops->remove(mdev);
+
+	return ret;
+}
+
+static int mdev_device_remove_ops(struct mdev_device *mdev, bool force_remove)
+{
+	struct parent_device *parent = mdev->parent;
+	int ret;
+
+	/*
+	 * Vendor driver can return error if VMM or userspace application is
+	 * using this mdev device.
+	 */
+	ret = parent->ops->remove(mdev);
+	if (ret && !force_remove)
+		return -EBUSY;
+
+	sysfs_remove_groups(&mdev->dev.kobj, parent->ops->mdev_attr_groups);
+	return 0;
+}
+
+/*
+ * mdev_register_device : Register a device
+ * @dev: device structure representing parent device.
+ * @ops: Parent device operation structure to be registered.
+ *
+ * Add device to list of registered parent devices.
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_device(struct device *dev, const struct parent_ops *ops)
+{
+	int ret = 0;
+	struct parent_device *parent;
+
+	/* check for mandatory ops */
+	if (!ops || !ops->create || !ops->remove || !ops->supported_type_groups)
+		return -EINVAL;
+
+	dev = get_device(dev);
+	if (!dev)
+		return -EINVAL;
+
+	mutex_lock(&parent_list_lock);
+
+	/* Check for duplicate */
+	parent = __find_parent_device(dev);
+	if (parent) {
+		ret = -EEXIST;
+		goto add_dev_err;
+	}
+
+	parent = kzalloc(sizeof(*parent), GFP_KERNEL);
+	if (!parent) {
+		ret = -ENOMEM;
+		goto add_dev_err;
+	}
+
+	kref_init(&parent->ref);
+
+	parent->dev = dev;
+	parent->ops = ops;
+
+	ret = parent_create_sysfs_files(parent);
+	if (ret) {
+		mutex_unlock(&parent_list_lock);
+		mdev_put_parent(parent);
+		return ret;
+	}
+
+	list_add(&parent->next, &parent_list);
+	mutex_unlock(&parent_list_lock);
+
+	dev_info(dev, "MDEV: Registered\n");
+	return 0;
+
+add_dev_err:
+	mutex_unlock(&parent_list_lock);
+	put_device(dev);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_register_device);
+
+/*
+ * mdev_unregister_device : Unregister a parent device
+ * @dev: device structure representing parent device.
+ *
+ * Remove device from list of registered parent devices. Give a chance to free
+ * existing mediated devices for given device.
+ */
+
+void mdev_unregister_device(struct device *dev)
+{
+	struct parent_device *parent;
+	bool force_remove = true;
+
+	mutex_lock(&parent_list_lock);
+	parent = __find_parent_device(dev);
+
+	if (!parent) {
+		mutex_unlock(&parent_list_lock);
+		return;
+	}
+	dev_info(dev, "MDEV: Unregistering\n");
+
+	/*
+	 * Remove parent from the list and remove "mdev_supported_types"
+	 * sysfs files so that no new mediated device could be
+	 * created for this parent
+	 */
+	list_del(&parent->next);
+	parent_remove_sysfs_files(parent);
+
+	mutex_unlock(&parent_list_lock);
+
+	device_for_each_child(dev, (void *)&force_remove, mdev_device_remove);
+	mdev_put_parent(parent);
+}
+EXPORT_SYMBOL(mdev_unregister_device);
+
+/*
+ * Functions required for mdev_sysfs
+ */
+static void mdev_device_release(struct device *dev)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	dev_dbg(&mdev->dev, "MDEV: destroying\n");
+	kfree(mdev);
+}
+
+int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
+{
+	int ret;
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	struct mdev_type *type = to_mdev_type(kobj);
+
+	parent = mdev_get_parent(type->parent);
+	if (!parent)
+		return -EINVAL;
+
+	/* Check for duplicate */
+	mdev = __find_mdev_device(parent, uuid);
+	if (mdev) {
+		ret = -EEXIST;
+		goto create_err;
+	}
+
+	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
+	if (!mdev) {
+		ret = -ENOMEM;
+		goto create_err;
+	}
+
+	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
+	mdev->parent = parent;
+	kref_init(&mdev->ref);
+
+	mdev->dev.parent  = dev;
+	mdev->dev.bus     = &mdev_bus_type;
+	mdev->dev.release = mdev_device_release;
+	dev_set_name(&mdev->dev, "%pUl", uuid.b);
+
+	ret = device_register(&mdev->dev);
+	if (ret) {
+		put_device(&mdev->dev);
+		goto create_err;
+	}
+
+	ret = mdev_device_create_ops(kobj, mdev);
+	if (ret)
+		goto create_failed;
+
+	ret = mdev_create_sysfs_files(&mdev->dev, type);
+	if (ret) {
+		mdev_device_remove_ops(mdev, true);
+		goto create_failed;
+	}
+
+	mdev->type_kobj = kobj;
+	dev_dbg(&mdev->dev, "MDEV: created\n");
+
+	return ret;
+
+create_failed:
+	device_unregister(&mdev->dev);
+
+create_err:
+	mdev_put_parent(parent);
+	return ret;
+}
+
+int mdev_device_remove(struct device *dev, void *data)
+{
+	struct mdev_device *mdev;
+	struct parent_device *parent;
+	struct mdev_type *type;
+	bool force_remove = true;
+	int ret = 0;
+
+	if (!dev_is_mdev(dev))
+		return 0;
+
+	mdev = to_mdev_device(dev);
+	parent = mdev->parent;
+	type = to_mdev_type(mdev->type_kobj);
+
+	if (data)
+		force_remove = *(bool *)data;
+
+	ret = mdev_device_remove_ops(mdev, force_remove);
+	if (ret)
+		return ret;
+
+	mdev_remove_sysfs_files(dev, type);
+	device_unregister(dev);
+	mdev_put_parent(parent);
+	return ret;
+}
+
+static int __init mdev_init(void)
+{
+	int ret;
+
+	ret = mdev_bus_register();
+	if (ret) {
+		pr_err("Failed to register mdev bus\n");
+		return ret;
+	}
+
+	/*
+	 * Attempt to load known vfio_mdev.  This gives us a working environment
+	 * without the user needing to explicitly load vfio_mdev driver.
+	 */
+	request_module_nowait("vfio_mdev");
+
+	return ret;
+}
+
+static void __exit mdev_exit(void)
+{
+	mdev_bus_unregister();
+}
+
+module_init(mdev_init)
+module_exit(mdev_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/mdev/mdev_driver.c b/drivers/vfio/mdev/mdev_driver.c
new file mode 100644
index 000000000000..8afc2d8e5c04
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_driver.c
@@ -0,0 +1,131 @@
+/*
+ * MDEV driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+static int mdev_attach_iommu(struct mdev_device *mdev)
+{
+	int ret;
+	struct iommu_group *group;
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		dev_err(&mdev->dev, "MDEV: failed to allocate group!\n");
+		return PTR_ERR(group);
+	}
+
+	ret = iommu_group_add_device(group, &mdev->dev);
+	if (ret) {
+		dev_err(&mdev->dev, "MDEV: failed to add dev to group!\n");
+		goto attach_fail;
+	}
+
+	mdev->group = group;
+
+	dev_info(&mdev->dev, "MDEV: group_id = %d\n",
+				 iommu_group_id(group));
+attach_fail:
+	iommu_group_put(group);
+	return ret;
+}
+
+static void mdev_detach_iommu(struct mdev_device *mdev)
+{
+	iommu_group_remove_device(&mdev->dev);
+	mdev->group = NULL;
+	dev_info(&mdev->dev, "MDEV: detaching iommu\n");
+}
+
+static int mdev_probe(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	ret = mdev_attach_iommu(mdev);
+	if (ret) {
+		dev_err(dev, "Failed to attach IOMMU\n");
+		return ret;
+	}
+
+	if (drv && drv->probe)
+		ret = drv->probe(dev);
+
+	if (ret)
+		mdev_detach_iommu(mdev);
+
+	return ret;
+}
+
+static int mdev_remove(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	if (drv && drv->remove)
+		drv->remove(dev);
+
+	mdev_detach_iommu(mdev);
+
+	return 0;
+}
+
+struct bus_type mdev_bus_type = {
+	.name		= "mdev",
+	.probe		= mdev_probe,
+	.remove		= mdev_remove,
+};
+EXPORT_SYMBOL_GPL(mdev_bus_type);
+
+/*
+ * mdev_register_driver - register a new MDEV driver
+ * @drv: the driver to register
+ * @owner: module owner of driver to be registered
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &mdev_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_register_driver);
+
+/*
+ * mdev_unregister_driver - unregister MDEV driver
+ * @drv: the driver to unregister
+ *
+ */
+void mdev_unregister_driver(struct mdev_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_unregister_driver);
+
+int mdev_bus_register(void)
+{
+	return bus_register(&mdev_bus_type);
+}
+
+void mdev_bus_unregister(void)
+{
+	bus_unregister(&mdev_bus_type);
+}
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
new file mode 100644
index 000000000000..8f6cbffda9bd
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -0,0 +1,41 @@
+/*
+ * Mediated device internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_PRIVATE_H
+#define MDEV_PRIVATE_H
+
+int  mdev_bus_register(void);
+void mdev_bus_unregister(void);
+
+struct mdev_type {
+	struct kobject kobj;
+	struct kobject *devices_kobj;
+	struct parent_device *parent;
+	struct list_head next;
+	struct attribute_group *group;
+};
+
+#define to_mdev_type_attr(_attr)	\
+	container_of(_attr, struct mdev_type_attribute, attr)
+#define to_mdev_type(_kobj)		\
+	container_of(_kobj, struct mdev_type, kobj)
+
+int  parent_create_sysfs_files(struct parent_device *parent);
+void parent_remove_sysfs_files(struct parent_device *parent);
+
+int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type);
+void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type);
+
+int  mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid);
+int  mdev_device_remove(struct device *dev, void *data);
+
+#endif /* MDEV_PRIVATE_H */
diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
new file mode 100644
index 000000000000..228698f46234
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_sysfs.c
@@ -0,0 +1,295 @@
+/*
+ * File attributes for Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+/* Static functions */
+
+static ssize_t mdev_type_attr_show(struct kobject *kobj,
+				     struct attribute *__attr, char *buf)
+{
+	struct mdev_type_attribute *attr = to_mdev_type_attr(__attr);
+	struct mdev_type *type = to_mdev_type(kobj);
+	ssize_t ret = -EIO;
+
+	if (attr->show)
+		ret = attr->show(kobj, type->parent->dev, buf);
+	return ret;
+}
+
+static ssize_t mdev_type_attr_store(struct kobject *kobj,
+				      struct attribute *__attr,
+				      const char *buf, size_t count)
+{
+	struct mdev_type_attribute *attr = to_mdev_type_attr(__attr);
+	struct mdev_type *type = to_mdev_type(kobj);
+	ssize_t ret = -EIO;
+
+	if (attr->store)
+		ret = attr->store(&type->kobj, type->parent->dev, buf, count);
+	return ret;
+}
+
+static const struct sysfs_ops mdev_type_sysfs_ops = {
+	.show = mdev_type_attr_show,
+	.store = mdev_type_attr_store,
+};
+
+static ssize_t create_store(struct kobject *kobj, struct device *dev,
+			    const char *buf, size_t count)
+{
+	char *str;
+	uuid_le uuid;
+	int ret;
+
+	str = kstrndup(buf, count, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	ret = uuid_le_to_bin(str, &uuid);
+	if (!ret) {
+		ret = mdev_device_create(kobj, dev, uuid);
+		if (ret)
+			pr_err("mdev_create: Failed to create mdev device\n");
+		else
+			ret = count;
+	}
+
+	kfree(str);
+	return ret;
+}
+
+MDEV_TYPE_ATTR_WO(create);
+
+static void mdev_type_release(struct kobject *kobj)
+{
+	struct mdev_type *type = to_mdev_type(kobj);
+
+	pr_debug("Releasing group %s\n", kobj->name);
+	kfree(type);
+}
+
+static struct kobj_type mdev_type_ktype = {
+	.sysfs_ops = &mdev_type_sysfs_ops,
+	.release = mdev_type_release,
+};
+
+struct mdev_type *add_mdev_supported_type(struct parent_device *parent,
+					  struct attribute_group *group)
+{
+	struct mdev_type *type;
+	int ret;
+
+	if (!group->name) {
+		pr_err("%s: Type name empty!\n", __func__);
+		return ERR_PTR(-EINVAL);
+	}
+
+	type = kzalloc(sizeof(*type), GFP_KERNEL);
+	if (!type)
+		return ERR_PTR(-ENOMEM);
+
+	type->kobj.kset = parent->mdev_types_kset;
+
+	ret = kobject_init_and_add(&type->kobj, &mdev_type_ktype, NULL,
+				   "%s", group->name);
+	if (ret) {
+		kfree(type);
+		return ERR_PTR(ret);
+	}
+
+	ret = sysfs_create_file(&type->kobj, &mdev_type_attr_create.attr);
+	if (ret)
+		goto attr_create_failed;
+
+	type->devices_kobj = kobject_create_and_add("devices", &type->kobj);
+	if (!type->devices_kobj) {
+		ret = -ENOMEM;
+		goto attr_devices_failed;
+	}
+
+	ret = sysfs_create_files(&type->kobj,
+				 (const struct attribute **)group->attrs);
+	if (ret)
+		goto attrs_failed;
+
+	type->group = group;
+	type->parent = parent;
+	return type;
+
+attrs_failed:
+	kobject_put(type->devices_kobj);
+attr_devices_failed:
+	sysfs_remove_file(&type->kobj, &mdev_type_attr_create.attr);
+attr_create_failed:
+	kobject_del(&type->kobj);
+	kobject_put(&type->kobj);
+	return ERR_PTR(ret);
+}
+
+static void remove_mdev_supported_type(struct mdev_type *type)
+{
+	sysfs_remove_files(&type->kobj,
+			   (const struct attribute **)type->group->attrs);
+	kobject_put(type->devices_kobj);
+	sysfs_remove_file(&type->kobj, &mdev_type_attr_create.attr);
+	kobject_del(&type->kobj);
+	kobject_put(&type->kobj);
+}
+
+static int add_mdev_supported_type_groups(struct parent_device *parent)
+{
+	int i;
+
+	for (i = 0; parent->ops->supported_type_groups[i]; i++) {
+		struct mdev_type *type;
+
+		type = add_mdev_supported_type(parent,
+					parent->ops->supported_type_groups[i]);
+		if (IS_ERR(type)) {
+			struct mdev_type *ltype, *tmp;
+
+			list_for_each_entry_safe(ltype, tmp, &parent->type_list,
+						  next) {
+				list_del(&ltype->next);
+				remove_mdev_supported_type(ltype);
+			}
+			return PTR_ERR(type);
+		}
+		list_add(&type->next, &parent->type_list);
+	}
+	return 0;
+}
+
+/* mdev sysfs Functions */
+
+void parent_remove_sysfs_files(struct parent_device *parent)
+{
+	struct mdev_type *type, *tmp;
+
+	list_for_each_entry_safe(type, tmp, &parent->type_list, next) {
+		list_del(&type->next);
+		remove_mdev_supported_type(type);
+	}
+
+	sysfs_remove_groups(&parent->dev->kobj, parent->ops->dev_attr_groups);
+	kset_unregister(parent->mdev_types_kset);
+}
+
+int parent_create_sysfs_files(struct parent_device *parent)
+{
+	int ret;
+
+	parent->mdev_types_kset = kset_create_and_add("mdev_supported_types",
+					       NULL, &parent->dev->kobj);
+
+	if (!parent->mdev_types_kset)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&parent->type_list);
+
+	ret = sysfs_create_groups(&parent->dev->kobj,
+				  parent->ops->dev_attr_groups);
+	if (ret)
+		goto create_err;
+
+	ret = add_mdev_supported_type_groups(parent);
+	if (ret)
+		sysfs_remove_groups(&parent->dev->kobj,
+				    parent->ops->dev_attr_groups);
+	else
+		return ret;
+
+create_err:
+	kset_unregister(parent->mdev_types_kset);
+	return ret;
+}
+
+static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
+			    const char *buf, size_t count)
+{
+	unsigned long val;
+
+	if (kstrtoul(buf, 0, &val) < 0)
+		return -EINVAL;
+
+	if (val && device_remove_file_self(dev, attr)) {
+		int ret;
+		bool force = false;
+
+		ret = mdev_device_remove(dev, (void *)&force);
+		if (ret) {
+			device_create_file(dev, attr);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+static DEVICE_ATTR_WO(remove);
+
+static const struct attribute *mdev_device_attrs[] = {
+	&dev_attr_remove.attr,
+	NULL,
+};
+
+int  mdev_create_sysfs_files(struct device *dev, struct mdev_type *type)
+{
+	int ret;
+
+	ret = sysfs_create_files(&dev->kobj, mdev_device_attrs);
+	if (ret) {
+		pr_err("Failed to create remove sysfs entry\n");
+		return ret;
+	}
+
+	ret = sysfs_create_link(type->devices_kobj, &dev->kobj, dev_name(dev));
+	if (ret) {
+		pr_err("Failed to create symlink in types\n");
+		goto device_link_failed;
+	}
+
+	ret = sysfs_create_link(&dev->kobj, &type->kobj,
+				kobject_name(&type->kobj));
+	if (ret) {
+		pr_err("Failed to create symlink in device directory\n");
+		goto type_link_failed;
+	}
+
+	return ret;
+
+type_link_failed:
+	sysfs_remove_link(type->devices_kobj, dev_name(dev));
+device_link_failed:
+	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
+	return ret;
+}
+
+void mdev_remove_sysfs_files(struct device *dev, struct mdev_type *type)
+{
+	sysfs_remove_link(&dev->kobj, kobject_name(&type->kobj));
+	sysfs_remove_link(type->devices_kobj, dev_name(dev));
+	sysfs_remove_files(&dev->kobj, mdev_device_attrs);
+}
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
new file mode 100644
index 000000000000..93c177609efe
--- /dev/null
+++ b/include/linux/mdev.h
@@ -0,0 +1,178 @@
+/*
+ * Mediated device definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_H
+#define MDEV_H
+
+#include <uapi/linux/vfio.h>
+
+struct parent_device;
+
+/* Mediated device */
+struct mdev_device {
+	struct device		dev;
+	struct parent_device	*parent;
+	struct iommu_group	*group;
+	uuid_le			uuid;
+	void			*driver_data;
+
+	/* internal only */
+	struct kref		ref;
+	struct list_head	next;
+	struct kobject		*type_kobj;
+};
+
+/**
+ * struct parent_ops - Structure to be registered for each parent device to
+ * register the device to mdev module.
+ *
+ * @owner:		The module owner.
+ * @dev_attr_groups:	Attributes of the parent device.
+ * @mdev_attr_groups:	Attributes of the mediated device.
+ * @supported_type_groups: Attributes to define supported types. It is mandatory
+ *			to provide supported types.
+ * @create:		Called to allocate basic resources in parent device's
+ *			driver for a particular mediated device. It is
+ *			mandatory to provide create ops.
+ *			@kobj: kobject of type for which 'create' is called.
+ *			@mdev: mdev_device structure of the mediated device
+ *			      that is being created
+ *			Returns integer: success (0) or error (< 0)
+ * @remove:		Called to free resources in parent device's driver for
+ *			a mediated device. It is mandatory to provide 'remove'
+ *			ops.
+ *			@mdev: mdev_device device structure which is being
+ *			       destroyed
+ *			Returns integer: success (0) or error (< 0)
+ * @open:		Open mediated device.
+ *			@mdev: mediated device.
+ *			Returns integer: success (0) or error (< 0)
+ * @release:		release mediated device
+ *			@mdev: mediated device.
+ * @read:		Read emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: read buffer
+ *			@count: number of bytes to read
+ *			@pos: address.
+ *			Returns number of bytes read on success, or an error.
+ * @write:		Write emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: write buffer
+ *			@count: number of bytes to be written
+ *			@pos: address.
+ *			Returns number of bytes written on success, or an error.
+ * @ioctl:		IOCTL callback
+ *			@mdev: mediated device structure
+ *			@cmd: ioctl command
+ *			@arg: arguments to ioctl
+ * @mmap:		mmap callback
+ * A parent device that supports mediated devices should be registered with
+ * the mdev module with this parent_ops structure.
+ */
+
+struct parent_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **mdev_attr_groups;
+	struct attribute_group **supported_type_groups;
+
+	int     (*create)(struct kobject *kobj, struct mdev_device *mdev);
+	int     (*remove)(struct mdev_device *mdev);
+	int     (*open)(struct mdev_device *mdev);
+	void    (*release)(struct mdev_device *mdev);
+	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
+			loff_t pos);
+	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
+			 loff_t pos);
+	ssize_t (*ioctl)(struct mdev_device *mdev, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
+};
+
+/* Parent Device */
+struct parent_device {
+	struct device		*dev;
+	const struct parent_ops	*ops;
+
+	/* internal */
+	struct kref		ref;
+	struct list_head	next;
+	struct kset *mdev_types_kset;
+	struct list_head	type_list;
+};
+
+/* interface for exporting mdev supported type attributes */
+struct mdev_type_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct kobject *kobj, struct device *dev, char *buf);
+	ssize_t (*store)(struct kobject *kobj, struct device *dev,
+			 const char *buf, size_t count);
+};
+
+#define MDEV_TYPE_ATTR(_name, _mode, _show, _store)		\
+struct mdev_type_attribute mdev_type_attr_##_name =		\
+	__ATTR(_name, _mode, _show, _store)
+#define MDEV_TYPE_ATTR_RW(_name) \
+	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_RW(_name)
+#define MDEV_TYPE_ATTR_RO(_name) \
+	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_RO(_name)
+#define MDEV_TYPE_ATTR_WO(_name) \
+	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_WO(_name)
+
+/**
+ * struct mdev_driver - Mediated device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @driver: device driver structure
+ *
+ **/
+struct mdev_driver {
+	const char *name;
+	int  (*probe)(struct device *dev);
+	void (*remove)(struct device *dev);
+	struct device_driver driver;
+};
+
+static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
+}
+
+static inline struct mdev_device *to_mdev_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
+}
+
+static inline void *mdev_get_drvdata(struct mdev_device *mdev)
+{
+	return mdev->driver_data;
+}
+
+static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
+{
+	mdev->driver_data = data;
+}
+
+extern struct bus_type mdev_bus_type;
+
+#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
+
+extern int  mdev_register_device(struct device *dev,
+				 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+#endif /* MDEV_H */
-- 
2.7.0

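For orientation, everything this interface enables is driven from user space through sysfs: each supported type under a parent device exposes a 'create' attribute, and each mediated device exposes a 'remove' attribute. A hypothetical session (the parent device path and type name below are placeholders, not taken from this series) might look like:

```sh
# List the mdev types the parent device advertises.
ls /sys/bus/pci/devices/0000:05:00.0/mdev_supported_types

# Create a mediated device of one of those types.
UUID=$(uuidgen)
echo "$UUID" > \
    /sys/bus/pci/devices/0000:05:00.0/mdev_supported_types/TYPE/create

# The new device is linked under the type's 'devices' directory.
ls /sys/bus/pci/devices/0000:05:00.0/mdev_supported_types/TYPE/devices

# Writing 1 to the device's 'remove' attribute destroys it again.
echo 1 > /sys/bus/mdev/devices/$UUID/remove
```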
^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v8 2/6] vfio: VFIO based driver for Mediated devices
  2016-10-10 20:28 ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-10 20:28   ` Kirti Wankhede
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-10 20:28 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

The vfio_mdev driver registers with the mdev core driver.
The mdev core driver creates a mediated device and calls the probe
routine of the vfio_mdev driver for each device.
The probe routine of the vfio_mdev driver adds the mediated device to the
VFIO core module.

This driver forms a shim layer that passes VFIO device operations through
to the vendor driver for mediated devices.
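To make the split concrete: the vendor driver supplies a struct parent_ops
(patch 1/6) and registers its physical device with the mdev core, while this
vfio_mdev shim forwards the VFIO file operations to those callbacks. A
minimal, hypothetical vendor-side sketch follows; every "my_"-prefixed name
is illustrative only and not part of this series:

```c
#include <linux/module.h>
#include <linux/mdev.h>

/* Hypothetical per-type attribute groups; a real driver fills these in. */
static struct attribute_group *my_type_groups[] = {
	NULL,
};

static int my_create(struct kobject *kobj, struct mdev_device *mdev)
{
	/* Allocate per-mdev state; kobj identifies the requested type. */
	mdev_set_drvdata(mdev, NULL);
	return 0;
}

static int my_remove(struct mdev_device *mdev)
{
	/* Free whatever my_create() allocated. */
	return 0;
}

static const struct parent_ops my_parent_ops = {
	.owner			= THIS_MODULE,
	.supported_type_groups	= my_type_groups,
	.create			= my_create,
	.remove			= my_remove,
};

/*
 * In the physical device's probe routine:
 *	err = mdev_register_device(dev, &my_parent_ops);
 * and mdev_unregister_device(dev) on teardown. vfio_mdev's probe then
 * runs for each mediated device created under this parent.
 */
```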

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
---
 drivers/vfio/mdev/Kconfig           |   6 ++
 drivers/vfio/mdev/Makefile          |   1 +
 drivers/vfio/mdev/vfio_mdev.c       | 171 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |   6 +-
 4 files changed, 181 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vfio/mdev/vfio_mdev.c

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index 23d5b9d08a5c..e1b23697261d 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,4 +9,10 @@ config VFIO_MDEV
 
         If you don't know what to do here, say N.
 
+config VFIO_MDEV_DEVICE
+	tristate "VFIO support for Mediated devices"
+	depends on VFIO && VFIO_MDEV
+	default n
+	help
+	  VFIO based driver for mediated devices.
 
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 56a75e689582..e5087ed83a34 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,4 +2,5 @@
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
 obj-$(CONFIG_VFIO_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MDEV_DEVICE) += vfio_mdev.o
 
diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
new file mode 100644
index 000000000000..1efc3f309510
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mdev.c
@@ -0,0 +1,171 @@
+/*
+ * VFIO based driver for Mediated device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC     "VFIO based driver for Mediated device"
+
+struct vfio_mdev {
+	struct iommu_group *group;
+	struct mdev_device *mdev;
+};
+
+static int vfio_mdev_open(void *device_data)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct parent_device *parent = vmdev->mdev->parent;
+	int ret;
+
+	if (unlikely(!parent->ops->open))
+		return -EINVAL;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	ret = parent->ops->open(vmdev->mdev);
+	if (ret)
+		module_put(THIS_MODULE);
+
+	return ret;
+}
+
+static void vfio_mdev_release(void *device_data)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct parent_device *parent = vmdev->mdev->parent;
+
+	if (parent->ops->release)
+		parent->ops->release(vmdev->mdev);
+
+	module_put(THIS_MODULE);
+}
+
+static long vfio_mdev_unlocked_ioctl(void *device_data,
+				     unsigned int cmd, unsigned long arg)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct parent_device *parent = vmdev->mdev->parent;
+
+	if (unlikely(!parent->ops->ioctl))
+		return -EINVAL;
+
+	return parent->ops->ioctl(vmdev->mdev, cmd, arg);
+}
+
+static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct parent_device *parent = vmdev->mdev->parent;
+
+	if (unlikely(!parent->ops->read))
+		return -EINVAL;
+
+	return parent->ops->read(vmdev->mdev, buf, count, *ppos);
+}
+
+static ssize_t vfio_mdev_write(void *device_data, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct parent_device *parent = vmdev->mdev->parent;
+
+	if (unlikely(!parent->ops->write))
+		return -EINVAL;
+
+	return parent->ops->write(vmdev->mdev, (char *)buf, count, *ppos);
+}
+
+static int vfio_mdev_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	struct vfio_mdev *vmdev = device_data;
+	struct parent_device *parent = vmdev->mdev->parent;
+
+	if (unlikely(!parent->ops->mmap))
+		return -EINVAL;
+
+	return parent->ops->mmap(vmdev->mdev, vma);
+}
+
+static const struct vfio_device_ops vfio_mdev_dev_ops = {
+	.name		= "vfio-mdev",
+	.open		= vfio_mdev_open,
+	.release	= vfio_mdev_release,
+	.ioctl		= vfio_mdev_unlocked_ioctl,
+	.read		= vfio_mdev_read,
+	.write		= vfio_mdev_write,
+	.mmap		= vfio_mdev_mmap,
+};
+
+int vfio_mdev_probe(struct device *dev)
+{
+	struct vfio_mdev *vmdev;
+	struct mdev_device *mdev = to_mdev_device(dev);
+	int ret;
+
+	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
+	if (!vmdev)
+		return -ENOMEM;
+
+	vmdev->mdev = mdev;
+	vmdev->group = mdev->group;
+
+	ret = vfio_add_group_dev(dev, &vfio_mdev_dev_ops, vmdev);
+	if (ret)
+		kfree(vmdev);
+
+	return ret;
+}
+
+void vfio_mdev_remove(struct device *dev)
+{
+	struct vfio_mdev *vmdev;
+
+	vmdev = vfio_del_group_dev(dev);
+	kfree(vmdev);
+}
+
+struct mdev_driver vfio_mdev_driver = {
+	.name	= "vfio_mdev",
+	.probe	= vfio_mdev_probe,
+	.remove	= vfio_mdev_remove,
+};
+
+static int __init vfio_mdev_init(void)
+{
+	return mdev_register_driver(&vfio_mdev_driver, THIS_MODULE);
+}
+
+static void __exit vfio_mdev_exit(void)
+{
+	mdev_unregister_driver(&vfio_mdev_driver);
+}
+
+module_init(vfio_mdev_init)
+module_exit(vfio_mdev_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 016c14a1b454..776cc2b063d4 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -21,9 +21,9 @@
 
 #define VFIO_PCI_OFFSET_SHIFT   40
 
-#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
 
 /* Special capability IDs predefined access */
 #define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
-- 
2.7.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v8 3/6] vfio iommu: Add support for mediated devices
  2016-10-10 20:28 ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-10 20:28   ` Kirti Wankhede
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-10 20:28 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

VFIO IOMMU drivers are designed for devices that are IOMMU capable.
A mediated device only uses IOMMU APIs; the underlying hardware can be
managed by an IOMMU domain.

Aim of this change is:
- To use most of the code of TYPE1 IOMMU driver for mediated devices
- To support direct assigned device and mediated device in single module

Added two new callback functions to struct vfio_iommu_driver_ops. A backend
IOMMU module that supports pinning and unpinning pages for mdev devices
should provide these functions.
Added APIs for pinning and unpinning pages to the VFIO module. These call
back into the backend IOMMU module to actually pin and unpin pages.

This change adds pin and unpin support for mediated devices to the TYPE1
IOMMU backend module. More details:
- When the iommu_group of a mediated device is attached, the task structure
  is cached and is used later for pinning pages and page accounting.
- It keeps track of pinned pages for the mediated domain. This data is used
  to verify unpinning requests and to unpin any remaining pages while
  detaching.
- Used the existing mechanism for page accounting. If an IOMMU capable
  domain exists in the container then all pages are already pinned and
  accounted. Accounting for an mdev device is only done if there is no
  IOMMU capable domain in the container.
- Page accounting is updated on hot plug and unplug of mdev and pass
  through devices.

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- Linux VM hot plug and unplug vGPU device while GPU pass through device
  exist
- Linux VM hot plug and unplug GPU pass through device while vGPU device
  exist

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
---
 drivers/vfio/vfio.c             | 117 +++++++
 drivers/vfio/vfio_iommu_type1.c | 685 ++++++++++++++++++++++++++++++++++------
 include/linux/vfio.h            |  13 +-
 3 files changed, 724 insertions(+), 91 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6fd6fa5469de..e3e342861e04 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1782,6 +1782,123 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
 }
 EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
 
+static struct vfio_group *vfio_group_from_dev(struct device *dev)
+{
+	struct vfio_device *device;
+	struct vfio_group *group;
+	int ret;
+
+	device = vfio_device_get_from_dev(dev);
+	if (!device)
+		return ERR_PTR(-EINVAL);
+
+	group = device->group;
+	if (!atomic_inc_not_zero(&group->container_users)) {
+		ret = -EINVAL;
+		goto err_ret;
+	}
+
+	if (group->noiommu) {
+		atomic_dec(&group->container_users);
+		ret = -EPERM;
+		goto err_ret;
+	}
+
+	if (!group->container->iommu_driver ||
+	    !vfio_group_viable(group)) {
+		atomic_dec(&group->container_users);
+		ret = -EINVAL;
+		goto err_ret;
+	}
+
+	vfio_device_put(device);
+	return group;
+
+err_ret:
+	vfio_device_put(device);
+	return ERR_PTR(ret);
+}
+
+/*
+ * Pin a set of guest PFNs and return their associated host PFNs for local
+ * domain only.
+ * @dev [in] : device
+ * @user_pfn [in]: array of user/guest PFNs
+ * @npage [in]: count of array elements
+ * @prot [in] : protection flags
+ * @phys_pfn[out] : array of host PFNs
+ */
+long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+		    long npage, int prot, unsigned long *phys_pfn)
+{
+	struct vfio_container *container;
+	struct vfio_group *group;
+	struct vfio_iommu_driver *driver;
+	ssize_t ret = -EINVAL;
+
+	if (!dev || !user_pfn || !phys_pfn)
+		return -EINVAL;
+
+	group = vfio_group_from_dev(dev);
+	if (IS_ERR(group))
+		return PTR_ERR(group);
+
+	container = group->container;
+	if (IS_ERR(container))
+		return PTR_ERR(container);
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->pin_pages))
+		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
+					     npage, prot, phys_pfn);
+
+	up_read(&container->group_lock);
+	vfio_group_try_dissolve_container(group);
+
+	return ret;
+}
+EXPORT_SYMBOL(vfio_pin_pages);
+
+/*
+ * Unpin set of host PFNs for local domain only.
+ * @dev [in] : device
+ * @pfn [in] : array of host PFNs to be unpinned.
+ * @npage [in] : count of array elements, that is, the number of pages.
+ */
+long vfio_unpin_pages(struct device *dev, unsigned long *pfn, long npage)
+{
+	struct vfio_container *container;
+	struct vfio_group *group;
+	struct vfio_iommu_driver *driver;
+	ssize_t ret = -EINVAL;
+
+	if (!dev || !pfn)
+		return -EINVAL;
+
+	group = vfio_group_from_dev(dev);
+	if (IS_ERR(group))
+		return PTR_ERR(group);
+
+	container = group->container;
+	if (IS_ERR(container))
+		return PTR_ERR(container);
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->unpin_pages))
+		ret = driver->ops->unpin_pages(container->iommu_data, pfn,
+					       npage);
+
+	up_read(&container->group_lock);
+	vfio_group_try_dissolve_container(group);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_unpin_pages);
+
 /**
  * Module/class support
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ba19424e4a1..ce6d6dcbd9a8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -55,18 +55,26 @@ MODULE_PARM_DESC(disable_hugepages,
 
 struct vfio_iommu {
 	struct list_head	domain_list;
+	struct vfio_domain	*local_domain;
 	struct mutex		lock;
 	struct rb_root		dma_list;
 	bool			v2;
 	bool			nesting;
 };
 
+struct local_addr_space {
+	struct task_struct	*task;
+	struct rb_root		pfn_list;	/* pinned Host pfn list */
+	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
+};
+
 struct vfio_domain {
 	struct iommu_domain	*domain;
 	struct list_head	next;
 	struct list_head	group_list;
 	int			prot;		/* IOMMU_CACHE */
 	bool			fgsp;		/* Fine-grained super pages */
+	struct local_addr_space	*local_addr_space;
 };
 
 struct vfio_dma {
@@ -75,6 +83,7 @@ struct vfio_dma {
 	unsigned long		vaddr;		/* Process virtual addr */
 	size_t			size;		/* Map size (bytes) */
 	int			prot;		/* IOMMU_READ/WRITE */
+	bool			iommu_mapped;
 };
 
 struct vfio_group {
@@ -83,6 +92,22 @@ struct vfio_group {
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_pfn {
+	struct rb_node		node;
+	unsigned long		vaddr;		/* virtual addr */
+	dma_addr_t		iova;		/* IOVA */
+	unsigned long		pfn;		/* Host pfn */
+	size_t			prot;
+	atomic_t		ref_count;
+};
+
+#define IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)	\
+			(!list_empty(&(iommu)->domain_list))
+
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +155,84 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
 	rb_erase(&old->node, &iommu->dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
+				      unsigned long pfn)
+{
+	struct rb_node *node;
+	struct vfio_pfn *vpfn, *ret = NULL;
+
+	node = domain->local_addr_space->pfn_list.rb_node;
+
+	while (node) {
+		vpfn = rb_entry(node, struct vfio_pfn, node);
+
+		if (pfn < vpfn->pfn)
+			node = node->rb_left;
+		else if (pfn > vpfn->pfn)
+			node = node->rb_right;
+		else {
+			ret = vpfn;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
+{
+	struct rb_node **link, *parent = NULL;
+	struct vfio_pfn *vpfn;
+
+	link = &domain->local_addr_space->pfn_list.rb_node;
+	while (*link) {
+		parent = *link;
+		vpfn = rb_entry(parent, struct vfio_pfn, node);
+
+		if (new->pfn < vpfn->pfn)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &domain->local_addr_space->pfn_list);
+}
+
+static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
+{
+	rb_erase(&old->node, &domain->local_addr_space->pfn_list);
+}
+
+static int vfio_add_to_pfn_list(struct vfio_domain *domain, unsigned long vaddr,
+				dma_addr_t iova, unsigned long pfn, size_t prot)
+{
+	struct vfio_pfn *vpfn;
+
+	vpfn = kzalloc(sizeof(*vpfn), GFP_KERNEL);
+	if (!vpfn)
+		return -ENOMEM;
+
+	vpfn->vaddr = vaddr;
+	vpfn->iova = iova;
+	vpfn->pfn = pfn;
+	vpfn->prot = prot;
+	atomic_set(&vpfn->ref_count, 1);
+	vfio_link_pfn(domain, vpfn);
+	return 0;
+}
+
+static void vfio_remove_from_pfn_list(struct vfio_domain *domain,
+				      struct vfio_pfn *vpfn)
+{
+	vfio_unlink_pfn(domain, vpfn);
+	kfree(vpfn);
+}
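
The helpers above keep one reference-counted node per pinned host PFN: pinning an already-tracked PFN bumps `ref_count`, and a node is removed only when the count drops to zero. That bookkeeping can be sketched in plain userspace C (using a sorted singly-linked list in place of the kernel rb-tree; all names here are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative stand-in for struct vfio_pfn: one node per pinned host PFN. */
struct demo_pfn {
	unsigned long pfn;
	int ref_count;
	struct demo_pfn *next;	/* kept sorted by pfn, ascending */
};

static struct demo_pfn *demo_find(struct demo_pfn *head, unsigned long pfn)
{
	/* The list is sorted, so stop once entries exceed the target pfn. */
	for (; head && head->pfn <= pfn; head = head->next)
		if (head->pfn == pfn)
			return head;
	return NULL;
}

/* Pin: bump refcount if already tracked, otherwise insert with count 1. */
static void demo_pin(struct demo_pfn **head, unsigned long pfn)
{
	struct demo_pfn *p = demo_find(*head, pfn), *n, **link;

	if (p) {
		p->ref_count++;
		return;
	}
	n = calloc(1, sizeof(*n));	/* demo only: no allocation-failure check */
	n->pfn = pfn;
	n->ref_count = 1;
	for (link = head; *link && (*link)->pfn < pfn; link = &(*link)->next)
		;
	n->next = *link;
	*link = n;
}

/* Unpin: drop the refcount, free the node when it reaches zero. */
static void demo_unpin(struct demo_pfn **head, unsigned long pfn)
{
	struct demo_pfn **link, *p;

	for (link = head; *link; link = &(*link)->next)
		if ((*link)->pfn == pfn)
			break;
	p = *link;
	if (!p)
		return;
	if (--p->ref_count == 0) {
		*link = p->next;
		free(p);
	}
}
```

The kernel code uses an rb-tree for O(log n) lookup under `pfn_list_lock`; the list above only mirrors the pin/unpin reference-count semantics.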
+
 struct vwork {
 	struct mm_struct	*mm;
 	long			npage;
@@ -150,17 +253,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
 	kfree(vwork);
 }
 
-static void vfio_lock_acct(long npage)
+static void vfio_lock_acct(struct task_struct *task, long npage)
 {
 	struct vwork *vwork;
 	struct mm_struct *mm;
 
-	if (!current->mm || !npage)
+	if (!task->mm || !npage)
 		return; /* process exited or nothing to do */
 
-	if (down_write_trylock(&current->mm->mmap_sem)) {
-		current->mm->locked_vm += npage;
-		up_write(&current->mm->mmap_sem);
+	if (down_write_trylock(&task->mm->mmap_sem)) {
+		task->mm->locked_vm += npage;
+		up_write(&task->mm->mmap_sem);
 		return;
 	}
 
@@ -172,7 +275,7 @@ static void vfio_lock_acct(long npage)
 	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
 	if (!vwork)
 		return;
-	mm = get_task_mm(current);
+	mm = get_task_mm(task);
 	if (!mm) {
 		kfree(vwork);
 		return;
@@ -228,20 +331,31 @@ static int put_pfn(unsigned long pfn, int prot)
 	return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+			 int prot, unsigned long *pfn)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct mm_struct *local_mm = (mm ? mm : current->mm);
 	int ret = -EFAULT;
 
-	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+	if (mm) {
+		down_read(&local_mm->mmap_sem);
+		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
+					!!(prot & IOMMU_WRITE), 0, page, NULL);
+		up_read(&local_mm->mmap_sem);
+	} else
+		ret = get_user_pages_fast(vaddr, 1,
+					  !!(prot & IOMMU_WRITE), page);
+
+	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
 		return 0;
 	}
 
-	down_read(&current->mm->mmap_sem);
+	down_read(&local_mm->mmap_sem);
 
-	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
 		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
@@ -249,7 +363,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
 			ret = 0;
 	}
 
-	up_read(&current->mm->mmap_sem);
+	up_read(&local_mm->mmap_sem);
 
 	return ret;
 }
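
For a `VM_PFNMAP` VMA, the host PFN cannot come from `get_user_pages`; it is computed arithmetically by adding the page offset of `vaddr` within the VMA to the VMA's base PFN (`vm_pgoff`). A small userspace check of that arithmetic (the constants are illustrative):

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12	/* 4 KiB pages, as on x86 */

/* Mirrors the VM_PFNMAP computation in vaddr_get_pfn() above. */
static unsigned long pfnmap_pfn(unsigned long vaddr, unsigned long vm_start,
				unsigned long vm_pgoff)
{
	return ((vaddr - vm_start) >> DEMO_PAGE_SHIFT) + vm_pgoff;
}
```

For example, an address three pages (plus a sub-page offset) into a VMA whose mapping starts at PFN 0x100 resolves to PFN 0x103.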
@@ -259,8 +373,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages(unsigned long vaddr, long npage,
-			   int prot, unsigned long *pfn_base)
+static long __vfio_pin_pages_remote(unsigned long vaddr, long npage,
+				    int prot, unsigned long *pfn_base)
 {
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	bool lock_cap = capable(CAP_IPC_LOCK);
@@ -270,7 +384,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	if (!current->mm)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
+	ret = vaddr_get_pfn(NULL, vaddr, prot, pfn_base);
 	if (ret)
 		return ret;
 
@@ -285,7 +399,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 
 	if (unlikely(disable_hugepages)) {
 		if (!rsvd)
-			vfio_lock_acct(1);
+			vfio_lock_acct(current, 1);
 		return 1;
 	}
 
@@ -293,7 +407,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
 		unsigned long pfn = 0;
 
-		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		ret = vaddr_get_pfn(NULL, vaddr, prot, &pfn);
 		if (ret)
 			break;
 
@@ -313,13 +427,13 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	}
 
 	if (!rsvd)
-		vfio_lock_acct(i);
+		vfio_lock_acct(current, i);
 
 	return i;
 }
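
`__vfio_pin_pages_remote()` returns the length of the longest run of consecutive host PFNs starting at `vaddr`, so its caller can map one contiguous chunk per call. The run detection can be sketched in userspace against a precomputed per-page PFN table (purely illustrative; assumes `npage >= 1`):

```c
#include <assert.h>

/*
 * Given pfns[i] for page i of a request, return how many leading pages
 * form one physically contiguous run -- the shape of the scanning loop
 * in __vfio_pin_pages_remote() above.
 */
static long contiguous_run(const unsigned long *pfns, long npage)
{
	long i;

	for (i = 1; i < npage; i++)
		if (pfns[i] != pfns[0] + i)
			break;
	return i;
}
```

A request covering PFNs {100, 101, 102, 200, 201} would thus be mapped as a 3-page chunk followed by a 2-page chunk.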
 
-static long vfio_unpin_pages(unsigned long pfn, long npage,
-			     int prot, bool do_accounting)
+static long __vfio_unpin_pages_remote(unsigned long pfn, long npage, int prot,
+				      bool do_accounting)
 {
 	unsigned long unlocked = 0;
 	long i;
@@ -328,7 +442,188 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
 		unlocked += put_pfn(pfn++, prot);
 
 	if (do_accounting)
-		vfio_lock_acct(-unlocked);
+		vfio_lock_acct(current, -unlocked);
+	return unlocked;
+}
+
+static long __vfio_pin_pages_local(struct vfio_domain *domain,
+				   unsigned long vaddr, int prot,
+				   unsigned long *pfn_base,
+				   bool do_accounting)
+{
+	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	bool lock_cap = capable(CAP_IPC_LOCK);
+	long ret;
+	bool rsvd;
+	struct task_struct *task = domain->local_addr_space->task;
+
+	if (!task->mm)
+		return -ENODEV;
+
+	ret = vaddr_get_pfn(task->mm, vaddr, prot, pfn_base);
+	if (ret)
+		return ret;
+
+	rsvd = is_invalid_reserved_pfn(*pfn_base);
+
+	if (!rsvd && !lock_cap && task->mm->locked_vm + 1 > limit) {
+		put_pfn(*pfn_base, prot);
+		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
+			limit << PAGE_SHIFT);
+		return -ENOMEM;
+	}
+
+	if (!rsvd && do_accounting)
+		vfio_lock_acct(task, 1);
+
+	return 1;
+}
+
+static void __vfio_unpin_pages_local(struct vfio_domain *domain,
+				     unsigned long pfn, int prot,
+				     bool do_accounting)
+{
+	put_pfn(pfn, prot);
+
+	if (do_accounting)
+		vfio_lock_acct(domain->local_addr_space->task, -1);
+}
+
+static int vfio_unpin_pfn(struct vfio_domain *domain,
+			  struct vfio_pfn *vpfn, bool do_accounting)
+{
+	__vfio_unpin_pages_local(domain, vpfn->pfn, vpfn->prot,
+				 do_accounting);
+
+	if (atomic_dec_and_test(&vpfn->ref_count))
+		vfio_remove_from_pfn_list(domain, vpfn);
+
+	return 1;
+}
+
+static long vfio_iommu_type1_pin_pages(void *iommu_data,
+				       unsigned long *user_pfn,
+				       long npage, int prot,
+				       unsigned long *phys_pfn)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain;
+	int i, j, ret;
+	long retpage;
+	unsigned long remote_vaddr;
+	unsigned long *pfn = phys_pfn;
+	struct vfio_dma *dma;
+	bool do_accounting = false;
+
+	if (!iommu || !user_pfn || !phys_pfn)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (!iommu->local_domain) {
+		ret = -EINVAL;
+		goto pin_done;
+	}
+
+	domain = iommu->local_domain;
+
+	/*
+	 * If an iommu capable domain exists in the container then all pages
+	 * are already pinned and accounted. Accounting should only be done
+	 * if there is no iommu capable domain in the container.
+	 */
+	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+		dma_addr_t iova;
+
+		iova = user_pfn[i] << PAGE_SHIFT;
+
+		dma = vfio_find_dma(iommu, iova, 0);
+		if (!dma) {
+			ret = -EINVAL;
+			goto pin_unwind;
+		}
+
+		remote_vaddr = dma->vaddr + iova - dma->iova;
+
+		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
+						 &pfn[i], do_accounting);
+		if (retpage <= 0) {
+			WARN_ON(!retpage);
+			ret = (int)retpage;
+			goto pin_unwind;
+		}
+
+		mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+		/* search if pfn already exists */
+		p = vfio_find_pfn(domain, pfn[i]);
+		if (p) {
+			atomic_inc(&p->ref_count);
+			mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+			continue;
+		}
+
+		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
+					   pfn[i], prot);
+		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+
+		if (ret) {
+			__vfio_unpin_pages_local(domain, pfn[i], prot,
+						 do_accounting);
+			goto pin_unwind;
+		}
+	}
+
+	ret = i;
+	goto pin_done;
+
+pin_unwind:
+	pfn[i] = 0;
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+	for (j = 0; j < i; j++) {
+		struct vfio_pfn *p;
+
+		p = vfio_find_pfn(domain, pfn[j]);
+		if (p)
+			vfio_unpin_pfn(domain, p, do_accounting);
+
+		pfn[j] = 0;
+	}
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+
+pin_done:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
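
The `pin_unwind` path above rolls back every page pinned before the failing one, so a partial failure leaves no stray pin references behind. A userspace sketch of that all-or-nothing pattern (the `fail_at` parameter simulates a pin failure; names are illustrative only):

```c
#include <assert.h>

/* Per-page pin state standing in for the refcounted pfn list. */
static int pinned[8];

static int demo_pin_one(int i, int fail_at)
{
	if (i == fail_at)
		return -1;	/* simulated pin failure */
	pinned[i] = 1;
	return 0;
}

/*
 * Pin npage pages; on any failure, unpin pages 0..i-1 and report the
 * error -- mirroring the pin_unwind label above.
 */
static int demo_pin_all(int npage, int fail_at)
{
	int i, j;

	for (i = 0; i < npage; i++) {
		if (demo_pin_one(i, fail_at) < 0)
			goto unwind;
	}
	return npage;

unwind:
	for (j = 0; j < i; j++)
		pinned[j] = 0;
	return -1;
}
```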
+
+static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
+					 long npage)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	long unlocked = 0;
+	int i;
+
+	if (!iommu || !pfn)
+		return -EINVAL;
+
+	domain = iommu->local_domain;
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+
+		mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+		/* verify that pfn exists in pfn_list */
+		p = vfio_find_pfn(domain, pfn[i]);
+		if (p)
+			unlocked += vfio_unpin_pfn(domain, p, true);
+
+		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+	}
 
 	return unlocked;
 }
@@ -341,6 +636,12 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 
 	if (!dma->size)
 		return;
+
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		return;
+
+	if (!dma->iommu_mapped)
+		return;
 	/*
 	 * We use the IOMMU to track the physical addresses, otherwise we'd
 	 * need a much more complicated tracking system.  Unfortunately that
@@ -382,15 +683,16 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		if (WARN_ON(!unmapped))
 			break;
 
-		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
-					     unmapped >> PAGE_SHIFT,
-					     dma->prot, false);
+		unlocked += __vfio_unpin_pages_remote(phys >> PAGE_SHIFT,
+						      unmapped >> PAGE_SHIFT,
+						      dma->prot, false);
 		iova += unmapped;
 
 		cond_resched();
 	}
 
-	vfio_lock_acct(-unlocked);
+	dma->iommu_mapped = false;
+	vfio_lock_acct(current, -unlocked);
 }
 
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
@@ -558,17 +860,85 @@ unwind:
 	return ret;
 }
 
+void vfio_update_accounting(struct vfio_iommu *iommu, struct vfio_dma *dma)
+{
+	struct vfio_domain *domain = iommu->local_domain;
+	struct rb_node *n;
+	long locked = 0;
+
+	if (!iommu->local_domain)
+		return;
+
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+	n = rb_first(&domain->local_addr_space->pfn_list);
+
+	for (; n; n = rb_next(n)) {
+		struct vfio_pfn *vpfn;
+
+		vpfn = rb_entry(n, struct vfio_pfn, node);
+
+		if ((vpfn->iova >= dma->iova) &&
+		    (vpfn->iova < dma->iova + dma->size))
+			locked++;
+	}
+	vfio_lock_acct(current, -locked);
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+}
+
+static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
+			    size_t map_size)
+{
+	dma_addr_t iova = dma->iova;
+	unsigned long vaddr = dma->vaddr;
+	size_t size = map_size, dma_size = 0;
+	long npage;
+	unsigned long pfn;
+	int ret = 0;
+
+	while (size) {
+		/* Pin a contiguous chunk of memory */
+		npage = __vfio_pin_pages_remote(vaddr + dma_size,
+						size >> PAGE_SHIFT, dma->prot,
+						&pfn);
+		if (npage <= 0) {
+			WARN_ON(!npage);
+			ret = (int)npage;
+			break;
+		}
+
+		/* Map it! */
+		ret = vfio_iommu_map(iommu, iova + dma_size, pfn, npage,
+				     dma->prot);
+		if (ret) {
+			__vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
+			break;
+		}
+
+		size -= npage << PAGE_SHIFT;
+		dma_size += npage << PAGE_SHIFT;
+	}
+
+	if (ret)
+		vfio_remove_dma(iommu, dma);
+	else {
+		dma->size = dma_size;
+		dma->iommu_mapped = true;
+		vfio_update_accounting(iommu, dma);
+	}
+
+	return ret;
+}
+
 static int vfio_dma_do_map(struct vfio_iommu *iommu,
 			   struct vfio_iommu_type1_dma_map *map)
 {
 	dma_addr_t iova = map->iova;
 	unsigned long vaddr = map->vaddr;
 	size_t size = map->size;
-	long npage;
 	int ret = 0, prot = 0;
 	uint64_t mask;
 	struct vfio_dma *dma;
-	unsigned long pfn;
 
 	/* Verify that none of our __u64 fields overflow */
 	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
@@ -611,29 +981,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	/* Insert zero-sized and grow as we map chunks of it */
 	vfio_link_dma(iommu, dma);
 
-	while (size) {
-		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages(vaddr + dma->size,
-				       size >> PAGE_SHIFT, prot, &pfn);
-		if (npage <= 0) {
-			WARN_ON(!npage);
-			ret = (int)npage;
-			break;
-		}
-
-		/* Map it! */
-		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
-		if (ret) {
-			vfio_unpin_pages(pfn, npage, prot, true);
-			break;
-		}
-
-		size -= npage << PAGE_SHIFT;
-		dma->size += npage << PAGE_SHIFT;
-	}
-
-	if (ret)
-		vfio_remove_dma(iommu, dma);
+	/* Don't pin and map if container doesn't contain IOMMU capable domain */
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		dma->size = size;
+	else
+		ret = vfio_pin_map_dma(iommu, dma, size);
 
 	mutex_unlock(&iommu->lock);
 	return ret;
@@ -662,10 +1014,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
 	n = rb_first(&iommu->dma_list);
 
-	/* If there's not a domain, there better not be any mappings */
-	if (WARN_ON(n && !d))
-		return -EINVAL;
-
 	for (; n; n = rb_next(n)) {
 		struct vfio_dma *dma;
 		dma_addr_t iova;
@@ -674,20 +1022,43 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 		iova = dma->iova;
 
 		while (iova < dma->iova + dma->size) {
-			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
+			phys_addr_t phys;
 			size_t size;
 
-			if (WARN_ON(!phys)) {
-				iova += PAGE_SIZE;
-				continue;
-			}
+			if (dma->iommu_mapped) {
+				phys = iommu_iova_to_phys(d->domain, iova);
+
+				if (WARN_ON(!phys)) {
+					iova += PAGE_SIZE;
+					continue;
+				}
 
-			size = PAGE_SIZE;
+				size = PAGE_SIZE;
 
-			while (iova + size < dma->iova + dma->size &&
-			       phys + size == iommu_iova_to_phys(d->domain,
+				while (iova + size < dma->iova + dma->size &&
+				    phys + size == iommu_iova_to_phys(d->domain,
 								 iova + size))
-				size += PAGE_SIZE;
+					size += PAGE_SIZE;
+			} else {
+				unsigned long pfn;
+				unsigned long vaddr = dma->vaddr +
+						     (iova - dma->iova);
+				size_t n = dma->iova + dma->size - iova;
+				long npage;
+
+				npage = __vfio_pin_pages_remote(vaddr,
+								n >> PAGE_SHIFT,
+								dma->prot,
+								&pfn);
+				if (npage <= 0) {
+					WARN_ON(!npage);
+					ret = (int)npage;
+					return ret;
+				}
+
+				phys = pfn << PAGE_SHIFT;
+				size = npage << PAGE_SHIFT;
+			}
 
 			ret = iommu_map(domain->domain, iova, phys,
 					size, dma->prot | domain->prot);
@@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 
 			iova += size;
 		}
+
+		if (!dma->iommu_mapped) {
+			dma->iommu_mapped = true;
+			vfio_update_accounting(iommu, dma);
+		}
 	}
 
 	return 0;
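
In the replay loop above, adjacent IOVA pages whose physical addresses are also adjacent are merged into a single `iommu_map()` call. The merging logic can be checked in userspace with a toy iova-to-phys lookup (illustrative only; a table index of page number, value 0 meaning unmapped):

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE 4096UL

/* Toy iova->phys table: index is (iova / DEMO_PAGE). */
static unsigned long demo_iova_to_phys(const unsigned long *tbl, size_t n,
				       unsigned long iova)
{
	size_t idx = iova / DEMO_PAGE;

	return idx < n ? tbl[idx] : 0;
}

/*
 * Starting at iova, grow `size` while the next page translates to the
 * next physical page -- the merging loop in vfio_iommu_replay() above.
 */
static unsigned long demo_merge(const unsigned long *tbl, size_t n,
				unsigned long iova, unsigned long end)
{
	unsigned long phys = demo_iova_to_phys(tbl, n, iova);
	unsigned long size = DEMO_PAGE;

	while (iova + size < end &&
	       phys + size == demo_iova_to_phys(tbl, n, iova + size))
		size += DEMO_PAGE;
	return size;
}
```

With three physically contiguous pages followed by a discontiguous one, the first map call covers three pages in one shot instead of three separate single-page maps.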
@@ -734,11 +1110,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	__free_pages(pages, order);
 }
 
+static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
+				   struct iommu_group *iommu_group)
+{
+	struct vfio_group *g;
+
+	list_for_each_entry(g, &domain->group_list, next) {
+		if (g->iommu_group == iommu_group)
+			return g;
+	}
+
+	return NULL;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
-	struct vfio_group *group, *g;
+	struct vfio_group *group;
 	struct vfio_domain *domain, *d;
 	struct bus_type *bus = NULL;
 	int ret;
@@ -746,10 +1135,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	mutex_lock(&iommu->lock);
 
 	list_for_each_entry(d, &iommu->domain_list, next) {
-		list_for_each_entry(g, &d->group_list, next) {
-			if (g->iommu_group != iommu_group)
-				continue;
+		if (find_iommu_group(d, iommu_group)) {
+			mutex_unlock(&iommu->lock);
+			return -EINVAL;
+		}
+	}
 
+	if (iommu->local_domain) {
+		if (find_iommu_group(iommu->local_domain, iommu_group)) {
 			mutex_unlock(&iommu->lock);
 			return -EINVAL;
 		}
@@ -769,6 +1162,34 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_free;
 
+	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
+	    (bus == &mdev_bus_type)) {
+		if (iommu->local_domain) {
+			list_add(&group->next,
+				 &iommu->local_domain->group_list);
+			kfree(domain);
+			mutex_unlock(&iommu->lock);
+			return 0;
+		}
+
+		domain->local_addr_space =
+				      kzalloc(sizeof(*domain->local_addr_space),
+					      GFP_KERNEL);
+		if (!domain->local_addr_space) {
+			ret = -ENOMEM;
+			goto out_free;
+		}
+
+		domain->local_addr_space->task = current;
+		INIT_LIST_HEAD(&domain->group_list);
+		list_add(&group->next, &domain->group_list);
+		domain->local_addr_space->pfn_list = RB_ROOT;
+		mutex_init(&domain->local_addr_space->pfn_list_lock);
+		iommu->local_domain = domain;
+		mutex_unlock(&iommu->lock);
+		return 0;
+	}
+
 	domain->domain = iommu_domain_alloc(bus);
 	if (!domain->domain) {
 		ret = -EIO;
@@ -859,6 +1280,41 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
 		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
 
+static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *domain = iommu->local_domain;
+	struct vfio_dma *dma, *tdma;
+	struct rb_node *n;
+	long locked = 0;
+
+	rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
+					     node) {
+		vfio_unmap_unpin(iommu, dma);
+	}
+
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+	n = rb_first(&domain->local_addr_space->pfn_list);
+
+	for (; n; n = rb_next(n))
+		locked++;
+
+	vfio_lock_acct(domain->local_addr_space->task, locked);
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+}
+
+static void vfio_local_unpin_all(struct vfio_domain *domain)
+{
+	struct rb_node *node;
+
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+	while ((node = rb_first(&domain->local_addr_space->pfn_list)))
+		vfio_unpin_pfn(domain,
+				rb_entry(node, struct vfio_pfn, node), false);
+
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -868,31 +1324,57 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(domain, &iommu->domain_list, next) {
-		list_for_each_entry(group, &domain->group_list, next) {
-			if (group->iommu_group != iommu_group)
-				continue;
-
-			iommu_detach_group(domain->domain, iommu_group);
+	if (iommu->local_domain) {
+		domain = iommu->local_domain;
+		group = find_iommu_group(domain, iommu_group);
+		if (group) {
 			list_del(&group->next);
 			kfree(group);
-			/*
-			 * Group ownership provides privilege, if the group
-			 * list is empty, the domain goes away.  If it's the
-			 * last domain, then all the mappings go away too.
-			 */
+
 			if (list_empty(&domain->group_list)) {
-				if (list_is_singular(&iommu->domain_list))
+				vfio_local_unpin_all(domain);
+				if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
 					vfio_iommu_unmap_unpin_all(iommu);
-				iommu_domain_free(domain->domain);
-				list_del(&domain->next);
 				kfree(domain);
+				iommu->local_domain = NULL;
+			}
+			goto detach_group_done;
+		}
+	}
+
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		goto detach_group_done;
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		group = find_iommu_group(domain, iommu_group);
+		if (!group)
+			continue;
+
+		iommu_detach_group(domain->domain, iommu_group);
+		list_del(&group->next);
+		kfree(group);
+		/*
+		 * Group ownership provides privilege; if the group list is
+		 * empty, the domain goes away. If it's the last iommu-backed
+		 * domain and no local domain exists, then all the mappings go
+		 * away too. If it's the last iommu-backed domain and a local
+		 * domain exists, update accounting instead.
+		 */
+		if (list_empty(&domain->group_list)) {
+			if (list_is_singular(&iommu->domain_list)) {
+				if (!iommu->local_domain)
+					vfio_iommu_unmap_unpin_all(iommu);
+				else
+					vfio_iommu_unmap_unpin_reaccount(iommu);
 			}
-			goto done;
+			iommu_domain_free(domain->domain);
+			list_del(&domain->next);
+			kfree(domain);
 		}
+		break;
 	}
 
-done:
+detach_group_done:
 	mutex_unlock(&iommu->lock);
 }
 
@@ -924,27 +1406,48 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 	return iommu;
 }
 
+static void vfio_release_domain(struct vfio_domain *domain)
+{
+	struct vfio_group *group, *group_tmp;
+
+	list_for_each_entry_safe(group, group_tmp,
+				 &domain->group_list, next) {
+		if (!domain->local_addr_space)
+			iommu_detach_group(domain->domain, group->iommu_group);
+		list_del(&group->next);
+		kfree(group);
+	}
+
+	if (domain->local_addr_space)
+		vfio_local_unpin_all(domain);
+	else
+		iommu_domain_free(domain->domain);
+}
+
 static void vfio_iommu_type1_release(void *iommu_data)
 {
 	struct vfio_iommu *iommu = iommu_data;
 	struct vfio_domain *domain, *domain_tmp;
-	struct vfio_group *group, *group_tmp;
+
+	if (iommu->local_domain) {
+		vfio_release_domain(iommu->local_domain);
+		kfree(iommu->local_domain);
+		iommu->local_domain = NULL;
+	}
 
 	vfio_iommu_unmap_unpin_all(iommu);
 
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		goto release_exit;
+
 	list_for_each_entry_safe(domain, domain_tmp,
 				 &iommu->domain_list, next) {
-		list_for_each_entry_safe(group, group_tmp,
-					 &domain->group_list, next) {
-			iommu_detach_group(domain->domain, group->iommu_group);
-			list_del(&group->next);
-			kfree(group);
-		}
-		iommu_domain_free(domain->domain);
+		vfio_release_domain(domain);
 		list_del(&domain->next);
 		kfree(domain);
 	}
 
+release_exit:
 	kfree(iommu);
 }
 
@@ -1048,6 +1551,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.ioctl		= vfio_iommu_type1_ioctl,
 	.attach_group	= vfio_iommu_type1_attach_group,
 	.detach_group	= vfio_iommu_type1_detach_group,
+	.pin_pages	= vfio_iommu_type1_pin_pages,
+	.unpin_pages	= vfio_iommu_type1_unpin_pages,
 };
 
 static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b1cd34..0bd25ba6223d 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -17,6 +17,7 @@
 #include <linux/workqueue.h>
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
+#include <linux/mdev.h>
 
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
@@ -75,7 +76,11 @@ struct vfio_iommu_driver_ops {
 					struct iommu_group *group);
 	void		(*detach_group)(void *iommu_data,
 					struct iommu_group *group);
-
+	long		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
+				     long npage, int prot,
+				     unsigned long *phys_pfn);
+	long		(*unpin_pages)(void *iommu_data, unsigned long *pfn,
+				       long npage);
 };
 
 extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
@@ -127,6 +132,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
 }
 #endif /* CONFIG_EEH */
 
+extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+			   long npage, int prot, unsigned long *phys_pfn);
+
+extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
+			     long npage);
+
 /*
  * IRQfd - generic
  */
-- 
2.7.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [Qemu-devel] [PATCH v8 3/6] vfio iommu: Add support for mediated devices
@ 2016-10-10 20:28   ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-10 20:28 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

VFIO IOMMU drivers are designed for devices that are IOMMU capable. A
mediated device only uses IOMMU APIs; the underlying hardware can be
managed by an IOMMU domain.

Aim of this change is:
- To use most of the code of TYPE1 IOMMU driver for mediated devices
- To support direct assigned device and mediated device in single module

Added two new callback functions to struct vfio_iommu_driver_ops. A
backend IOMMU module that supports pinning and unpinning pages for mdev
devices should provide these functions.
Added APIs for pinning and unpinning pages to the VFIO module. These
call back into the backend IOMMU module to actually pin and unpin pages.

This change adds pin and unpin support for mediated devices to the TYPE1
IOMMU backend module. More details:
- When the iommu_group of a mediated device is attached, the task
  structure is cached; it is used later for page pinning and accounting.
- It keeps track of pinned pages for the mediated domain. This data is
  used to verify unpinning requests and to unpin any remaining pages
  while detaching.
- The existing mechanism is used for page accounting. If an iommu
  capable domain exists in the container then all pages are already
  pinned and accounted. Accounting for an mdev device is only done if
  there is no iommu capable domain in the container.
- Page accounting is updated on hot plug and unplug of mdev and pass
  through devices.

Tested by assigning the below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- Linux VM hot plug and unplug of a vGPU device while a GPU pass through
  device exists
- Linux VM hot plug and unplug of a GPU pass through device while a vGPU
  device exists

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
---
 drivers/vfio/vfio.c             | 117 +++++++
 drivers/vfio/vfio_iommu_type1.c | 685 ++++++++++++++++++++++++++++++++++------
 include/linux/vfio.h            |  13 +-
 3 files changed, 724 insertions(+), 91 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6fd6fa5469de..e3e342861e04 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1782,6 +1782,123 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
 }
 EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
 
+static struct vfio_group *vfio_group_from_dev(struct device *dev)
+{
+	struct vfio_device *device;
+	struct vfio_group *group;
+	int ret;
+
+	device = vfio_device_get_from_dev(dev);
+	if (!device)
+		return ERR_PTR(-EINVAL);
+
+	group = device->group;
+	if (!atomic_inc_not_zero(&group->container_users)) {
+		ret = -EINVAL;
+		goto err_ret;
+	}
+
+	if (group->noiommu) {
+		atomic_dec(&group->container_users);
+		ret = -EPERM;
+		goto err_ret;
+	}
+
+	if (!group->container->iommu_driver ||
+	    !vfio_group_viable(group)) {
+		atomic_dec(&group->container_users);
+		ret = -EINVAL;
+		goto err_ret;
+	}
+
+	vfio_device_put(device);
+	return group;
+
+err_ret:
+	vfio_device_put(device);
+	return ERR_PTR(ret);
+}
+
+/*
+ * Pin a set of guest PFNs and return their associated host PFNs for local
+ * domain only.
+ * @dev [in] : device
+ * @user_pfn [in]: array of user/guest PFNs
+ * @npage [in]: count of array elements
+ * @prot [in] : protection flags
+ * @phys_pfn[out] : array of host PFNs
+ */
+long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+		    long npage, int prot, unsigned long *phys_pfn)
+{
+	struct vfio_container *container;
+	struct vfio_group *group;
+	struct vfio_iommu_driver *driver;
+	ssize_t ret = -EINVAL;
+
+	if (!dev || !user_pfn || !phys_pfn)
+		return -EINVAL;
+
+	group = vfio_group_from_dev(dev);
+	if (IS_ERR(group))
+		return PTR_ERR(group);
+
+	container = group->container;
+	if (IS_ERR(container))
+		return PTR_ERR(container);
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->pin_pages))
+		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
+					     npage, prot, phys_pfn);
+
+	up_read(&container->group_lock);
+	vfio_group_try_dissolve_container(group);
+
+	return ret;
+
+}
+EXPORT_SYMBOL(vfio_pin_pages);
+
+/*
+ * Unpin set of host PFNs for local domain only.
+ * @dev [in] : device
+ * @pfn [in] : array of host PFNs to be unpinned.
+ * @npage [in] :count of elements in array, that is number of pages.
+ */
+long vfio_unpin_pages(struct device *dev, unsigned long *pfn, long npage)
+{
+	struct vfio_container *container;
+	struct vfio_group *group;
+	struct vfio_iommu_driver *driver;
+	ssize_t ret = -EINVAL;
+
+	if (!dev || !pfn)
+		return -EINVAL;
+
+	group = vfio_group_from_dev(dev);
+	if (IS_ERR(group))
+		return PTR_ERR(group);
+
+	container = group->container;
+	if (IS_ERR(container))
+		return PTR_ERR(container);
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->unpin_pages))
+		ret = driver->ops->unpin_pages(container->iommu_data, pfn,
+					       npage);
+
+	up_read(&container->group_lock);
+	vfio_group_try_dissolve_container(group);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_unpin_pages);
+
 /**
  * Module/class support
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2ba19424e4a1..ce6d6dcbd9a8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -55,18 +55,26 @@ MODULE_PARM_DESC(disable_hugepages,
 
 struct vfio_iommu {
 	struct list_head	domain_list;
+	struct vfio_domain	*local_domain;
 	struct mutex		lock;
 	struct rb_root		dma_list;
 	bool			v2;
 	bool			nesting;
 };
 
+struct local_addr_space {
+	struct task_struct	*task;
+	struct rb_root		pfn_list;	/* pinned Host pfn list */
+	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
+};
+
 struct vfio_domain {
 	struct iommu_domain	*domain;
 	struct list_head	next;
 	struct list_head	group_list;
 	int			prot;		/* IOMMU_CACHE */
 	bool			fgsp;		/* Fine-grained super pages */
+	struct local_addr_space	*local_addr_space;
 };
 
 struct vfio_dma {
@@ -75,6 +83,7 @@ struct vfio_dma {
 	unsigned long		vaddr;		/* Process virtual addr */
 	size_t			size;		/* Map size (bytes) */
 	int			prot;		/* IOMMU_READ/WRITE */
+	bool			iommu_mapped;
 };
 
 struct vfio_group {
@@ -83,6 +92,22 @@ struct vfio_group {
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_pfn {
+	struct rb_node		node;
+	unsigned long		vaddr;		/* virtual addr */
+	dma_addr_t		iova;		/* IOVA */
+	unsigned long		pfn;		/* Host pfn */
+	size_t			prot;
+	atomic_t		ref_count;
+};
+
+#define IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)	\
+			 (list_empty(&iommu->domain_list) ? false : true)
+
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +155,84 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
 	rb_erase(&old->node, &iommu->dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
+				      unsigned long pfn)
+{
+	struct rb_node *node;
+	struct vfio_pfn *vpfn, *ret = NULL;
+
+	node = domain->local_addr_space->pfn_list.rb_node;
+
+	while (node) {
+		vpfn = rb_entry(node, struct vfio_pfn, node);
+
+		if (pfn < vpfn->pfn)
+			node = node->rb_left;
+		else if (pfn > vpfn->pfn)
+			node = node->rb_right;
+		else {
+			ret = vpfn;
+			break;
+		}
+	}
+
+	return ret;
+}
+
+static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
+{
+	struct rb_node **link, *parent = NULL;
+	struct vfio_pfn *vpfn;
+
+	link = &domain->local_addr_space->pfn_list.rb_node;
+	while (*link) {
+		parent = *link;
+		vpfn = rb_entry(parent, struct vfio_pfn, node);
+
+		if (new->pfn < vpfn->pfn)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &domain->local_addr_space->pfn_list);
+}
+
+static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
+{
+	rb_erase(&old->node, &domain->local_addr_space->pfn_list);
+}
+
+static int vfio_add_to_pfn_list(struct vfio_domain *domain, unsigned long vaddr,
+				dma_addr_t iova, unsigned long pfn, size_t prot)
+{
+	struct vfio_pfn *vpfn;
+
+	vpfn = kzalloc(sizeof(*vpfn), GFP_KERNEL);
+	if (!vpfn)
+		return -ENOMEM;
+
+	vpfn->vaddr = vaddr;
+	vpfn->iova = iova;
+	vpfn->pfn = pfn;
+	vpfn->prot = prot;
+	atomic_set(&vpfn->ref_count, 1);
+	vfio_link_pfn(domain, vpfn);
+	return 0;
+}
+
+static void vfio_remove_from_pfn_list(struct vfio_domain *domain,
+				      struct vfio_pfn *vpfn)
+{
+	vfio_unlink_pfn(domain, vpfn);
+	kfree(vpfn);
+}
+
 struct vwork {
 	struct mm_struct	*mm;
 	long			npage;
@@ -150,17 +253,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
 	kfree(vwork);
 }
 
-static void vfio_lock_acct(long npage)
+static void vfio_lock_acct(struct task_struct *task, long npage)
 {
 	struct vwork *vwork;
 	struct mm_struct *mm;
 
-	if (!current->mm || !npage)
+	if (!task->mm || !npage)
 		return; /* process exited or nothing to do */
 
-	if (down_write_trylock(&current->mm->mmap_sem)) {
-		current->mm->locked_vm += npage;
-		up_write(&current->mm->mmap_sem);
+	if (down_write_trylock(&task->mm->mmap_sem)) {
+		task->mm->locked_vm += npage;
+		up_write(&task->mm->mmap_sem);
 		return;
 	}
 
@@ -172,7 +275,7 @@ static void vfio_lock_acct(long npage)
 	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
 	if (!vwork)
 		return;
-	mm = get_task_mm(current);
+	mm = get_task_mm(task);
 	if (!mm) {
 		kfree(vwork);
 		return;
@@ -228,20 +331,31 @@ static int put_pfn(unsigned long pfn, int prot)
 	return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+			 int prot, unsigned long *pfn)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct mm_struct *local_mm = (mm ? mm : current->mm);
 	int ret = -EFAULT;
 
-	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+	if (mm) {
+		down_read(&local_mm->mmap_sem);
+		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
+					!!(prot & IOMMU_WRITE), 0, page, NULL);
+		up_read(&local_mm->mmap_sem);
+	} else
+		ret = get_user_pages_fast(vaddr, 1,
+					  !!(prot & IOMMU_WRITE), page);
+
+	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
 		return 0;
 	}
 
-	down_read(&current->mm->mmap_sem);
+	down_read(&local_mm->mmap_sem);
 
-	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
 		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
@@ -249,7 +363,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
 			ret = 0;
 	}
 
-	up_read(&current->mm->mmap_sem);
+	up_read(&local_mm->mmap_sem);
 
 	return ret;
 }
@@ -259,8 +373,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages(unsigned long vaddr, long npage,
-			   int prot, unsigned long *pfn_base)
+static long __vfio_pin_pages_remote(unsigned long vaddr, long npage,
+				    int prot, unsigned long *pfn_base)
 {
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	bool lock_cap = capable(CAP_IPC_LOCK);
@@ -270,7 +384,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	if (!current->mm)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
+	ret = vaddr_get_pfn(NULL, vaddr, prot, pfn_base);
 	if (ret)
 		return ret;
 
@@ -285,7 +399,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 
 	if (unlikely(disable_hugepages)) {
 		if (!rsvd)
-			vfio_lock_acct(1);
+			vfio_lock_acct(current, 1);
 		return 1;
 	}
 
@@ -293,7 +407,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
 		unsigned long pfn = 0;
 
-		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		ret = vaddr_get_pfn(NULL, vaddr, prot, &pfn);
 		if (ret)
 			break;
 
@@ -313,13 +427,13 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	}
 
 	if (!rsvd)
-		vfio_lock_acct(i);
+		vfio_lock_acct(current, i);
 
 	return i;
 }
 
-static long vfio_unpin_pages(unsigned long pfn, long npage,
-			     int prot, bool do_accounting)
+static long __vfio_unpin_pages_remote(unsigned long pfn, long npage, int prot,
+				      bool do_accounting)
 {
 	unsigned long unlocked = 0;
 	long i;
@@ -328,7 +442,188 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
 		unlocked += put_pfn(pfn++, prot);
 
 	if (do_accounting)
-		vfio_lock_acct(-unlocked);
+		vfio_lock_acct(current, -unlocked);
+	return unlocked;
+}
+
+static long __vfio_pin_pages_local(struct vfio_domain *domain,
+				   unsigned long vaddr, int prot,
+				   unsigned long *pfn_base,
+				   bool do_accounting)
+{
+	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	bool lock_cap = capable(CAP_IPC_LOCK);
+	long ret;
+	bool rsvd;
+	struct task_struct *task = domain->local_addr_space->task;
+
+	if (!task->mm)
+		return -ENODEV;
+
+	ret = vaddr_get_pfn(task->mm, vaddr, prot, pfn_base);
+	if (ret)
+		return ret;
+
+	rsvd = is_invalid_reserved_pfn(*pfn_base);
+
+	if (!rsvd && !lock_cap && task->mm->locked_vm + 1 > limit) {
+		put_pfn(*pfn_base, prot);
+		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
+			limit << PAGE_SHIFT);
+		return -ENOMEM;
+	}
+
+	if (!rsvd && do_accounting)
+		vfio_lock_acct(task, 1);
+
+	return 1;
+}
+
+static void __vfio_unpin_pages_local(struct vfio_domain *domain,
+				     unsigned long pfn, int prot,
+				     bool do_accounting)
+{
+	put_pfn(pfn, prot);
+
+	if (do_accounting)
+		vfio_lock_acct(domain->local_addr_space->task, -1);
+}
+
+static int vfio_unpin_pfn(struct vfio_domain *domain,
+			  struct vfio_pfn *vpfn, bool do_accounting)
+{
+	__vfio_unpin_pages_local(domain, vpfn->pfn, vpfn->prot,
+				 do_accounting);
+
+	if (atomic_dec_and_test(&vpfn->ref_count))
+		vfio_remove_from_pfn_list(domain, vpfn);
+
+	return 1;
+}
+
+static long vfio_iommu_type1_pin_pages(void *iommu_data,
+				       unsigned long *user_pfn,
+				       long npage, int prot,
+				       unsigned long *phys_pfn)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain;
+	int i, j, ret;
+	long retpage;
+	unsigned long remote_vaddr;
+	unsigned long *pfn = phys_pfn;
+	struct vfio_dma *dma;
+	bool do_accounting = false;
+
+	if (!iommu || !user_pfn || !phys_pfn)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (!iommu->local_domain) {
+		ret = -EINVAL;
+		goto pin_done;
+	}
+
+	domain = iommu->local_domain;
+
+	/*
+	 * If an iommu capable domain exists in the container, then all pages
+	 * are already pinned and accounted. Accounting should be done only if
+	 * there is no iommu capable domain in the container.
+	 */
+	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+		dma_addr_t iova;
+
+		iova = user_pfn[i] << PAGE_SHIFT;
+
+		dma = vfio_find_dma(iommu, iova, 0);
+		if (!dma) {
+			ret = -EINVAL;
+			goto pin_unwind;
+		}
+
+		remote_vaddr = dma->vaddr + iova - dma->iova;
+
+		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
+						 &pfn[i], do_accounting);
+		if (retpage <= 0) {
+			WARN_ON(!retpage);
+			ret = (int)retpage;
+			goto pin_unwind;
+		}
+
+		mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+		/* search whether the pfn already exists */
+		p = vfio_find_pfn(domain, pfn[i]);
+		if (p) {
+			atomic_inc(&p->ref_count);
+			mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+			continue;
+		}
+
+		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
+					   pfn[i], prot);
+		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+
+		if (ret) {
+			__vfio_unpin_pages_local(domain, pfn[i], prot,
+						 do_accounting);
+			goto pin_unwind;
+		}
+	}
+
+	ret = i;
+	goto pin_done;
+
+pin_unwind:
+	pfn[i] = 0;
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+	for (j = 0; j < i; j++) {
+		struct vfio_pfn *p;
+
+		p = vfio_find_pfn(domain, pfn[j]);
+		if (p)
+			vfio_unpin_pfn(domain, p, do_accounting);
+
+		pfn[j] = 0;
+	}
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+
+pin_done:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
+					 long npage)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	long unlocked = 0;
+	int i;
+
+	if (!iommu || !pfn)
+		return -EINVAL;
+
+	domain = iommu->local_domain;
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+
+		mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+		/* verify that the pfn exists in pfn_list */
+		p = vfio_find_pfn(domain, pfn[i]);
+		if (p)
+			unlocked += vfio_unpin_pfn(domain, p, true);
+
+		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+	}
 
 	return unlocked;
 }
@@ -341,6 +636,12 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 
 	if (!dma->size)
 		return;
+
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		return;
+
+	if (!dma->iommu_mapped)
+		return;
 	/*
 	 * We use the IOMMU to track the physical addresses, otherwise we'd
 	 * need a much more complicated tracking system.  Unfortunately that
@@ -382,15 +683,16 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		if (WARN_ON(!unmapped))
 			break;
 
-		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
-					     unmapped >> PAGE_SHIFT,
-					     dma->prot, false);
+		unlocked += __vfio_unpin_pages_remote(phys >> PAGE_SHIFT,
+						      unmapped >> PAGE_SHIFT,
+						      dma->prot, false);
 		iova += unmapped;
 
 		cond_resched();
 	}
 
-	vfio_lock_acct(-unlocked);
+	dma->iommu_mapped = false;
+	vfio_lock_acct(current, -unlocked);
 }
 
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
@@ -558,17 +860,85 @@ unwind:
 	return ret;
 }
 
+void vfio_update_accounting(struct vfio_iommu *iommu, struct vfio_dma *dma)
+{
+	struct vfio_domain *domain = iommu->local_domain;
+	struct rb_node *n;
+	long locked = 0;
+
+	if (!iommu->local_domain)
+		return;
+
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+	n = rb_first(&domain->local_addr_space->pfn_list);
+
+	for (; n; n = rb_next(n)) {
+		struct vfio_pfn *vpfn;
+
+		vpfn = rb_entry(n, struct vfio_pfn, node);
+
+		if ((vpfn->iova >= dma->iova) &&
+		    (vpfn->iova < dma->iova + dma->size))
+			locked++;
+	}
+	vfio_lock_acct(current, -locked);
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+}
+
+static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
+			    size_t map_size)
+{
+	dma_addr_t iova = dma->iova;
+	unsigned long vaddr = dma->vaddr;
+	size_t size = map_size, dma_size = 0;
+	long npage;
+	unsigned long pfn;
+	int ret = 0;
+
+	while (size) {
+		/* Pin a contiguous chunk of memory */
+		npage = __vfio_pin_pages_remote(vaddr + dma_size,
+						size >> PAGE_SHIFT, dma->prot,
+						&pfn);
+		if (npage <= 0) {
+			WARN_ON(!npage);
+			ret = (int)npage;
+			break;
+		}
+
+		/* Map it! */
+		ret = vfio_iommu_map(iommu, iova + dma_size, pfn, npage,
+				     dma->prot);
+		if (ret) {
+			__vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
+			break;
+		}
+
+		size -= npage << PAGE_SHIFT;
+		dma_size += npage << PAGE_SHIFT;
+	}
+
+	if (ret)
+		vfio_remove_dma(iommu, dma);
+	else {
+		dma->size = dma_size;
+		dma->iommu_mapped = true;
+		vfio_update_accounting(iommu, dma);
+	}
+
+	return ret;
+}
+
 static int vfio_dma_do_map(struct vfio_iommu *iommu,
 			   struct vfio_iommu_type1_dma_map *map)
 {
 	dma_addr_t iova = map->iova;
 	unsigned long vaddr = map->vaddr;
 	size_t size = map->size;
-	long npage;
 	int ret = 0, prot = 0;
 	uint64_t mask;
 	struct vfio_dma *dma;
-	unsigned long pfn;
 
 	/* Verify that none of our __u64 fields overflow */
 	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
@@ -611,29 +981,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	/* Insert zero-sized and grow as we map chunks of it */
 	vfio_link_dma(iommu, dma);
 
-	while (size) {
-		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages(vaddr + dma->size,
-				       size >> PAGE_SHIFT, prot, &pfn);
-		if (npage <= 0) {
-			WARN_ON(!npage);
-			ret = (int)npage;
-			break;
-		}
-
-		/* Map it! */
-		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
-		if (ret) {
-			vfio_unpin_pages(pfn, npage, prot, true);
-			break;
-		}
-
-		size -= npage << PAGE_SHIFT;
-		dma->size += npage << PAGE_SHIFT;
-	}
-
-	if (ret)
-		vfio_remove_dma(iommu, dma);
+	/* Don't pin and map if container doesn't contain IOMMU capable domain */
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		dma->size = size;
+	else
+		ret = vfio_pin_map_dma(iommu, dma, size);
 
 	mutex_unlock(&iommu->lock);
 	return ret;
@@ -662,10 +1014,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
 	n = rb_first(&iommu->dma_list);
 
-	/* If there's not a domain, there better not be any mappings */
-	if (WARN_ON(n && !d))
-		return -EINVAL;
-
 	for (; n; n = rb_next(n)) {
 		struct vfio_dma *dma;
 		dma_addr_t iova;
@@ -674,20 +1022,43 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 		iova = dma->iova;
 
 		while (iova < dma->iova + dma->size) {
-			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
+			phys_addr_t phys;
 			size_t size;
 
-			if (WARN_ON(!phys)) {
-				iova += PAGE_SIZE;
-				continue;
-			}
+			if (dma->iommu_mapped) {
+				phys = iommu_iova_to_phys(d->domain, iova);
+
+				if (WARN_ON(!phys)) {
+					iova += PAGE_SIZE;
+					continue;
+				}
 
-			size = PAGE_SIZE;
+				size = PAGE_SIZE;
 
-			while (iova + size < dma->iova + dma->size &&
-			       phys + size == iommu_iova_to_phys(d->domain,
+				while (iova + size < dma->iova + dma->size &&
+				    phys + size == iommu_iova_to_phys(d->domain,
 								 iova + size))
-				size += PAGE_SIZE;
+					size += PAGE_SIZE;
+			} else {
+				unsigned long pfn;
+				unsigned long vaddr = dma->vaddr +
+						     (iova - dma->iova);
+				size_t n = dma->iova + dma->size - iova;
+				long npage;
+
+				npage = __vfio_pin_pages_remote(vaddr,
+								n >> PAGE_SHIFT,
+								dma->prot,
+								&pfn);
+				if (npage <= 0) {
+					WARN_ON(!npage);
+					ret = (int)npage;
+					return ret;
+				}
+
+				phys = pfn << PAGE_SHIFT;
+				size = npage << PAGE_SHIFT;
+			}
 
 			ret = iommu_map(domain->domain, iova, phys,
 					size, dma->prot | domain->prot);
@@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 
 			iova += size;
 		}
+
+		if (!dma->iommu_mapped) {
+			dma->iommu_mapped = true;
+			vfio_update_accounting(iommu, dma);
+		}
 	}
 
 	return 0;
@@ -734,11 +1110,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	__free_pages(pages, order);
 }
 
+static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
+				   struct iommu_group *iommu_group)
+{
+	struct vfio_group *g;
+
+	list_for_each_entry(g, &domain->group_list, next) {
+		if (g->iommu_group == iommu_group)
+			return g;
+	}
+
+	return NULL;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
-	struct vfio_group *group, *g;
+	struct vfio_group *group;
 	struct vfio_domain *domain, *d;
 	struct bus_type *bus = NULL;
 	int ret;
@@ -746,10 +1135,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	mutex_lock(&iommu->lock);
 
 	list_for_each_entry(d, &iommu->domain_list, next) {
-		list_for_each_entry(g, &d->group_list, next) {
-			if (g->iommu_group != iommu_group)
-				continue;
+		if (find_iommu_group(d, iommu_group)) {
+			mutex_unlock(&iommu->lock);
+			return -EINVAL;
+		}
+	}
 
+	if (iommu->local_domain) {
+		if (find_iommu_group(iommu->local_domain, iommu_group)) {
 			mutex_unlock(&iommu->lock);
 			return -EINVAL;
 		}
@@ -769,6 +1162,34 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_free;
 
+	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
+	    (bus == &mdev_bus_type)) {
+		if (iommu->local_domain) {
+			list_add(&group->next,
+				 &iommu->local_domain->group_list);
+			kfree(domain);
+			mutex_unlock(&iommu->lock);
+			return 0;
+		}
+
+		domain->local_addr_space =
+				      kzalloc(sizeof(*domain->local_addr_space),
+					      GFP_KERNEL);
+		if (!domain->local_addr_space) {
+			ret = -ENOMEM;
+			goto out_free;
+		}
+
+		domain->local_addr_space->task = current;
+		INIT_LIST_HEAD(&domain->group_list);
+		list_add(&group->next, &domain->group_list);
+		domain->local_addr_space->pfn_list = RB_ROOT;
+		mutex_init(&domain->local_addr_space->pfn_list_lock);
+		iommu->local_domain = domain;
+		mutex_unlock(&iommu->lock);
+		return 0;
+	}
+
 	domain->domain = iommu_domain_alloc(bus);
 	if (!domain->domain) {
 		ret = -EIO;
@@ -859,6 +1280,41 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
 		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
 
+static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *domain = iommu->local_domain;
+	struct vfio_dma *dma, *tdma;
+	struct rb_node *n;
+	long locked = 0;
+
+	rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
+					     node) {
+		vfio_unmap_unpin(iommu, dma);
+	}
+
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+
+	n = rb_first(&domain->local_addr_space->pfn_list);
+
+	for (; n; n = rb_next(n))
+		locked++;
+
+	vfio_lock_acct(domain->local_addr_space->task, locked);
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+}
+
+static void vfio_local_unpin_all(struct vfio_domain *domain)
+{
+	struct rb_node *node;
+
+	mutex_lock(&domain->local_addr_space->pfn_list_lock);
+	while ((node = rb_first(&domain->local_addr_space->pfn_list)))
+		vfio_unpin_pfn(domain,
+				rb_entry(node, struct vfio_pfn, node), false);
+
+	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -868,31 +1324,57 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(domain, &iommu->domain_list, next) {
-		list_for_each_entry(group, &domain->group_list, next) {
-			if (group->iommu_group != iommu_group)
-				continue;
-
-			iommu_detach_group(domain->domain, iommu_group);
+	if (iommu->local_domain) {
+		domain = iommu->local_domain;
+		group = find_iommu_group(domain, iommu_group);
+		if (group) {
 			list_del(&group->next);
 			kfree(group);
-			/*
-			 * Group ownership provides privilege, if the group
-			 * list is empty, the domain goes away.  If it's the
-			 * last domain, then all the mappings go away too.
-			 */
+
 			if (list_empty(&domain->group_list)) {
-				if (list_is_singular(&iommu->domain_list))
+				vfio_local_unpin_all(domain);
+				if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
 					vfio_iommu_unmap_unpin_all(iommu);
-				iommu_domain_free(domain->domain);
-				list_del(&domain->next);
 				kfree(domain);
+				iommu->local_domain = NULL;
+			}
+			goto detach_group_done;
+		}
+	}
+
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		goto detach_group_done;
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		group = find_iommu_group(domain, iommu_group);
+		if (!group)
+			continue;
+
+		iommu_detach_group(domain->domain, iommu_group);
+		list_del(&group->next);
+		kfree(group);
+		/*
+		 * Group ownership provides privilege; if the group list is
+		 * empty, the domain goes away. If it's the last iommu capable
+		 * domain and no local domain exists, then all the mappings go
+		 * away too. If it's the last iommu capable domain and a local
+		 * domain exists, update the accounting instead.
+		 */
+		if (list_empty(&domain->group_list)) {
+			if (list_is_singular(&iommu->domain_list)) {
+				if (!iommu->local_domain)
+					vfio_iommu_unmap_unpin_all(iommu);
+				else
+					vfio_iommu_unmap_unpin_reaccount(iommu);
 			}
-			goto done;
+			iommu_domain_free(domain->domain);
+			list_del(&domain->next);
+			kfree(domain);
 		}
+		break;
 	}
 
-done:
+detach_group_done:
 	mutex_unlock(&iommu->lock);
 }
 
@@ -924,27 +1406,48 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 	return iommu;
 }
 
+static void vfio_release_domain(struct vfio_domain *domain)
+{
+	struct vfio_group *group, *group_tmp;
+
+	list_for_each_entry_safe(group, group_tmp,
+				 &domain->group_list, next) {
+		if (!domain->local_addr_space)
+			iommu_detach_group(domain->domain, group->iommu_group);
+		list_del(&group->next);
+		kfree(group);
+	}
+
+	if (domain->local_addr_space)
+		vfio_local_unpin_all(domain);
+	else
+		iommu_domain_free(domain->domain);
+}
+
 static void vfio_iommu_type1_release(void *iommu_data)
 {
 	struct vfio_iommu *iommu = iommu_data;
 	struct vfio_domain *domain, *domain_tmp;
-	struct vfio_group *group, *group_tmp;
+
+	if (iommu->local_domain) {
+		vfio_release_domain(iommu->local_domain);
+		kfree(iommu->local_domain);
+		iommu->local_domain = NULL;
+	}
 
 	vfio_iommu_unmap_unpin_all(iommu);
 
+	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
+		goto release_exit;
+
 	list_for_each_entry_safe(domain, domain_tmp,
 				 &iommu->domain_list, next) {
-		list_for_each_entry_safe(group, group_tmp,
-					 &domain->group_list, next) {
-			iommu_detach_group(domain->domain, group->iommu_group);
-			list_del(&group->next);
-			kfree(group);
-		}
-		iommu_domain_free(domain->domain);
+		vfio_release_domain(domain);
 		list_del(&domain->next);
 		kfree(domain);
 	}
 
+release_exit:
 	kfree(iommu);
 }
 
@@ -1048,6 +1551,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.ioctl		= vfio_iommu_type1_ioctl,
 	.attach_group	= vfio_iommu_type1_attach_group,
 	.detach_group	= vfio_iommu_type1_detach_group,
+	.pin_pages	= vfio_iommu_type1_pin_pages,
+	.unpin_pages	= vfio_iommu_type1_unpin_pages,
 };
 
 static int __init vfio_iommu_type1_init(void)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b1cd34..0bd25ba6223d 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -17,6 +17,7 @@
 #include <linux/workqueue.h>
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
+#include <linux/mdev.h>
 
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
@@ -75,7 +76,11 @@ struct vfio_iommu_driver_ops {
 					struct iommu_group *group);
 	void		(*detach_group)(void *iommu_data,
 					struct iommu_group *group);
-
+	long		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
+				     long npage, int prot,
+				     unsigned long *phys_pfn);
+	long		(*unpin_pages)(void *iommu_data, unsigned long *pfn,
+				       long npage);
 };
 
 extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
@@ -127,6 +132,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
 }
 #endif /* CONFIG_EEH */
 
+extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+			   long npage, int prot, unsigned long *phys_pfn);
+
+extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
+			     long npage);
+
 /*
  * IRQfd - generic
  */
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-10 20:28 ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-10 20:28   ` Kirti Wankhede
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-10 20:28 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

Add file Documentation/vfio-mediated-device.txt that include details of
mediated device framework.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
---
 Documentation/vfio-mdev/vfio-mediated-device.txt | 219 +++++++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 Documentation/vfio-mdev/vfio-mediated-device.txt

diff --git a/Documentation/vfio-mdev/vfio-mediated-device.txt b/Documentation/vfio-mdev/vfio-mediated-device.txt
new file mode 100644
index 000000000000..c1eacb83807b
--- /dev/null
+++ b/Documentation/vfio-mdev/vfio-mediated-device.txt
@@ -0,0 +1,219 @@
+/*
+ * VFIO Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *             Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+VFIO Mediated devices [1]
+-------------------------
+
+There is an increasing demand to virtualize DMA devices that don't have
+built-in SR-IOV capability. To do this, drivers of different devices had to
+develop their own management interface and set of APIs, and then integrate
+them into user space software. We've identified common requirements and a
+unified management interface for such devices to make user space software
+integration easier and to simplify the device driver implementation of such
+an I/O virtualization solution.
+
+The VFIO driver framework provides unified APIs for direct device access from
+user space. It is an IOMMU/device-agnostic framework for exposing direct
+device access to user space in a secure, IOMMU-protected environment. This
+framework is used for multiple devices, such as GPUs, network adapters and
+compute accelerators. With direct device access, virtual machines or user
+space applications have direct access to the physical device. This framework
+is reused for mediated devices.
+
+The mediated core driver provides a common interface for mediated device
+management that can be used by drivers of different devices. It provides a
+generic interface to create/destroy a mediated device, to add/remove it
+to/from the mediated bus driver, and to add/remove the device to/from an
+IOMMU group. It also provides an interface to register a bus driver; for
+example, the mediated VFIO mdev driver is designed for mediated devices and
+supports the VFIO APIs. The mediated bus driver adds/deletes mediated devices
+to/from a VFIO group.
+
+Below is the high-level block diagram, with NVIDIA, Intel and IBM devices as
+examples, since these are the devices that are going to actively use this
+module as of now.
+
+     +---------------+
+     |               |
+     | +-----------+ |  mdev_register_driver() +--------------+
+     | |           | +<------------------------+              |
+     | |  mdev     | |                         |              |
+     | |  bus      | +------------------------>+ vfio_mdev.ko |<-> VFIO user
+     | |  driver   | |     probe()/remove()    |              |    APIs
+     | |           | |                         +--------------+
+     | +-----------+ |
+     |               |
+     |  MDEV CORE    |
+     |   MODULE      |
+     |   mdev.ko     |
+     | +-----------+ |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         |  nvidia.ko   |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | | Physical  | |
+     | |  device   | |  mdev_register_device() +--------------+
+     | | interface | |<------------------------+              |
+     | |           | |                         |  i915.ko     |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | |           | |
+     | |           | |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         | ccw_device.ko|<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | +-----------+ |
+     +---------------+
+
+
+Registration Interfaces:
+------------------------
+
+The mediated core driver provides two types of registration interfaces:
+
+1. Registration interface for mediated bus driver:
+--------------------------------------------------
+     /*
+      * struct mdev_driver [2] - Mediated device's driver
+      * @name: driver name
+      * @probe: called when new device created
+      * @remove: called when device removed
+      * @driver: device driver structure
+      */
+     struct mdev_driver {
+	     const char *name;
+	     int  (*probe)  (struct device *dev);
+	     void (*remove) (struct device *dev);
+	     struct device_driver    driver;
+     };
+
+Mediated bus driver for mdev should use this interface to register and
+unregister with core driver respectively:
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+Mediated bus driver is responsible to add/delete mediated devices to/from VFIO
+group when devices are bound and unbound to the driver.
+
+2. Physical device driver interface:
+------------------------------------
+This interface [3] provides a set of APIs to manage physical device related work
+in its driver. APIs are:
+
+* dev_attr_groups: attributes of parent device.
+* mdev_attr_groups: attributes of mediated device.
+* supported_type_groups: attributes to define supported types. It is mandatory
+			 to provide supported types.
+* create: to allocate basic resources in driver for a mediated device.
+* remove: to free resources in driver when mediated device is destroyed.
+* open: open callback of mediated device
+* release: close callback of mediated device
+* read : read emulation callback.
+* write: write emulation callback.
+* mmap: mmap emulation callback.
+* ioctl: ioctl callback.
+
+Drivers should use these interfaces to register and unregister device to mdev
+core driver respectively:
+
+extern int  mdev_register_device(struct device *dev,
+                                 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+Mediated device management interface via sysfs
+----------------------------------------------
+Management interface via sysfs allows user space software, like libvirt, to
+query and configure mediated device in a HW agnostic fashion. This management
+interface provide flexibility to underlying physical device's driver to support
+mediated device hotplug, multiple mediated devices per virtual machine, multiple
+mediated devices from different physical devices, etc.
+
+Under per-physical device sysfs:
+--------------------------------
+
+* mdev_supported_types:
+    List of current supported mediated device types and its details are added
+in this directory in following format:
+
+|- <parent phy device>
+|--- Vendor-specific-attributes [optional]
+|--- mdev_supported_types
+|     |--- <type id>
+|     |   |--- create
+|     |   |--- name
+|     |   |--- available_instances
+|     |   |--- description /class
+|     |   |--- [devices]
+|     |--- <type id>
+|     |   |--- create
+|     |   |--- name
+|     |   |--- available_instances
+|     |   |--- description /class
+|     |   |--- [devices]
+|     |--- <type id>
+|          |--- create
+|          |--- name
+|          |--- available_instances
+|          |--- description /class
+|          |--- [devices]
+
+[TBD : description or class is yet to be decided. This will change.]
+
+Under per mdev device:
+----------------------
+
+|- <parent phy device>
+|--- $MDEV_UUID
+         |--- remove
+         |--- {link to its type}
+         |--- vendor-specific-attributes [optional]
+
+* remove: (write only)
+	Write '1' to 'remove' file would destroy mdev device. Vendor driver can
+	fail remove() callback if that device is active and vendor driver
+	doesn't support hot-unplug.
+	Example:
+	# echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove
+
+
+Mediated device Hotplug:
+------------------------
+
+Mediated devices can be created and assigned during runtime. Procedure to
+hot-plug mediated device is same as hot-plug PCI device.
+
+Translation APIs for Mediated device
+------------------------------------
+
+Below APIs are provided for user pfn to host pfn translation in VFIO driver:
+
+extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+                           long npage, int prot, unsigned long *phys_pfn);
+
+extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
+			     long npage);
+
+These functions call back into the backend IOMMU module using two callbacks of
+struct vfio_iommu_driver_ops, pin_pages and unpin_pages [4]. Currently these are
+supported in TYPE1 IOMMU module. To enable the same for other IOMMU backend
+modules, such as PPC64 sPAPR module, they need to provide these two callback
+functions.
+
+References
+----------
+
+[1] See Documentation/vfio.txt for more information on VFIO.
+[2] struct mdev_driver in include/linux/mdev.h
+[3] struct parent_ops in include/linux/mdev.h
+[4] struct vfio_iommu_driver_ops in include/linux/vfio.h
+
-- 
2.7.0


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [Qemu-devel] [PATCH v8 4/6] docs: Add Documentation for Mediated devices
@ 2016-10-10 20:28   ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-10 20:28 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

Add the file Documentation/vfio-mediated-device.txt that includes details
of the mediated device framework.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
---
 Documentation/vfio-mdev/vfio-mediated-device.txt | 219 +++++++++++++++++++++++
 1 file changed, 219 insertions(+)
 create mode 100644 Documentation/vfio-mdev/vfio-mediated-device.txt

diff --git a/Documentation/vfio-mdev/vfio-mediated-device.txt b/Documentation/vfio-mdev/vfio-mediated-device.txt
new file mode 100644
index 000000000000..c1eacb83807b
--- /dev/null
+++ b/Documentation/vfio-mdev/vfio-mediated-device.txt
@@ -0,0 +1,219 @@
+/*
+ * VFIO Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *             Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+VFIO Mediated devices [1]
+-------------------------
+
+There are an increasing number of use cases for virtualizing DMA devices
+that do not have built-in SR-IOV capability. To support them, drivers of
+different devices previously had to implement their own management
+interfaces and sets of APIs, and then integrate them into user space
+software. We have identified the common requirements and defined a unified
+management interface for such devices, to make user space software
+integration easier and to simplify the device driver implementation for
+such I/O virtualization solutions.
+
+The VFIO driver framework provides unified APIs for direct device access from
+user space. It is an IOMMU/device agnostic framework for exposing direct device
+access to user space, in a secure, IOMMU protected environment. This framework
+is used for multiple devices like GPUs, network adapters and compute
+accelerators. With direct device access, virtual machines or user space
+applications have direct access to the physical device. This framework is
+reused for mediated devices.
+
+The mediated core driver provides a common interface for mediated device
+management that can be used by drivers of different devices. This module
+provides a generic interface to create and destroy a mediated device, to
+add it to and remove it from a mediated bus driver, and to add it to and
+remove it from an IOMMU group. It also provides an interface to register a
+bus driver. For example, the mediated VFIO mdev driver is designed for
+mediated devices and supports the VFIO APIs. The mediated bus driver adds
+and deletes mediated devices to and from a VFIO group.
+
+Below is a high-level block diagram, with NVIDIA, Intel, and IBM devices
+as examples, since these are the devices that will actively use this
+module initially.
+
+     +---------------+
+     |               |
+     | +-----------+ |  mdev_register_driver() +--------------+
+     | |           | +<------------------------+              |
+     | |  mdev     | |                         |              |
+     | |  bus      | +------------------------>+ vfio_mdev.ko |<-> VFIO user
+     | |  driver   | |     probe()/remove()    |              |    APIs
+     | |           | |                         +--------------+
+     | +-----------+ |
+     |               |
+     |  MDEV CORE    |
+     |   MODULE      |
+     |   mdev.ko     |
+     | +-----------+ |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         |  nvidia.ko   |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | | Physical  | |
+     | |  device   | |  mdev_register_device() +--------------+
+     | | interface | |<------------------------+              |
+     | |           | |                         |  i915.ko     |<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | |           | |
+     | |           | |  mdev_register_device() +--------------+
+     | |           | +<------------------------+              |
+     | |           | |                         | ccw_device.ko|<-> physical
+     | |           | +------------------------>+              |    device
+     | |           | |        callbacks        +--------------+
+     | +-----------+ |
+     +---------------+
+
+
+Registration Interfaces:
+------------------------
+
+Mediated core driver provides two types of registration interfaces:
+
+1. Registration interface for mediated bus driver:
+--------------------------------------------------
+     /*
+      * struct mdev_driver [2] - Mediated device's driver
+      * @name: driver name
+      * @probe: called when a new device is created
+      * @remove: called when a device is removed
+      * @driver: device driver structure
+      */
+     struct mdev_driver {
+	     const char *name;
+	     int  (*probe)  (struct device *dev);
+	     void (*remove) (struct device *dev);
+	     struct device_driver    driver;
+     };
+
+A mediated bus driver for mdev should use the following interfaces to
+register with and unregister from the core driver:
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+The mediated bus driver is responsible for adding mediated devices to and
+deleting them from the VFIO group when devices are bound to and unbound
+from the driver.
+
+2. Physical device driver interface:
+------------------------------------
+This interface [3] provides a set of APIs to manage physical-device-related
+work in its driver. The APIs are:
+
+* dev_attr_groups: attributes of parent device.
+* mdev_attr_groups: attributes of mediated device.
+* supported_type_groups: attributes to define supported types. It is mandatory
+			 to provide supported types.
+* create: to allocate basic resources in driver for a mediated device.
+* remove: to free resources in driver when mediated device is destroyed.
+* open: open callback of mediated device.
+* release: close callback of mediated device.
+* read: read emulation callback.
+* write: write emulation callback.
+* mmap: mmap emulation callback.
+* ioctl: ioctl callback.
+
+Drivers should use the following interfaces to register a device with and
+unregister it from the mdev core driver:
+
+extern int  mdev_register_device(struct device *dev,
+                                 const struct parent_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+Mediated device management interface via sysfs
+----------------------------------------------
+The management interface via sysfs allows user space software, such as
+libvirt, to query and configure mediated devices in a hardware-agnostic
+fashion. This management interface gives the underlying physical device's
+driver the flexibility to support mediated device hotplug, multiple
+mediated devices per virtual machine, multiple mediated devices from
+different physical devices, and so on.
+
+Under per-physical device sysfs:
+--------------------------------
+
+* mdev_supported_types:
+    The list of currently supported mediated device types, with their
+details, is added in this directory in the following format:
+
+|- <parent phy device>
+|--- Vendor-specific-attributes [optional]
+|--- mdev_supported_types
+|     |--- <type id>
+|     |   |--- create
+|     |   |--- name
+|     |   |--- available_instances
+|     |   |--- description /class
+|     |   |--- [devices]
+|     |--- <type id>
+|     |   |--- create
+|     |   |--- name
+|     |   |--- available_instances
+|     |   |--- description /class
+|     |   |--- [devices]
+|     |--- <type id>
+|          |--- create
+|          |--- name
+|          |--- available_instances
+|          |--- description /class
+|          |--- [devices]
+
+[TBD : description or class is yet to be decided. This will change.]
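The layout above implies a simple discovery-and-create flow from user
space. The following is an illustrative sketch only; the parent device
path, type id, and UUID are made-up examples, so the commands are printed
rather than executed:

```shell
# Illustrative only: the parent path, type id and UUID below are
# assumptions for demonstration; real values depend on the vendor driver.
PARENT=/sys/devices/pci0000:00/0000:00:05.0
TYPE_ID=mtty-2
UUID=83b8f4f2-509f-382f-3c1e-e6bfe0fa1001

echo "ls $PARENT/mdev_supported_types"
echo "cat $PARENT/mdev_supported_types/$TYPE_ID/available_instances"
echo "echo $UUID > $PARENT/mdev_supported_types/$TYPE_ID/create"
echo "ls /sys/bus/mdev/devices/$UUID"
```

Once the UUID is written to the type's 'create' file, the new mdev device
appears under /sys/bus/mdev/devices/ and can be handed to VFIO.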
+
+Under per mdev device:
+----------------------
+
+|- <parent phy device>
+|--- $MDEV_UUID
+         |--- remove
+         |--- {link to its type}
+         |--- vendor-specific-attributes [optional]
+
+* remove: (write only)
+	Writing '1' to the 'remove' file destroys the mdev device. The vendor
+	driver can fail the remove() callback if the device is active and the
+	vendor driver doesn't support hot unplug.
+	Example:
+	# echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove
+
+
+Mediated device Hotplug:
+------------------------
+
+Mediated devices can be created and assigned at runtime. The procedure for
+hot plugging a mediated device is the same as for a PCI device.
+
+Translation APIs for Mediated device
+------------------------------------
+
+The following APIs are provided in the VFIO driver for user pfn to host
+pfn translation:
+
+extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+                           long npage, int prot, unsigned long *phys_pfn);
+
+extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
+			     long npage);
+
+These functions call back into the backend IOMMU module through the
+pin_pages and unpin_pages callbacks of struct vfio_iommu_driver_ops [4].
+Currently these callbacks are supported only in the TYPE1 IOMMU module. To
+enable the same for other IOMMU backend modules, such as the PPC64 sPAPR
+module, they need to provide these two callback functions.
+
+References
+----------
+
+[1] See Documentation/vfio.txt for more information on VFIO.
+[2] struct mdev_driver in include/linux/mdev.h
+[3] struct parent_ops in include/linux/mdev.h
+[4] struct vfio_iommu_driver_ops in include/linux/vfio.h
+
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v8 5/6] Add simple sample driver for mediated device framework
  2016-10-10 20:28 ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-10 20:28   ` Kirti Wankhede
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-10 20:28 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

The sample driver creates an mdev device that simulates a serial port over
a PCI card.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I857f8f12f8b275f2498dfe8c628a5cdc7193b1b2
---
 Documentation/vfio-mdev/Makefile                 |   14 +
 Documentation/vfio-mdev/mtty.c                   | 1353 ++++++++++++++++++++++
 Documentation/vfio-mdev/vfio-mediated-device.txt |   63 +
 3 files changed, 1430 insertions(+)
 create mode 100644 Documentation/vfio-mdev/Makefile
 create mode 100644 Documentation/vfio-mdev/mtty.c

diff --git a/Documentation/vfio-mdev/Makefile b/Documentation/vfio-mdev/Makefile
new file mode 100644
index 000000000000..ff6f8a324c85
--- /dev/null
+++ b/Documentation/vfio-mdev/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for mtty.c file
+#
+KDIR:=/lib/modules/$(shell uname -r)/build
+
+obj-m:=mtty.o
+
+default:
+	$(MAKE) -C $(KDIR) M=$(PWD) modules
+
+clean:
+	@rm -rf .*.cmd *.mod.c *.o *.ko .tmp*
+	@rm -rf Module.* Modules.* modules.* .tmp_versions
+
diff --git a/Documentation/vfio-mdev/mtty.c b/Documentation/vfio-mdev/mtty.c
new file mode 100644
index 000000000000..497c90ebe257
--- /dev/null
+++ b/Documentation/vfio-mdev/mtty.c
@@ -0,0 +1,1353 @@
+/*
+ * Mediated virtual PCI serial host device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *             Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Sample driver that creates an mdev device which simulates a serial
+ * port over a PCI card.
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/file.h>
+#include <linux/mdev.h>
+#include <linux/pci.h>
+#include <linux/serial.h>
+#include <uapi/linux/serial_reg.h>
+/*
+ * #defines
+ */
+
+#define VERSION_STRING  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+
+#define MTTY_CLASS_NAME "mtty"
+
+#define MTTY_NAME       "mtty"
+
+#define MTTY_CONFIG_SPACE_SIZE  0xff
+#define MTTY_IO_BAR_SIZE        0x8
+#define MTTY_MMIO_BAR_SIZE      0x100000
+
+#define STORE_LE16(addr, val)   (*(u16 *)addr = val)
+#define STORE_LE32(addr, val)   (*(u32 *)addr = val)
+
+#define MAX_FIFO_SIZE   16
+
+#define CIRCULAR_BUF_INC_IDX(idx)    (idx = (idx + 1) & (MAX_FIFO_SIZE - 1))
+
+#define MTTY_VFIO_PCI_OFFSET_SHIFT   40
+
+#define MTTY_VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> MTTY_VFIO_PCI_OFFSET_SHIFT)
+#define MTTY_VFIO_PCI_INDEX_TO_OFFSET(index) \
+				((u64)(index) << MTTY_VFIO_PCI_OFFSET_SHIFT)
+#define MTTY_VFIO_PCI_OFFSET_MASK    \
+				(((u64)(1) << MTTY_VFIO_PCI_OFFSET_SHIFT) - 1)
+
+
+/*
+ * Global Structures
+ */
+
+struct mtty_dev {
+	dev_t		vd_devt;
+	struct class	*vd_class;
+	struct cdev	vd_cdev;
+	struct idr	vd_idr;
+	struct device	dev;
+} mtty_dev;
+
+struct mdev_region_info {
+	u64 start;
+	u64 phys_start;
+	u32 size;
+	u64 vfio_offset;
+};
+
+#if defined(DEBUG_REGS)
+const char *wr_reg[] = {
+	"TX",
+	"IER",
+	"FCR",
+	"LCR",
+	"MCR",
+	"LSR",
+	"MSR",
+	"SCR"
+};
+
+const char *rd_reg[] = {
+	"RX",
+	"IER",
+	"IIR",
+	"LCR",
+	"MCR",
+	"LSR",
+	"MSR",
+	"SCR"
+};
+#endif
+
+/* loop back buffer */
+struct rxtx {
+	u8 fifo[MAX_FIFO_SIZE];
+	u8 head, tail;
+	u8 count;
+};
+
+struct serial_port {
+	u8 uart_reg[8];         /* 8 registers */
+	struct rxtx rxtx;       /* loop back buffer */
+	bool dlab;
+	bool overrun;
+	u16 divisor;
+	u8 fcr;                 /* FIFO control register */
+	u8 max_fifo_size;
+	u8 intr_trigger_level;  /* interrupt trigger level */
+};
+
+/* State of each mdev device */
+struct mdev_state {
+	int irq_fd;
+	struct file *intx_file;
+	struct file *msi_file;
+	int irq_index;
+	u8 *vconfig;
+	struct mutex ops_lock;
+	struct mdev_device *mdev;
+	struct mdev_region_info region_info[VFIO_PCI_NUM_REGIONS];
+	u32 bar_mask[VFIO_PCI_NUM_REGIONS];
+	struct list_head next;
+	struct serial_port s[2];
+	struct mutex rxtx_lock;
+	struct vfio_device_info dev_info;
+};
+
+struct mutex mdev_list_lock;
+struct list_head mdev_devices_list;
+
+static const struct file_operations vd_fops = {
+	.owner          = THIS_MODULE,
+};
+
+/* function prototypes */
+
+static int mtty_trigger_interrupt(uuid_le uuid);
+
+/* Helper functions */
+static struct mdev_state *find_mdev_state_by_uuid(uuid_le uuid)
+{
+	struct mdev_state *mds;
+
+	list_for_each_entry(mds, &mdev_devices_list, next) {
+		if (uuid_le_cmp(mds->mdev->uuid, uuid) == 0)
+			return mds;
+	}
+
+	return NULL;
+}
+
+static void dump_buffer(char *buf, uint32_t count)
+{
+#if defined(DEBUG)
+	int i;
+
+	pr_info("Buffer:\n");
+	for (i = 0; i < count; i++) {
+		pr_info("%2x ", *(buf + i));
+		if ((i + 1) % 16 == 0)
+			pr_info("\n");
+	}
+#endif
+}
+
+static void mtty_create_config_space(struct mdev_state *mdev_state)
+{
+	/* PCI dev ID */
+	STORE_LE32((u32 *) &mdev_state->vconfig[0x0], 0x32534348);
+
+	/* Control: I/O+, Mem-, BusMaster- */
+	STORE_LE16((u16 *) &mdev_state->vconfig[0x4], 0x0001);
+
+	/* Status: capabilities list absent */
+	STORE_LE16((u16 *) &mdev_state->vconfig[0x6], 0x0200);
+
+	/* Rev ID */
+	mdev_state->vconfig[0x8] =  0x10;
+
+	/* programming interface class : 16550-compatible serial controller */
+	mdev_state->vconfig[0x9] =  0x02;
+
+	/* Sub class : 00 */
+	mdev_state->vconfig[0xa] =  0x00;
+
+	/* Base class : Simple Communication controllers */
+	mdev_state->vconfig[0xb] =  0x07;
+
+	/* base address registers */
+	/* BAR0: IO space */
+	STORE_LE32((u32 *) &mdev_state->vconfig[0x10], 0x000001);
+	mdev_state->bar_mask[0] = ~(MTTY_IO_BAR_SIZE) + 1;
+
+	/* BAR1: IO space */
+	STORE_LE32((u32 *) &mdev_state->vconfig[0x14], 0x000001);
+	mdev_state->bar_mask[1] = ~(MTTY_IO_BAR_SIZE) + 1;
+
+	/* Subsystem ID */
+	STORE_LE32((u32 *) &mdev_state->vconfig[0x2c], 0x32534348);
+
+	mdev_state->vconfig[0x34] =  0x00;   /* Cap Ptr */
+	mdev_state->vconfig[0x3d] =  0x01;   /* interrupt pin (INTA#) */
+
+	/* Vendor specific data */
+	mdev_state->vconfig[0x40] =  0x23;
+	mdev_state->vconfig[0x43] =  0x80;
+	mdev_state->vconfig[0x44] =  0x23;
+	mdev_state->vconfig[0x48] =  0x23;
+	mdev_state->vconfig[0x4c] =  0x23;
+
+	mdev_state->vconfig[0x60] =  0x50;
+	mdev_state->vconfig[0x61] =  0x43;
+	mdev_state->vconfig[0x62] =  0x49;
+	mdev_state->vconfig[0x63] =  0x20;
+	mdev_state->vconfig[0x64] =  0x53;
+	mdev_state->vconfig[0x65] =  0x65;
+	mdev_state->vconfig[0x66] =  0x72;
+	mdev_state->vconfig[0x67] =  0x69;
+	mdev_state->vconfig[0x68] =  0x61;
+	mdev_state->vconfig[0x69] =  0x6c;
+	mdev_state->vconfig[0x6a] =  0x2f;
+	mdev_state->vconfig[0x6b] =  0x55;
+	mdev_state->vconfig[0x6c] =  0x41;
+	mdev_state->vconfig[0x6d] =  0x52;
+	mdev_state->vconfig[0x6e] =  0x54;
+}
+
+static void handle_pci_cfg_write(struct mdev_state *mdev_state, u16 offset,
+				 char *buf, u32 count)
+{
+	u32 cfg_addr, bar_mask, bar_index = 0;
+
+	switch (offset) {
+	case 0x04: /* device control */
+	case 0x06: /* device status */
+		/* do nothing */
+		break;
+	case 0x3c:  /* interrupt line */
+		mdev_state->vconfig[0x3c] = buf[0];
+		break;
+	case 0x3d:
+		/*
+		 * Interrupt Pin is hardwired to INTA.
+		 * This field is write protected by hardware
+		 */
+		break;
+	case 0x10:  /* BAR0 */
+	case 0x14:  /* BAR1 */
+		if (offset == 0x10)
+			bar_index = 0;
+		else if (offset == 0x14)
+			bar_index = 1;
+
+		cfg_addr = *(u32 *)buf;
+		pr_info("BAR%d addr 0x%x\n", bar_index, cfg_addr);
+
+		if (cfg_addr == 0xffffffff) {
+			bar_mask = mdev_state->bar_mask[bar_index];
+			cfg_addr = (cfg_addr & bar_mask);
+		}
+
+		cfg_addr |= (mdev_state->vconfig[offset] & 0x3ul);
+		STORE_LE32(&mdev_state->vconfig[offset], cfg_addr);
+		break;
+	case 0x18:  /* BAR2 */
+	case 0x1c:  /* BAR3 */
+	case 0x20:  /* BAR4 */
+		STORE_LE32(&mdev_state->vconfig[offset], 0);
+		break;
+	default:
+		pr_info("PCI config write @0x%x of %d bytes not handled\n",
+			offset, count);
+		break;
+	}
+}
+
+static void handle_bar_write(unsigned int index, struct mdev_state *mdev_state,
+				u16 offset, char *buf, u32 count)
+{
+	u8 data = *buf;
+
+	/* Handle data written by guest */
+	switch (offset) {
+	case UART_TX:
+		/* if DLAB set, data is LSB of divisor */
+		if (mdev_state->s[index].dlab) {
+			mdev_state->s[index].divisor |= data;
+			break;
+		}
+
+		mutex_lock(&mdev_state->rxtx_lock);
+
+		/* save in TX buffer */
+		if (mdev_state->s[index].rxtx.count <
+				mdev_state->s[index].max_fifo_size) {
+			mdev_state->s[index].rxtx.fifo[
+					mdev_state->s[index].rxtx.head] = data;
+			mdev_state->s[index].rxtx.count++;
+			CIRCULAR_BUF_INC_IDX(mdev_state->s[index].rxtx.head);
+			mdev_state->s[index].overrun = false;
+
+			/*
+			 * Trigger interrupt if receive data interrupt is
+			 * enabled and fifo reached trigger level
+			 */
+			if ((mdev_state->s[index].uart_reg[UART_IER] &
+						UART_IER_RDI) &&
+			   (mdev_state->s[index].rxtx.count ==
+				    mdev_state->s[index].intr_trigger_level)) {
+				/* trigger interrupt */
+#if defined(DEBUG_INTR)
+				pr_err("Serial port %d: Fifo level trigger\n",
+					index);
+#endif
+				mtty_trigger_interrupt(mdev_state->mdev->uuid);
+			}
+		} else {
+#if defined(DEBUG_INTR)
+			pr_err("Serial port %d: Buffer Overflow\n", index);
+#endif
+			mdev_state->s[index].overrun = true;
+
+			/*
+			 * Trigger interrupt if receiver line status interrupt
+			 * is enabled
+			 */
+			if (mdev_state->s[index].uart_reg[UART_IER] &
+								UART_IER_RLSI)
+				mtty_trigger_interrupt(mdev_state->mdev->uuid);
+		}
+		mutex_unlock(&mdev_state->rxtx_lock);
+		break;
+
+	case UART_IER:
+		/* if DLAB set, data is MSB of divisor */
+		if (mdev_state->s[index].dlab)
+			mdev_state->s[index].divisor |= (u16)data << 8;
+		else {
+			mdev_state->s[index].uart_reg[offset] = data;
+			mutex_lock(&mdev_state->rxtx_lock);
+			if ((data & UART_IER_THRI) &&
+			    (mdev_state->s[index].rxtx.head ==
+					mdev_state->s[index].rxtx.tail)) {
+#if defined(DEBUG_INTR)
+				pr_err("Serial port %d: IER_THRI write\n",
+					index);
+#endif
+				mtty_trigger_interrupt(mdev_state->mdev->uuid);
+			}
+
+			mutex_unlock(&mdev_state->rxtx_lock);
+		}
+
+		break;
+
+	case UART_FCR:
+		mdev_state->s[index].fcr = data;
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		if (data & (UART_FCR_CLEAR_RCVR | UART_FCR_CLEAR_XMIT)) {
+			/* clear loop back FIFO */
+			mdev_state->s[index].rxtx.count = 0;
+			mdev_state->s[index].rxtx.head = 0;
+			mdev_state->s[index].rxtx.tail = 0;
+		}
+		mutex_unlock(&mdev_state->rxtx_lock);
+
+		switch (data & UART_FCR_TRIGGER_MASK) {
+		case UART_FCR_TRIGGER_1:
+			mdev_state->s[index].intr_trigger_level = 1;
+			break;
+
+		case UART_FCR_TRIGGER_4:
+			mdev_state->s[index].intr_trigger_level = 4;
+			break;
+
+		case UART_FCR_TRIGGER_8:
+			mdev_state->s[index].intr_trigger_level = 8;
+			break;
+
+		case UART_FCR_TRIGGER_14:
+			mdev_state->s[index].intr_trigger_level = 14;
+			break;
+		}
+
+		/*
+		 * Set trigger level to 1 otherwise, or implement a timer with
+		 * a timeout of 4 characters and, on its expiry, set the
+		 * "Received data timeout" indication in the IIR register.
+		 */
+		mdev_state->s[index].intr_trigger_level = 1;
+		if (data & UART_FCR_ENABLE_FIFO)
+			mdev_state->s[index].max_fifo_size = MAX_FIFO_SIZE;
+		else {
+			mdev_state->s[index].max_fifo_size = 1;
+			mdev_state->s[index].intr_trigger_level = 1;
+		}
+
+		break;
+
+	case UART_LCR:
+		if (data & UART_LCR_DLAB) {
+			mdev_state->s[index].dlab = true;
+			mdev_state->s[index].divisor = 0;
+		} else
+			mdev_state->s[index].dlab = false;
+
+		mdev_state->s[index].uart_reg[offset] = data;
+		break;
+
+	case UART_MCR:
+		mdev_state->s[index].uart_reg[offset] = data;
+
+		if ((mdev_state->s[index].uart_reg[UART_IER] & UART_IER_MSI) &&
+				(data & UART_MCR_OUT2)) {
+#if defined(DEBUG_INTR)
+			pr_err("Serial port %d: MCR_OUT2 write\n", index);
+#endif
+			mtty_trigger_interrupt(mdev_state->mdev->uuid);
+		}
+
+		if ((mdev_state->s[index].uart_reg[UART_IER] & UART_IER_MSI) &&
+				(data & (UART_MCR_RTS | UART_MCR_DTR))) {
+#if defined(DEBUG_INTR)
+			pr_err("Serial port %d: MCR RTS/DTR write\n", index);
+#endif
+			mtty_trigger_interrupt(mdev_state->mdev->uuid);
+		}
+		break;
+
+	case UART_LSR:
+	case UART_MSR:
+		/* do nothing */
+		break;
+
+	case UART_SCR:
+		mdev_state->s[index].uart_reg[offset] = data;
+		break;
+
+	default:
+		break;
+	}
+}
+
+static void handle_bar_read(unsigned int index, struct mdev_state *mdev_state,
+			    u16 offset, char *buf, u32 count)
+{
+	/* Handle read requests by guest */
+	switch (offset) {
+	case UART_RX:
+		/* if DLAB set, data is LSB of divisor */
+		if (mdev_state->s[index].dlab) {
+			*buf  = (u8)mdev_state->s[index].divisor;
+			break;
+		}
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		/* return data in tx buffer */
+		if (mdev_state->s[index].rxtx.head !=
+				 mdev_state->s[index].rxtx.tail) {
+			*buf = mdev_state->s[index].rxtx.fifo[
+						mdev_state->s[index].rxtx.tail];
+			mdev_state->s[index].rxtx.count--;
+			CIRCULAR_BUF_INC_IDX(mdev_state->s[index].rxtx.tail);
+		}
+
+		if (mdev_state->s[index].rxtx.head ==
+				mdev_state->s[index].rxtx.tail) {
+		/*
+		 *  Trigger interrupt if tx buffer empty interrupt is
+		 *  enabled and fifo is empty
+		 */
+#if defined(DEBUG_INTR)
+			pr_err("Serial port %d: Buffer Empty\n", index);
+#endif
+			if (mdev_state->s[index].uart_reg[UART_IER] &
+							 UART_IER_THRI)
+				mtty_trigger_interrupt(mdev_state->mdev->uuid);
+		}
+		mutex_unlock(&mdev_state->rxtx_lock);
+
+		break;
+
+	case UART_IER:
+		if (mdev_state->s[index].dlab) {
+			*buf = (u8)(mdev_state->s[index].divisor >> 8);
+			break;
+		}
+		*buf = mdev_state->s[index].uart_reg[offset] & 0x0f;
+		break;
+
+	case UART_IIR:
+	{
+		u8 ier = mdev_state->s[index].uart_reg[UART_IER];
+		*buf = 0;
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		/* Interrupt priority 1: Parity, overrun, framing or break */
+		if ((ier & UART_IER_RLSI) && mdev_state->s[index].overrun)
+			*buf |= UART_IIR_RLSI;
+
+		/* Interrupt priority 2: Fifo trigger level reached */
+		if ((ier & UART_IER_RDI) &&
+		    (mdev_state->s[index].rxtx.count ==
+		      mdev_state->s[index].intr_trigger_level))
+			*buf |= UART_IIR_RDI;
+
+		/* Interrupt priority 3: transmitter holding register empty */
+		if ((ier & UART_IER_THRI) &&
+		    (mdev_state->s[index].rxtx.head ==
+				mdev_state->s[index].rxtx.tail))
+			*buf |= UART_IIR_THRI;
+
+		/* Interrupt priority 4: Modem status: CTS, DSR, RI or DCD  */
+		if ((ier & UART_IER_MSI) &&
+		    (mdev_state->s[index].uart_reg[UART_MCR] &
+				 (UART_MCR_RTS | UART_MCR_DTR)))
+			*buf |= UART_IIR_MSI;
+
+		/* bit0: 0=> interrupt pending, 1=> no interrupt is pending */
+		if (*buf == 0)
+			*buf = UART_IIR_NO_INT;
+
+		/* set bit 6 & 7 to be 16550 compatible */
+		*buf |= 0xC0;
+		mutex_unlock(&mdev_state->rxtx_lock);
+	}
+	break;
+
+	case UART_LCR:
+	case UART_MCR:
+		*buf = mdev_state->s[index].uart_reg[offset];
+		break;
+
+	case UART_LSR:
+	{
+		u8 lsr = 0;
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		/* at least one char in FIFO */
+		if (mdev_state->s[index].rxtx.head !=
+				 mdev_state->s[index].rxtx.tail)
+			lsr |= UART_LSR_DR;
+
+		/* if FIFO overrun */
+		if (mdev_state->s[index].overrun)
+			lsr |= UART_LSR_OE;
+
+		/* transmit FIFO empty and transmitter empty */
+		if (mdev_state->s[index].rxtx.head ==
+				 mdev_state->s[index].rxtx.tail)
+			lsr |= UART_LSR_TEMT | UART_LSR_THRE;
+
+		mutex_unlock(&mdev_state->rxtx_lock);
+		*buf = lsr;
+		break;
+	}
+	case UART_MSR:
+		*buf = UART_MSR_DSR | UART_MSR_DDSR | UART_MSR_DCD;
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		/* if AFE is 1 and FIFO have space, set CTS bit */
+		if (mdev_state->s[index].uart_reg[UART_MCR] &
+						 UART_MCR_AFE) {
+			if (mdev_state->s[index].rxtx.count <
+					mdev_state->s[index].max_fifo_size)
+				*buf |= UART_MSR_CTS | UART_MSR_DCTS;
+		} else
+			*buf |= UART_MSR_CTS | UART_MSR_DCTS;
+		mutex_unlock(&mdev_state->rxtx_lock);
+
+		break;
+
+	case UART_SCR:
+		*buf = mdev_state->s[index].uart_reg[offset];
+		break;
+
+	default:
+		break;
+	}
+}
+
+static void mdev_read_base(struct mdev_state *mdev_state)
+{
+	int index, pos;
+	u32 start_lo, start_hi;
+	u32 mem_type;
+
+	pos = PCI_BASE_ADDRESS_0;
+
+	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+		if (!mdev_state->region_info[index].size)
+			continue;
+
+		start_lo = (*(u32 *)(mdev_state->vconfig + pos)) &
+			PCI_BASE_ADDRESS_MEM_MASK;
+		mem_type = (*(u32 *)(mdev_state->vconfig + pos)) &
+			PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+		switch (mem_type) {
+		case PCI_BASE_ADDRESS_MEM_TYPE_64:
+			start_hi = (*(u32 *)(mdev_state->vconfig + pos + 4));
+			pos += 4;
+			break;
+		case PCI_BASE_ADDRESS_MEM_TYPE_32:
+		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+			/* 1M mem BAR treated as 32-bit BAR */
+		default:
+			/* unknown mem type treated as 32-bit BAR */
+			start_hi = 0;
+			break;
+		}
+		pos += 4;
+		mdev_state->region_info[index].start = ((u64)start_hi << 32) |
+							start_lo;
+	}
+}
+
+static ssize_t mdev_access(struct mdev_device *mdev, char *buf,
+		size_t count, loff_t pos, bool is_write)
+{
+	struct mdev_state *mdev_state;
+	unsigned int index;
+	loff_t offset;
+	int ret = 0;
+
+	if (!mdev || !buf)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state) {
+		pr_err("%s mdev_state not found\n", __func__);
+		return -EINVAL;
+	}
+
+	mutex_lock(&mdev_state->ops_lock);
+
+	index = MTTY_VFIO_PCI_OFFSET_TO_INDEX(pos);
+	offset = pos & MTTY_VFIO_PCI_OFFSET_MASK;
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+
+#if defined(DEBUG)
+		pr_info("%s: PCI config space %s at offset 0x%llx\n",
+			 __func__, is_write ? "write" : "read", offset);
+#endif
+		if (is_write) {
+			dump_buffer(buf, count);
+			handle_pci_cfg_write(mdev_state, offset, buf, count);
+		} else {
+			memcpy(buf, (mdev_state->vconfig + offset), count);
+			dump_buffer(buf, count);
+		}
+
+		break;
+
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+		if (!mdev_state->region_info[index].start)
+			mdev_read_base(mdev_state);
+
+		if (is_write) {
+			dump_buffer(buf, count);
+
+#if defined(DEBUG_REGS)
+			pr_info("%s: BAR%d  WR @0x%llx %s val:0x%02x dlab:%d\n",
+				__func__, index, offset, wr_reg[offset],
+				(u8)*buf, mdev_state->s[index].dlab);
+#endif
+			handle_bar_write(index, mdev_state, offset, buf, count);
+		} else {
+			handle_bar_read(index, mdev_state, offset, buf, count);
+			dump_buffer(buf, count);
+
+#if defined(DEBUG_REGS)
+			pr_info("%s: BAR%d  RD @0x%llx %s val:0x%02x dlab:%d\n",
+				__func__, index, offset, rd_reg[offset],
+				(u8)*buf, mdev_state->s[index].dlab);
+#endif
+		}
+		break;
+
+	default:
+		ret = -1;
+		goto accessfailed;
+	}
+
+	ret = count;
+
+accessfailed:
+	mutex_unlock(&mdev_state->ops_lock);
+
+	return ret;
+}
+
+int mtty_create(struct kobject *kobj, struct mdev_device *mdev)
+{
+	struct mdev_state *mdev_state;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = kzalloc(sizeof(struct mdev_state), GFP_KERNEL);
+	if (mdev_state == NULL)
+		return -ENOMEM;
+
+	mdev_state->irq_index = -1;
+	mdev_state->s[0].max_fifo_size = MAX_FIFO_SIZE;
+	mdev_state->s[1].max_fifo_size = MAX_FIFO_SIZE;
+	mutex_init(&mdev_state->rxtx_lock);
+	mdev_state->vconfig = kzalloc(MTTY_CONFIG_SPACE_SIZE, GFP_KERNEL);
+
+	if (mdev_state->vconfig == NULL) {
+		kfree(mdev_state);
+		return -ENOMEM;
+	}
+
+	mutex_init(&mdev_state->ops_lock);
+	mdev_state->mdev = mdev;
+	mdev_set_drvdata(mdev, mdev_state);
+
+	mtty_create_config_space(mdev_state);
+
+	mutex_lock(&mdev_list_lock);
+	list_add(&mdev_state->next, &mdev_devices_list);
+	mutex_unlock(&mdev_list_lock);
+
+	return 0;
+}
+
+int mtty_remove(struct mdev_device *mdev)
+{
+	struct mdev_state *mds, *tmp_mds;
+	struct mdev_state *mdev_state = mdev_get_drvdata(mdev);
+	int ret = -EINVAL;
+
+	mutex_lock(&mdev_list_lock);
+	list_for_each_entry_safe(mds, tmp_mds, &mdev_devices_list, next) {
+		if (mdev_state == mds) {
+			list_del(&mdev_state->next);
+			mdev_set_drvdata(mdev, NULL);
+			kfree(mdev_state->vconfig);
+			kfree(mdev_state);
+			ret = 0;
+			break;
+		}
+	}
+	mutex_unlock(&mdev_list_lock);
+
+	return ret;
+}
+
+int mtty_reset(struct mdev_device *mdev)
+{
+	struct mdev_state *mdev_state;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state)
+		return -EINVAL;
+
+	pr_info("%s: called\n", __func__);
+
+	return 0;
+}
+
+ssize_t mtty_read(struct mdev_device *mdev, char *buf,
+		size_t count, loff_t pos)
+{
+	return mdev_access(mdev, buf, count, pos, false);
+}
+
+ssize_t mtty_write(struct mdev_device *mdev, char *buf,
+		size_t count, loff_t pos)
+{
+	return mdev_access(mdev, buf, count, pos, true);
+}
+
+static int mtty_set_irqs(struct mdev_device *mdev, uint32_t flags,
+			 unsigned int index, unsigned int start,
+			 unsigned int count, void *data)
+{
+	int ret = 0;
+	struct mdev_state *mdev_state;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state)
+		return -EINVAL;
+
+	mutex_lock(&mdev_state->ops_lock);
+	switch (index) {
+	case VFIO_PCI_INTX_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+		{
+			if (flags & VFIO_IRQ_SET_DATA_NONE) {
+				pr_info("%s: disable INTx\n", __func__);
+				break;
+			}
+
+			if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+				int fd = *(int *)data;
+
+				if (fd > 0) {
+					struct fd irqfd;
+
+					irqfd = fdget(fd);
+					if (!irqfd.file) {
+						ret = -EBADF;
+						break;
+					}
+					mdev_state->intx_file = irqfd.file;
+					fdput(irqfd);
+					mdev_state->irq_fd = fd;
+					mdev_state->irq_index = index;
+					break;
+				}
+			}
+			break;
+		}
+		}
+		break;
+	case VFIO_PCI_MSI_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			if (flags & VFIO_IRQ_SET_DATA_NONE) {
+				pr_info("%s: disable MSI\n", __func__);
+				mdev_state->irq_index = VFIO_PCI_INTX_IRQ_INDEX;
+				break;
+			}
+			if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+				int fd = *(int *)data;
+				struct fd irqfd;
+
+				if (fd <= 0)
+					break;
+
+				if (mdev_state->msi_file)
+					break;
+
+				irqfd = fdget(fd);
+				if (!irqfd.file) {
+					ret = -EBADF;
+					break;
+				}
+
+				mdev_state->msi_file = irqfd.file;
+				fdput(irqfd);
+				mdev_state->irq_fd = fd;
+				mdev_state->irq_index = index;
+			}
+			break;
+	}
+	break;
+	case VFIO_PCI_MSIX_IRQ_INDEX:
+		pr_info("%s: MSIX_IRQ\n", __func__);
+		break;
+	case VFIO_PCI_ERR_IRQ_INDEX:
+		pr_info("%s: ERR_IRQ\n", __func__);
+		break;
+	case VFIO_PCI_REQ_IRQ_INDEX:
+		pr_info("%s: REQ_IRQ\n", __func__);
+		break;
+	}
+
+	mutex_unlock(&mdev_state->ops_lock);
+	return ret;
+}
+
+static int mtty_trigger_interrupt(uuid_le uuid)
+{
+	mm_segment_t old_fs;
+	u64 val = 1;
+	loff_t offset = 0;
+	int ret = -1;
+	struct file *pfile = NULL;
+	struct mdev_state *mdev_state;
+
+	mdev_state = find_mdev_state_by_uuid(uuid);
+
+	if (!mdev_state) {
+		pr_info("%s: mdev not found\n", __func__);
+		return -EINVAL;
+	}
+
+	if ((mdev_state->irq_index == VFIO_PCI_MSI_IRQ_INDEX) &&
+			(mdev_state->msi_file == NULL))
+		return -EINVAL;
+	else if ((mdev_state->irq_index == VFIO_PCI_INTX_IRQ_INDEX) &&
+			(mdev_state->intx_file == NULL)) {
+		pr_info("%s: Intr file not found\n", __func__);
+		return -EINVAL;
+	}
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+
+	if (mdev_state->irq_index == VFIO_PCI_MSI_IRQ_INDEX)
+		pfile = mdev_state->msi_file;
+	else
+		pfile = mdev_state->intx_file;
+
+	if (pfile && pfile->f_op && pfile->f_op->write) {
+		ret = pfile->f_op->write(pfile, (char *)&val, sizeof(val),
+					 &offset);
+#if defined(DEBUG_INTR)
+		pr_info("Intx triggered\n");
+#endif
+	} else
+		pr_err("%s: pfile not valid, intr_type = %d\n", __func__,
+				mdev_state->irq_index);
+
+	set_fs(old_fs);
+
+	if (ret < 0)
+		pr_err("%s: eventfd write failed (%d)\n", __func__, ret);
+
+	return ret;
+}
+
+int mtty_get_region_info(struct mdev_device *mdev,
+			 struct vfio_region_info *region_info,
+			 u16 *cap_type_id, void **cap_type)
+{
+	unsigned int size = 0;
+	struct mdev_state *mdev_state;
+	int bar_index;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state)
+		return -EINVAL;
+
+	mutex_lock(&mdev_state->ops_lock);
+	bar_index = region_info->index;
+
+	switch (bar_index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		size = MTTY_CONFIG_SPACE_SIZE;
+		break;
+	case VFIO_PCI_BAR0_REGION_INDEX:
+		size = MTTY_IO_BAR_SIZE;
+		break;
+	case VFIO_PCI_BAR1_REGION_INDEX:
+		size = MTTY_IO_BAR_SIZE;
+		break;
+	default:
+		size = 0;
+		break;
+	}
+
+	mdev_state->region_info[bar_index].size = size;
+	mdev_state->region_info[bar_index].vfio_offset =
+		MTTY_VFIO_PCI_INDEX_TO_OFFSET(bar_index);
+
+	region_info->size = size;
+	region_info->offset = MTTY_VFIO_PCI_INDEX_TO_OFFSET(bar_index);
+	region_info->flags = VFIO_REGION_INFO_FLAG_READ |
+		VFIO_REGION_INFO_FLAG_WRITE;
+	mutex_unlock(&mdev_state->ops_lock);
+	return 0;
+}
+
+int mtty_get_irq_info(struct mdev_device *mdev, struct vfio_irq_info *irq_info)
+{
+	switch (irq_info->index) {
+	case VFIO_PCI_INTX_IRQ_INDEX:
+	case VFIO_PCI_MSI_IRQ_INDEX:
+	case VFIO_PCI_REQ_IRQ_INDEX:
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	irq_info->flags = VFIO_IRQ_INFO_EVENTFD;
+	irq_info->count = 1;
+
+	if (irq_info->index == VFIO_PCI_INTX_IRQ_INDEX)
+		irq_info->flags |= (VFIO_IRQ_INFO_MASKABLE |
+				VFIO_IRQ_INFO_AUTOMASKED);
+	else
+		irq_info->flags |= VFIO_IRQ_INFO_NORESIZE;
+
+	return 0;
+}
+
+int mtty_get_device_info(struct mdev_device *mdev,
+			 struct vfio_device_info *dev_info)
+{
+	dev_info->flags = VFIO_DEVICE_FLAGS_PCI;
+	dev_info->num_regions = VFIO_PCI_NUM_REGIONS;
+	dev_info->num_irqs = VFIO_PCI_NUM_IRQS;
+
+	return 0;
+}
+
+static long mtty_ioctl(struct mdev_device *mdev, unsigned int cmd,
+			unsigned long arg)
+{
+	int ret = 0;
+	unsigned long minsz;
+	struct mdev_state *mdev_state;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state)
+		return -ENODEV;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		ret = mtty_get_device_info(mdev, &info);
+		if (ret)
+			return ret;
+
+		memcpy(&mdev_state->dev_info, &info, sizeof(info));
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+		u16 cap_type_id = 0;
+		void *cap_type = NULL;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		ret = mtty_get_region_info(mdev, &info, &cap_type_id,
+					   &cap_type);
+		if (ret)
+			return ret;
+
+		ret = vfio_info_add_capability(&info, &caps, cap_type_id,
+						cap_type);
+		if (ret)
+			return ret;
+
+		if (info.cap_offset) {
+			if (copy_to_user((void __user *)arg + info.cap_offset,
+						caps.buf, caps.size)) {
+				kfree(caps.buf);
+				return -EFAULT;
+			}
+			kfree(caps.buf);
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	{
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if ((info.argsz < minsz) ||
+		    (info.index >= mdev_state->dev_info.num_irqs))
+			return -EINVAL;
+
+		ret = mtty_get_irq_info(mdev, &info);
+		if (ret)
+			return ret;
+
+		if (info.count == -1)
+			return -EINVAL;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_DEVICE_SET_IRQS:
+	{
+		struct vfio_irq_set hdr;
+		u8 *data = NULL, *ptr = NULL;
+		int data_size = 0;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		ret = vfio_set_irqs_validate_and_prepare(&hdr,
+				mdev_state->dev_info.num_irqs,
+				&data_size);
+		if (ret)
+			return ret;
+
+		if (data_size) {
+			ptr = data = memdup_user((void __user *)(arg + minsz),
+					data_size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+		}
+
+		ret = mtty_set_irqs(mdev, hdr.flags, hdr.index, hdr.start,
+				hdr.count, data);
+
+		kfree(ptr);
+		return ret;
+	}
+	case VFIO_DEVICE_RESET:
+		return mtty_reset(mdev);
+	}
+	return -ENOTTY;
+}
+
+int mtty_open(struct mdev_device *mdev)
+{
+	pr_info("%s\n", __func__);
+	return 0;
+}
+
+void mtty_close(struct mdev_device *mdev)
+{
+	pr_info("%s\n", __func__);
+}
+
+static ssize_t
+sample_mtty_dev_show(struct device *dev, struct device_attribute *attr,
+		     char *buf)
+{
+	return sprintf(buf, "This is phy device\n");
+}
+
+static DEVICE_ATTR_RO(sample_mtty_dev);
+
+static struct attribute *mtty_dev_attrs[] = {
+	&dev_attr_sample_mtty_dev.attr,
+	NULL,
+};
+
+static const struct attribute_group mtty_dev_group = {
+	.name  = "mtty_dev",
+	.attrs = mtty_dev_attrs,
+};
+
+const struct attribute_group *mtty_dev_groups[] = {
+	&mtty_dev_group,
+	NULL,
+};
+
+static ssize_t
+sample_mdev_dev_show(struct device *dev, struct device_attribute *attr,
+		     char *buf)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	if (mdev)
+		return sprintf(buf, "This is MDEV %s\n", dev_name(&mdev->dev));
+
+	return sprintf(buf, "\n");
+}
+
+static DEVICE_ATTR_RO(sample_mdev_dev);
+
+static struct attribute *mdev_dev_attrs[] = {
+	&dev_attr_sample_mdev_dev.attr,
+	NULL,
+};
+
+static const struct attribute_group mdev_dev_group = {
+	.name  = "vendor",
+	.attrs = mdev_dev_attrs,
+};
+
+const struct attribute_group *mdev_dev_groups[] = {
+	&mdev_dev_group,
+	NULL,
+};
+
+static ssize_t
+name_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	return sprintf(buf, "Dual-port-serial\n");
+}
+
+MDEV_TYPE_ATTR_RO(name);
+
+static ssize_t
+available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	return sprintf(buf, "1\n");
+}
+
+MDEV_TYPE_ATTR_RO(available_instances);
+
+static struct attribute *mdev_types_attrs[] = {
+	&mdev_type_attr_name.attr,
+	&mdev_type_attr_available_instances.attr,
+	NULL,
+};
+
+static struct attribute_group mdev_type_group = {
+	.name  = "mtty1",
+	.attrs = mdev_types_attrs,
+};
+
+struct attribute_group *mdev_type_groups[] = {
+	&mdev_type_group,
+	NULL,
+};
+
+struct parent_ops mdev_fops = {
+	.owner                  = THIS_MODULE,
+	.dev_attr_groups        = mtty_dev_groups,
+	.mdev_attr_groups       = mdev_dev_groups,
+	.supported_type_groups  = mdev_type_groups,
+	.create                 = mtty_create,
+	.remove			= mtty_remove,
+	.open                   = mtty_open,
+	.release                = mtty_close,
+	.read                   = mtty_read,
+	.write                  = mtty_write,
+	.ioctl		        = mtty_ioctl,
+};
+
+static void mtty_device_release(struct device *dev)
+{
+	dev_dbg(dev, "mtty: released\n");
+}
+
+static int __init mtty_dev_init(void)
+{
+	int ret = 0;
+
+	pr_info("mtty_dev: %s\n", __func__);
+
+	memset(&mtty_dev, 0, sizeof(mtty_dev));
+
+	idr_init(&mtty_dev.vd_idr);
+
+	ret = alloc_chrdev_region(&mtty_dev.vd_devt, 0, MINORMASK, MTTY_NAME);
+
+	if (ret < 0) {
+		pr_err("Error: failed to register mtty_dev, err:%d\n", ret);
+		return ret;
+	}
+
+	cdev_init(&mtty_dev.vd_cdev, &vd_fops);
+	cdev_add(&mtty_dev.vd_cdev, mtty_dev.vd_devt, MINORMASK);
+
+	pr_info("major_number:%d\n", MAJOR(mtty_dev.vd_devt));
+
+	mtty_dev.vd_class = class_create(THIS_MODULE, MTTY_CLASS_NAME);
+
+	if (IS_ERR(mtty_dev.vd_class)) {
+		pr_err("Error: failed to register mtty_dev class\n");
+		goto failed1;
+	}
+
+	mtty_dev.dev.release = mtty_device_release;
+	dev_set_name(&mtty_dev.dev, "%s", MTTY_NAME);
+
+	ret = device_register(&mtty_dev.dev);
+	if (ret)
+		goto failed2;
+
+	if (mdev_register_device(&mtty_dev.dev, &mdev_fops) != 0)
+		goto failed3;
+
+	mutex_init(&mdev_list_lock);
+	INIT_LIST_HEAD(&mdev_devices_list);
+
+	goto all_done;
+
+failed3:
+
+	device_unregister(&mtty_dev.dev);
+failed2:
+	class_destroy(mtty_dev.vd_class);
+
+failed1:
+	cdev_del(&mtty_dev.vd_cdev);
+	unregister_chrdev_region(mtty_dev.vd_devt, MINORMASK);
+
+all_done:
+	return ret;
+}
+
+static void __exit mtty_dev_exit(void)
+{
+	mtty_dev.dev.bus = NULL;
+	mdev_unregister_device(&mtty_dev.dev);
+
+	device_unregister(&mtty_dev.dev);
+	idr_destroy(&mtty_dev.vd_idr);
+	cdev_del(&mtty_dev.vd_cdev);
+	unregister_chrdev_region(mtty_dev.vd_devt, MINORMASK);
+	class_destroy(mtty_dev.vd_class);
+	mtty_dev.vd_class = NULL;
+	pr_info("mtty_dev: Unloaded!\n");
+}
+
+module_init(mtty_dev_init)
+module_exit(mtty_dev_exit)
+
+MODULE_LICENSE("GPL");
+MODULE_INFO(supported, "Test driver that simulates serial port over PCI");
+MODULE_VERSION(VERSION_STRING);
+MODULE_AUTHOR(DRIVER_AUTHOR);
diff --git a/Documentation/vfio-mdev/vfio-mediated-device.txt b/Documentation/vfio-mdev/vfio-mediated-device.txt
index c1eacb83807b..2b23b4d431a3 100644
--- a/Documentation/vfio-mdev/vfio-mediated-device.txt
+++ b/Documentation/vfio-mdev/vfio-mediated-device.txt
@@ -209,6 +209,69 @@ supported in TYPE1 IOMMU module. To enable the same for other IOMMU backend
 modules, such as PPC64 sPAPR module, they need to provide these two callback
 functions.
 
+Sample code
+-----------
+File mtty.c in this directory is sample code that demonstrates how to use the
+mediated device framework.
+
+The sample driver creates an mdev device that simulates a serial port over a
+PCI card.
+
+Build and load the mtty.ko module. This creates a dummy device, /sys/devices/mtty.
+Files in this device directory in sysfs look like:
+
+# ls /sys/devices/mtty/ -l
+total 0
+drwxr-xr-x 2 root root    0 Sep 29 12:34 mdev_supported_types
+drwxr-xr-x 2 root root    0 Sep 29 12:34 mtty_dev
+drwxr-xr-x 2 root root    0 Sep 29 12:34 power
+-rw-r--r-- 1 root root 4096 Sep 29 12:34 uevent
+
+Create a mediated device using this device:
+# echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" >	\
+		 /sys/devices/mtty/mdev_supported_types/mtty1/create
+
+Add parameters to qemu-kvm:
+-device vfio-pci,\
+ sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
+
+Boot the VM. In the Linux guest (with no such hardware in the host), the device
+is seen as below:
+
+# lspci -s 00:05.0 -xxvv
+00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550])
+        Subsystem: Device 4348:3253
+        Physical Slot: 5
+        Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
+Stepping- SERR- FastB2B- DisINTx-
+        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
+<TAbort- <MAbort- >SERR- <PERR- INTx-
+        Interrupt: pin A routed to IRQ 10
+        Region 0: I/O ports at c150 [size=8]
+        Region 1: I/O ports at c158 [size=8]
+        Kernel driver in use: serial
+00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00
+10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00
+20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32
+30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00
+
+In guest dmesg:
+serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10
+0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
+0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A
+
+Check the serial ports in guest:
+# setserial -g /dev/ttyS*
+/dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
+/dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10
+/dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10
+
+Using minicom or any other terminal emulation program, open port /dev/ttyS1 or
+/dev/ttyS2 with hardware flow control disabled. Type data in the terminal or
+send data to the terminal emulation program and read it back. Data is looped
+back by the host's mtty driver.
+
+Destroy the mediated device created above with:
+# echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001/remove
+
 References
 ----------
 
-- 
2.7.0



* [Qemu-devel] [PATCH v8 5/6] Add simple sample driver for mediated device framework
@ 2016-10-10 20:28   ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-10 20:28 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

The sample driver creates an mdev device that simulates a serial port over a
PCI card.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I857f8f12f8b275f2498dfe8c628a5cdc7193b1b2
---
 Documentation/vfio-mdev/Makefile                 |   14 +
 Documentation/vfio-mdev/mtty.c                   | 1353 ++++++++++++++++++++++
 Documentation/vfio-mdev/vfio-mediated-device.txt |   63 +
 3 files changed, 1430 insertions(+)
 create mode 100644 Documentation/vfio-mdev/Makefile
 create mode 100644 Documentation/vfio-mdev/mtty.c

diff --git a/Documentation/vfio-mdev/Makefile b/Documentation/vfio-mdev/Makefile
new file mode 100644
index 000000000000..ff6f8a324c85
--- /dev/null
+++ b/Documentation/vfio-mdev/Makefile
@@ -0,0 +1,14 @@
+#
+# Makefile for mtty.c file
+#
+KDIR:=/lib/modules/$(shell uname -r)/build
+
+obj-m:=mtty.o
+
+default:
+	$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules
+
+clean:
+	@rm -rf .*.cmd *.mod.c *.o *.ko .tmp*
+	@rm -rf Module.* Modules.* modules.* .tmp_versions
+
diff --git a/Documentation/vfio-mdev/mtty.c b/Documentation/vfio-mdev/mtty.c
new file mode 100644
index 000000000000..497c90ebe257
--- /dev/null
+++ b/Documentation/vfio-mdev/mtty.c
@@ -0,0 +1,1353 @@
+/*
+ * Mediated virtual PCI serial host device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *             Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Sample driver that creates an mdev device that simulates a serial port
+ * over a PCI card.
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/file.h>
+#include <linux/mdev.h>
+#include <linux/pci.h>
+#include <linux/serial.h>
+#include <uapi/linux/serial_reg.h>
+/*
+ * #defines
+ */
+
+#define VERSION_STRING  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+
+#define MTTY_CLASS_NAME "mtty"
+
+#define MTTY_NAME       "mtty"
+
+#define MTTY_CONFIG_SPACE_SIZE  0xff
+#define MTTY_IO_BAR_SIZE        0x8
+#define MTTY_MMIO_BAR_SIZE      0x100000
+
+#define STORE_LE16(addr, val)   (*(u16 *)addr = val)
+#define STORE_LE32(addr, val)   (*(u32 *)addr = val)
+
+#define MAX_FIFO_SIZE   16
+
+#define CIRCULAR_BUF_INC_IDX(idx)    (idx = (idx + 1) & (MAX_FIFO_SIZE - 1))
+
+#define MTTY_VFIO_PCI_OFFSET_SHIFT   40
+
+#define MTTY_VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> MTTY_VFIO_PCI_OFFSET_SHIFT)
+#define MTTY_VFIO_PCI_INDEX_TO_OFFSET(index) \
+				((u64)(index) << MTTY_VFIO_PCI_OFFSET_SHIFT)
+#define MTTY_VFIO_PCI_OFFSET_MASK    \
+				(((u64)(1) << MTTY_VFIO_PCI_OFFSET_SHIFT) - 1)
+
+
+/*
+ * Global Structures
+ */
+
+struct mtty_dev {
+	dev_t		vd_devt;
+	struct class	*vd_class;
+	struct cdev	vd_cdev;
+	struct idr	vd_idr;
+	struct device	dev;
+} mtty_dev;
+
+struct mdev_region_info {
+	u64 start;
+	u64 phys_start;
+	u32 size;
+	u64 vfio_offset;
+};
+
+#if defined(DEBUG_REGS)
+const char *wr_reg[] = {
+	"TX",
+	"IER",
+	"FCR",
+	"LCR",
+	"MCR",
+	"LSR",
+	"MSR",
+	"SCR"
+};
+
+const char *rd_reg[] = {
+	"RX",
+	"IER",
+	"IIR",
+	"LCR",
+	"MCR",
+	"LSR",
+	"MSR",
+	"SCR"
+};
+#endif
+
+/* loop back buffer */
+struct rxtx {
+	u8 fifo[MAX_FIFO_SIZE];
+	u8 head, tail;
+	u8 count;
+};
+
+struct serial_port {
+	u8 uart_reg[8];         /* 8 registers */
+	struct rxtx rxtx;       /* loop back buffer */
+	bool dlab;
+	bool overrun;
+	u16 divisor;
+	u8 fcr;                 /* FIFO control register */
+	u8 max_fifo_size;
+	u8 intr_trigger_level;  /* interrupt trigger level */
+};
+
+/* State of each mdev device */
+struct mdev_state {
+	int irq_fd;
+	struct file *intx_file;
+	struct file *msi_file;
+	int irq_index;
+	u8 *vconfig;
+	struct mutex ops_lock;
+	struct mdev_device *mdev;
+	struct mdev_region_info region_info[VFIO_PCI_NUM_REGIONS];
+	u32 bar_mask[VFIO_PCI_NUM_REGIONS];
+	struct list_head next;
+	struct serial_port s[2];
+	struct mutex rxtx_lock;
+	struct vfio_device_info dev_info;
+};
+
+struct mutex mdev_list_lock;
+struct list_head mdev_devices_list;
+
+static const struct file_operations vd_fops = {
+	.owner          = THIS_MODULE,
+};
+
+/* function prototypes */
+
+static int mtty_trigger_interrupt(uuid_le uuid);
+
+/* Helper functions */
+static struct mdev_state *find_mdev_state_by_uuid(uuid_le uuid)
+{
+	struct mdev_state *mds;
+
+	list_for_each_entry(mds, &mdev_devices_list, next) {
+		if (uuid_le_cmp(mds->mdev->uuid, uuid) == 0)
+			return mds;
+	}
+
+	return NULL;
+}
+
+void dump_buffer(char *buf, uint32_t count)
+{
+#if defined(DEBUG)
+	int i;
+
+	pr_info("Buffer:\n");
+	for (i = 0; i < count; i++) {
+		pr_info("%2x ", *(buf + i));
+		if ((i + 1) % 16 == 0)
+			pr_info("\n");
+	}
+#endif
+}
+
+static void mtty_create_config_space(struct mdev_state *mdev_state)
+{
+	/* PCI dev ID */
+	STORE_LE32((u32 *) &mdev_state->vconfig[0x0], 0x32534348);
+
+	/* Control: I/O+, Mem-, BusMaster- */
+	STORE_LE16((u16 *) &mdev_state->vconfig[0x4], 0x0001);
+
+	/* Status: capabilities list absent */
+	STORE_LE16((u16 *) &mdev_state->vconfig[0x6], 0x0200);
+
+	/* Rev ID */
+	mdev_state->vconfig[0x8] =  0x10;
+
+	/* programming interface class : 16550-compatible serial controller */
+	mdev_state->vconfig[0x9] =  0x02;
+
+	/* Sub class : 00 */
+	mdev_state->vconfig[0xa] =  0x00;
+
+	/* Base class : Simple Communication controllers */
+	mdev_state->vconfig[0xb] =  0x07;
+
+	/* base address registers */
+	/* BAR0: IO space */
+	STORE_LE32((u32 *) &mdev_state->vconfig[0x10], 0x000001);
+	mdev_state->bar_mask[0] = ~(MTTY_IO_BAR_SIZE) + 1;
+
+	/* BAR1: IO space */
+	STORE_LE32((u32 *) &mdev_state->vconfig[0x14], 0x000001);
+	mdev_state->bar_mask[1] = ~(MTTY_IO_BAR_SIZE) + 1;
+
+	/* Subsystem ID */
+	STORE_LE32((u32 *) &mdev_state->vconfig[0x2c], 0x32534348);
+
+	mdev_state->vconfig[0x34] =  0x00;   /* Cap Ptr */
+	mdev_state->vconfig[0x3d] =  0x01;   /* interrupt pin (INTA#) */
+
+	/* Vendor specific data */
+	mdev_state->vconfig[0x40] =  0x23;
+	mdev_state->vconfig[0x43] =  0x80;
+	mdev_state->vconfig[0x44] =  0x23;
+	mdev_state->vconfig[0x48] =  0x23;
+	mdev_state->vconfig[0x4c] =  0x23;
+
+	mdev_state->vconfig[0x60] =  0x50;
+	mdev_state->vconfig[0x61] =  0x43;
+	mdev_state->vconfig[0x62] =  0x49;
+	mdev_state->vconfig[0x63] =  0x20;
+	mdev_state->vconfig[0x64] =  0x53;
+	mdev_state->vconfig[0x65] =  0x65;
+	mdev_state->vconfig[0x66] =  0x72;
+	mdev_state->vconfig[0x67] =  0x69;
+	mdev_state->vconfig[0x68] =  0x61;
+	mdev_state->vconfig[0x69] =  0x6c;
+	mdev_state->vconfig[0x6a] =  0x2f;
+	mdev_state->vconfig[0x6b] =  0x55;
+	mdev_state->vconfig[0x6c] =  0x41;
+	mdev_state->vconfig[0x6d] =  0x52;
+	mdev_state->vconfig[0x6e] =  0x54;
+}
+
+static void handle_pci_cfg_write(struct mdev_state *mdev_state, u16 offset,
+				 char *buf, u32 count)
+{
+	u32 cfg_addr, bar_mask, bar_index = 0;
+
+	switch (offset) {
+	case 0x04: /* device control */
+	case 0x06: /* device status */
+		/* do nothing */
+		break;
+	case 0x3c:  /* interrupt line */
+		mdev_state->vconfig[0x3c] = buf[0];
+		break;
+	case 0x3d:
+		/*
+		 * Interrupt Pin is hardwired to INTA.
+		 * This field is write-protected by hardware.
+		 */
+		break;
+	case 0x10:  /* BAR0 */
+	case 0x14:  /* BAR1 */
+		if (offset == 0x10)
+			bar_index = 0;
+		else if (offset == 0x14)
+			bar_index = 1;
+
+		cfg_addr = *(u32 *)buf;
+		pr_info("BAR%d addr 0x%x\n", bar_index, cfg_addr);
+
+		if (cfg_addr == 0xffffffff) {
+			bar_mask = mdev_state->bar_mask[bar_index];
+			cfg_addr = (cfg_addr & bar_mask);
+		}
+
+		cfg_addr |= (mdev_state->vconfig[offset] & 0x3ul);
+		STORE_LE32(&mdev_state->vconfig[offset], cfg_addr);
+		break;
+	case 0x18:  /* BAR2 */
+	case 0x1c:  /* BAR3 */
+	case 0x20:  /* BAR4 */
+		STORE_LE32(&mdev_state->vconfig[offset], 0);
+		break;
+	default:
+		pr_info("PCI config write @0x%x of %d bytes not handled\n",
+			offset, count);
+		break;
+	}
+}
+
+static void handle_bar_write(unsigned int index, struct mdev_state *mdev_state,
+				u16 offset, char *buf, u32 count)
+{
+	u8 data = *buf;
+
+	/* Handle data written by guest */
+	switch (offset) {
+	case UART_TX:
+		/* if DLAB set, data is LSB of divisor */
+		if (mdev_state->s[index].dlab) {
+			mdev_state->s[index].divisor |= data;
+			break;
+		}
+
+		mutex_lock(&mdev_state->rxtx_lock);
+
+		/* save in TX buffer */
+		if (mdev_state->s[index].rxtx.count <
+				mdev_state->s[index].max_fifo_size) {
+			mdev_state->s[index].rxtx.fifo[
+					mdev_state->s[index].rxtx.head] = data;
+			mdev_state->s[index].rxtx.count++;
+			CIRCULAR_BUF_INC_IDX(mdev_state->s[index].rxtx.head);
+			mdev_state->s[index].overrun = false;
+
+			/*
+			 * Trigger interrupt if receive data interrupt is
+			 * enabled and fifo reached trigger level
+			 */
+			if ((mdev_state->s[index].uart_reg[UART_IER] &
+						UART_IER_RDI) &&
+			   (mdev_state->s[index].rxtx.count ==
+				    mdev_state->s[index].intr_trigger_level)) {
+				/* trigger interrupt */
+#if defined(DEBUG_INTR)
+				pr_err("Serial port %d: Fifo level trigger\n",
+					index);
+#endif
+				mtty_trigger_interrupt(mdev_state->mdev->uuid);
+			}
+		} else {
+#if defined(DEBUG_INTR)
+			pr_err("Serial port %d: Buffer Overflow\n", index);
+#endif
+			mdev_state->s[index].overrun = true;
+
+			/*
+			 * Trigger interrupt if receiver line status interrupt
+			 * is enabled
+			 */
+			if (mdev_state->s[index].uart_reg[UART_IER] &
+								UART_IER_RLSI)
+				mtty_trigger_interrupt(mdev_state->mdev->uuid);
+		}
+		mutex_unlock(&mdev_state->rxtx_lock);
+		break;
+
+	case UART_IER:
+		/* if DLAB set, data is MSB of divisor */
+		if (mdev_state->s[index].dlab)
+			mdev_state->s[index].divisor |= (u16)data << 8;
+		else {
+			mdev_state->s[index].uart_reg[offset] = data;
+			mutex_lock(&mdev_state->rxtx_lock);
+			if ((data & UART_IER_THRI) &&
+			    (mdev_state->s[index].rxtx.head ==
+					mdev_state->s[index].rxtx.tail)) {
+#if defined(DEBUG_INTR)
+				pr_err("Serial port %d: IER_THRI write\n",
+					index);
+#endif
+				mtty_trigger_interrupt(mdev_state->mdev->uuid);
+			}
+
+			mutex_unlock(&mdev_state->rxtx_lock);
+		}
+
+		break;
+
+	case UART_FCR:
+		mdev_state->s[index].fcr = data;
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		if (data & (UART_FCR_CLEAR_RCVR | UART_FCR_CLEAR_XMIT)) {
+			/* clear loop back FIFO */
+			mdev_state->s[index].rxtx.count = 0;
+			mdev_state->s[index].rxtx.head = 0;
+			mdev_state->s[index].rxtx.tail = 0;
+		}
+		mutex_unlock(&mdev_state->rxtx_lock);
+
+		switch (data & UART_FCR_TRIGGER_MASK) {
+		case UART_FCR_TRIGGER_1:
+			mdev_state->s[index].intr_trigger_level = 1;
+			break;
+
+		case UART_FCR_TRIGGER_4:
+			mdev_state->s[index].intr_trigger_level = 4;
+			break;
+
+		case UART_FCR_TRIGGER_8:
+			mdev_state->s[index].intr_trigger_level = 8;
+			break;
+
+		case UART_FCR_TRIGGER_14:
+			mdev_state->s[index].intr_trigger_level = 14;
+			break;
+		}
+
+		/*
+		 * Otherwise set the trigger level to 1, or implement a timer
+		 * with a timeout of 4 characters that sets the Receive Data
+		 * Timeout in the IIR register when it expires.
+		 */
+		mdev_state->s[index].intr_trigger_level = 1;
+		if (data & UART_FCR_ENABLE_FIFO)
+			mdev_state->s[index].max_fifo_size = MAX_FIFO_SIZE;
+		else {
+			mdev_state->s[index].max_fifo_size = 1;
+			mdev_state->s[index].intr_trigger_level = 1;
+		}
+
+		break;
+
+	case UART_LCR:
+		if (data & UART_LCR_DLAB) {
+			mdev_state->s[index].dlab = true;
+			mdev_state->s[index].divisor = 0;
+		} else
+			mdev_state->s[index].dlab = false;
+
+		mdev_state->s[index].uart_reg[offset] = data;
+		break;
+
+	case UART_MCR:
+		mdev_state->s[index].uart_reg[offset] = data;
+
+		if ((mdev_state->s[index].uart_reg[UART_IER] & UART_IER_MSI) &&
+				(data & UART_MCR_OUT2)) {
+#if defined(DEBUG_INTR)
+			pr_err("Serial port %d: MCR_OUT2 write\n", index);
+#endif
+			mtty_trigger_interrupt(mdev_state->mdev->uuid);
+		}
+
+		if ((mdev_state->s[index].uart_reg[UART_IER] & UART_IER_MSI) &&
+				(data & (UART_MCR_RTS | UART_MCR_DTR))) {
+#if defined(DEBUG_INTR)
+			pr_err("Serial port %d: MCR RTS/DTR write\n", index);
+#endif
+			mtty_trigger_interrupt(mdev_state->mdev->uuid);
+		}
+		break;
+
+	case UART_LSR:
+	case UART_MSR:
+		/* do nothing */
+		break;
+
+	case UART_SCR:
+		mdev_state->s[index].uart_reg[offset] = data;
+		break;
+
+	default:
+		break;
+	}
+}
+
+static void handle_bar_read(unsigned int index, struct mdev_state *mdev_state,
+			    u16 offset, char *buf, u32 count)
+{
+	/* Handle read requests by guest */
+	switch (offset) {
+	case UART_RX:
+		/* if DLAB set, data is LSB of divisor */
+		if (mdev_state->s[index].dlab) {
+			*buf  = (u8)mdev_state->s[index].divisor;
+			break;
+		}
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		/* return data in tx buffer */
+		if (mdev_state->s[index].rxtx.head !=
+				 mdev_state->s[index].rxtx.tail) {
+			*buf = mdev_state->s[index].rxtx.fifo[
+						mdev_state->s[index].rxtx.tail];
+			mdev_state->s[index].rxtx.count--;
+			CIRCULAR_BUF_INC_IDX(mdev_state->s[index].rxtx.tail);
+		}
+
+		if (mdev_state->s[index].rxtx.head ==
+				mdev_state->s[index].rxtx.tail) {
+			/*
+			 * Trigger interrupt if tx buffer empty interrupt is
+			 * enabled and fifo is empty
+			 */
+#if defined(DEBUG_INTR)
+			pr_err("Serial port %d: Buffer Empty\n", index);
+#endif
+			if (mdev_state->s[index].uart_reg[UART_IER] &
+							 UART_IER_THRI)
+				mtty_trigger_interrupt(mdev_state->mdev->uuid);
+		}
+		mutex_unlock(&mdev_state->rxtx_lock);
+
+		break;
+
+	case UART_IER:
+		if (mdev_state->s[index].dlab) {
+			*buf = (u8)(mdev_state->s[index].divisor >> 8);
+			break;
+		}
+		*buf = mdev_state->s[index].uart_reg[offset] & 0x0f;
+		break;
+
+	case UART_IIR:
+	{
+		u8 ier = mdev_state->s[index].uart_reg[UART_IER];
+		*buf = 0;
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		/* Interrupt priority 1: Parity, overrun, framing or break */
+		if ((ier & UART_IER_RLSI) && mdev_state->s[index].overrun)
+			*buf |= UART_IIR_RLSI;
+
+		/* Interrupt priority 2: Fifo trigger level reached */
+		if ((ier & UART_IER_RDI) &&
+		    (mdev_state->s[index].rxtx.count ==
+		      mdev_state->s[index].intr_trigger_level))
+			*buf |= UART_IIR_RDI;
+
+		/* Interrupt priority 3: transmitter holding register empty */
+		if ((ier & UART_IER_THRI) &&
+		    (mdev_state->s[index].rxtx.head ==
+				mdev_state->s[index].rxtx.tail))
+			*buf |= UART_IIR_THRI;
+
+		/* Interrupt priority 4: Modem status: CTS, DSR, RI or DCD */
+		if ((ier & UART_IER_MSI) &&
+		    (mdev_state->s[index].uart_reg[UART_MCR] &
+				 (UART_MCR_RTS | UART_MCR_DTR)))
+			*buf |= UART_IIR_MSI;
+
+		/* bit0: 0=> interrupt pending, 1=> no interrupt is pending */
+		if (*buf == 0)
+			*buf = UART_IIR_NO_INT;
+
+		/* set bit 6 & 7 to be 16550 compatible */
+		*buf |= 0xC0;
+		mutex_unlock(&mdev_state->rxtx_lock);
+	}
+	break;
+
+	case UART_LCR:
+	case UART_MCR:
+		*buf = mdev_state->s[index].uart_reg[offset];
+		break;
+
+	case UART_LSR:
+	{
+		u8 lsr = 0;
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		/* at least one char in FIFO */
+		if (mdev_state->s[index].rxtx.head !=
+				 mdev_state->s[index].rxtx.tail)
+			lsr |= UART_LSR_DR;
+
+		/* if FIFO overrun */
+		if (mdev_state->s[index].overrun)
+			lsr |= UART_LSR_OE;
+
+		/* transmit FIFO empty and transmitter empty */
+		if (mdev_state->s[index].rxtx.head ==
+				 mdev_state->s[index].rxtx.tail)
+			lsr |= UART_LSR_TEMT | UART_LSR_THRE;
+
+		mutex_unlock(&mdev_state->rxtx_lock);
+		*buf = lsr;
+		break;
+	}
+	case UART_MSR:
+		*buf = UART_MSR_DSR | UART_MSR_DDSR | UART_MSR_DCD;
+
+		mutex_lock(&mdev_state->rxtx_lock);
+		/* if AFE is 1 and FIFO has space, set CTS bit */
+		if (mdev_state->s[index].uart_reg[UART_MCR] &
+						 UART_MCR_AFE) {
+			if (mdev_state->s[index].rxtx.count <
+					mdev_state->s[index].max_fifo_size)
+				*buf |= UART_MSR_CTS | UART_MSR_DCTS;
+		} else
+			*buf |= UART_MSR_CTS | UART_MSR_DCTS;
+		mutex_unlock(&mdev_state->rxtx_lock);
+
+		break;
+
+	case UART_SCR:
+		*buf = mdev_state->s[index].uart_reg[offset];
+		break;
+
+	default:
+		break;
+	}
+}
+
+static void mdev_read_base(struct mdev_state *mdev_state)
+{
+	int index, pos;
+	u32 start_lo, start_hi;
+	u32 mem_type;
+
+	pos = PCI_BASE_ADDRESS_0;
+
+	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+		if (!mdev_state->region_info[index].size)
+			continue;
+
+		start_lo = (*(u32 *)(mdev_state->vconfig + pos)) &
+			PCI_BASE_ADDRESS_MEM_MASK;
+		mem_type = (*(u32 *)(mdev_state->vconfig + pos)) &
+			PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+		switch (mem_type) {
+		case PCI_BASE_ADDRESS_MEM_TYPE_64:
+			start_hi = (*(u32 *)(mdev_state->vconfig + pos + 4));
+			pos += 4;
+			break;
+		case PCI_BASE_ADDRESS_MEM_TYPE_32:
+		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+			/* 1M mem BAR treated as 32-bit BAR */
+		default:
+			/* mem unknown type treated as 32-bit BAR */
+			start_hi = 0;
+			break;
+		}
+		pos += 4;
+		mdev_state->region_info[index].start = ((u64)start_hi << 32) |
+							start_lo;
+	}
+}
+
+static ssize_t mdev_access(struct mdev_device *mdev, char *buf,
+		size_t count, loff_t pos, bool is_write)
+{
+	struct mdev_state *mdev_state;
+	unsigned int index;
+	loff_t offset;
+	int ret = 0;
+
+	if (!mdev || !buf)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state) {
+		pr_err("%s mdev_state not found\n", __func__);
+		return -EINVAL;
+	}
+
+	mutex_lock(&mdev_state->ops_lock);
+
+	index = MTTY_VFIO_PCI_OFFSET_TO_INDEX(pos);
+	offset = pos & MTTY_VFIO_PCI_OFFSET_MASK;
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+
+#if defined(DEBUG)
+		pr_info("%s: PCI config space %s at offset 0x%llx\n",
+			 __func__, is_write ? "write" : "read", offset);
+#endif
+		if (is_write) {
+			dump_buffer(buf, count);
+			handle_pci_cfg_write(mdev_state, offset, buf, count);
+		} else {
+			memcpy(buf, (mdev_state->vconfig + offset), count);
+			dump_buffer(buf, count);
+		}
+
+		break;
+
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+		if (!mdev_state->region_info[index].start)
+			mdev_read_base(mdev_state);
+
+		if (is_write) {
+			dump_buffer(buf, count);
+
+#if defined(DEBUG_REGS)
+			pr_info("%s: BAR%d  WR @0x%llx %s val:0x%02x dlab:%d\n",
+				__func__, index, offset, wr_reg[offset],
+				(u8)*buf, mdev_state->s[index].dlab);
+#endif
+			handle_bar_write(index, mdev_state, offset, buf, count);
+		} else {
+			handle_bar_read(index, mdev_state, offset, buf, count);
+			dump_buffer(buf, count);
+
+#if defined(DEBUG_REGS)
+			pr_info("%s: BAR%d  RD @0x%llx %s val:0x%02x dlab:%d\n",
+				__func__, index, offset, rd_reg[offset],
+				(u8)*buf, mdev_state->s[index].dlab);
+#endif
+		}
+		break;
+
+	default:
+		ret = -1;
+		goto accessfailed;
+	}
+
+	ret = count;
+
+accessfailed:
+	mutex_unlock(&mdev_state->ops_lock);
+
+	return ret;
+}
+
+int mtty_create(struct kobject *kobj, struct mdev_device *mdev)
+{
+	struct mdev_state *mdev_state;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = kzalloc(sizeof(struct mdev_state), GFP_KERNEL);
+	if (mdev_state == NULL)
+		return -ENOMEM;
+
+	mdev_state->irq_index = -1;
+	mdev_state->s[0].max_fifo_size = MAX_FIFO_SIZE;
+	mdev_state->s[1].max_fifo_size = MAX_FIFO_SIZE;
+	mutex_init(&mdev_state->rxtx_lock);
+	mdev_state->vconfig = kzalloc(MTTY_CONFIG_SPACE_SIZE, GFP_KERNEL);
+
+	if (mdev_state->vconfig == NULL) {
+		kfree(mdev_state);
+		return -ENOMEM;
+	}
+
+	mutex_init(&mdev_state->ops_lock);
+	mdev_state->mdev = mdev;
+	mdev_set_drvdata(mdev, mdev_state);
+
+	mtty_create_config_space(mdev_state);
+
+	mutex_lock(&mdev_list_lock);
+	list_add(&mdev_state->next, &mdev_devices_list);
+	mutex_unlock(&mdev_list_lock);
+
+	return 0;
+}
+
+int mtty_remove(struct mdev_device *mdev)
+{
+	struct mdev_state *mds, *tmp_mds;
+	struct mdev_state *mdev_state = mdev_get_drvdata(mdev);
+	int ret = -EINVAL;
+
+	mutex_lock(&mdev_list_lock);
+	list_for_each_entry_safe(mds, tmp_mds, &mdev_devices_list, next) {
+		if (mdev_state == mds) {
+			list_del(&mdev_state->next);
+			mdev_set_drvdata(mdev, NULL);
+			kfree(mdev_state->vconfig);
+			kfree(mdev_state);
+			ret = 0;
+			break;
+		}
+	}
+	mutex_unlock(&mdev_list_lock);
+
+	return ret;
+}
+
+int mtty_reset(struct mdev_device *mdev)
+{
+	struct mdev_state *mdev_state;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state)
+		return -EINVAL;
+
+	pr_info("%s: called\n", __func__);
+
+	return 0;
+}
+
+ssize_t mtty_read(struct mdev_device *mdev, char *buf,
+		size_t count, loff_t pos)
+{
+	return mdev_access(mdev, buf, count, pos, false);
+}
+
+ssize_t mtty_write(struct mdev_device *mdev, char *buf,
+		size_t count, loff_t pos)
+{
+	return mdev_access(mdev, buf, count, pos, true);
+}
+
+static int mtty_set_irqs(struct mdev_device *mdev, uint32_t flags,
+			 unsigned int index, unsigned int start,
+			 unsigned int count, void *data)
+{
+	int ret = 0;
+	struct mdev_state *mdev_state;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state)
+		return -EINVAL;
+
+	mutex_lock(&mdev_state->ops_lock);
+	switch (index) {
+	case VFIO_PCI_INTX_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+		{
+			if (flags & VFIO_IRQ_SET_DATA_NONE) {
+				pr_info("%s: disable INTx\n", __func__);
+				break;
+			}
+
+			if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+				int fd = *(int *)data;
+
+				if (fd > 0) {
+					struct fd irqfd;
+
+					irqfd = fdget(fd);
+					if (!irqfd.file) {
+						ret = -EBADF;
+						break;
+					}
+					mdev_state->intx_file = irqfd.file;
+					fdput(irqfd);
+					mdev_state->irq_fd = fd;
+					mdev_state->irq_index = index;
+					break;
+				}
+			}
+			break;
+		}
+		}
+		break;
+	case VFIO_PCI_MSI_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_MASK:
+		case VFIO_IRQ_SET_ACTION_UNMASK:
+			break;
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			if (flags & VFIO_IRQ_SET_DATA_NONE) {
+				pr_info("%s: disable MSI\n", __func__);
+				mdev_state->irq_index = VFIO_PCI_INTX_IRQ_INDEX;
+				break;
+			}
+			if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+				int fd = *(int *)data;
+				struct fd irqfd;
+
+				if (fd <= 0)
+					break;
+
+				if (mdev_state->msi_file)
+					break;
+
+				irqfd = fdget(fd);
+				if (!irqfd.file) {
+					ret = -EBADF;
+					break;
+				}
+
+				mdev_state->msi_file = irqfd.file;
+				fdput(irqfd);
+				mdev_state->irq_fd = fd;
+				mdev_state->irq_index = index;
+			}
+			break;
+		}
+		break;
+	case VFIO_PCI_MSIX_IRQ_INDEX:
+		pr_info("%s: MSIX_IRQ\n", __func__);
+		break;
+	case VFIO_PCI_ERR_IRQ_INDEX:
+		pr_info("%s: ERR_IRQ\n", __func__);
+		break;
+	case VFIO_PCI_REQ_IRQ_INDEX:
+		pr_info("%s: REQ_IRQ\n", __func__);
+		break;
+	}
+
+	mutex_unlock(&mdev_state->ops_lock);
+	return ret;
+}
+
+static int mtty_trigger_interrupt(uuid_le uuid)
+{
+	mm_segment_t old_fs;
+	u64 val = 1;
+	loff_t offset = 0;
+	int ret = -1;
+	struct file *pfile = NULL;
+	struct mdev_state *mdev_state;
+
+	mdev_state = find_mdev_state_by_uuid(uuid);
+
+	if (!mdev_state) {
+		pr_info("%s: mdev not found\n", __func__);
+		return -EINVAL;
+	}
+
+	if ((mdev_state->irq_index == VFIO_PCI_MSI_IRQ_INDEX) &&
+			(mdev_state->msi_file == NULL))
+		return -EINVAL;
+	else if ((mdev_state->irq_index == VFIO_PCI_INTX_IRQ_INDEX) &&
+			(mdev_state->intx_file == NULL)) {
+		pr_info("%s: Intr file not found\n", __func__);
+		return -EINVAL;
+	}
+
+	old_fs = get_fs();
+	set_fs(KERNEL_DS);
+
+	if (mdev_state->irq_index == VFIO_PCI_MSI_IRQ_INDEX)
+		pfile = mdev_state->msi_file;
+	else
+		pfile = mdev_state->intx_file;
+
+	if (pfile && pfile->f_op && pfile->f_op->write) {
+		ret = pfile->f_op->write(pfile, (char *)&val, sizeof(val),
+					 &offset);
+#if defined(DEBUG_INTR)
+		pr_info("Intx triggered\n");
+#endif
+	} else
+		pr_err("%s: pfile not valid, intr_type = %d\n", __func__,
+				mdev_state->irq_index);
+
+	set_fs(old_fs);
+
+	if (ret < 0)
+		pr_err("%s: eventfd write failed (%d)\n", __func__, ret);
+
+	return ret;
+}
+
+int mtty_get_region_info(struct mdev_device *mdev,
+			 struct vfio_region_info *region_info,
+			 u16 *cap_type_id, void **cap_type)
+{
+	unsigned int size = 0;
+	struct mdev_state *mdev_state;
+	int bar_index;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state)
+		return -EINVAL;
+
+	mutex_lock(&mdev_state->ops_lock);
+	bar_index = region_info->index;
+
+	switch (bar_index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		size = MTTY_CONFIG_SPACE_SIZE;
+		break;
+	case VFIO_PCI_BAR0_REGION_INDEX:
+		size = MTTY_IO_BAR_SIZE;
+		break;
+	case VFIO_PCI_BAR1_REGION_INDEX:
+		size = MTTY_IO_BAR_SIZE;
+		break;
+	default:
+		size = 0;
+		break;
+	}
+
+	mdev_state->region_info[bar_index].size = size;
+	mdev_state->region_info[bar_index].vfio_offset =
+		MTTY_VFIO_PCI_INDEX_TO_OFFSET(bar_index);
+
+	region_info->size = size;
+	region_info->offset = MTTY_VFIO_PCI_INDEX_TO_OFFSET(bar_index);
+	region_info->flags = VFIO_REGION_INFO_FLAG_READ |
+		VFIO_REGION_INFO_FLAG_WRITE;
+	mutex_unlock(&mdev_state->ops_lock);
+	return 0;
+}
+
+int mtty_get_irq_info(struct mdev_device *mdev, struct vfio_irq_info *irq_info)
+{
+	switch (irq_info->index) {
+	case VFIO_PCI_INTX_IRQ_INDEX:
+	case VFIO_PCI_MSI_IRQ_INDEX:
+	case VFIO_PCI_REQ_IRQ_INDEX:
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	irq_info->flags = VFIO_IRQ_INFO_EVENTFD;
+	irq_info->count = 1;
+
+	if (irq_info->index == VFIO_PCI_INTX_IRQ_INDEX)
+		irq_info->flags |= (VFIO_IRQ_INFO_MASKABLE |
+				VFIO_IRQ_INFO_AUTOMASKED);
+	else
+		irq_info->flags |= VFIO_IRQ_INFO_NORESIZE;
+
+	return 0;
+}
+
+int mtty_get_device_info(struct mdev_device *mdev,
+			 struct vfio_device_info *dev_info)
+{
+	dev_info->flags = VFIO_DEVICE_FLAGS_PCI;
+	dev_info->num_regions = VFIO_PCI_NUM_REGIONS;
+	dev_info->num_irqs = VFIO_PCI_NUM_IRQS;
+
+	return 0;
+}
+
+static long mtty_ioctl(struct mdev_device *mdev, unsigned int cmd,
+			unsigned long arg)
+{
+	int ret = 0;
+	unsigned long minsz;
+	struct mdev_state *mdev_state;
+
+	if (!mdev)
+		return -EINVAL;
+
+	mdev_state = mdev_get_drvdata(mdev);
+	if (!mdev_state)
+		return -ENODEV;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		ret = mtty_get_device_info(mdev, &info);
+		if (ret)
+			return ret;
+
+		memcpy(&mdev_state->dev_info, &info, sizeof(info));
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+		u16 cap_type_id = 0;
+		void *cap_type = NULL;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		ret = mtty_get_region_info(mdev, &info, &cap_type_id,
+					   &cap_type);
+		if (ret)
+			return ret;
+
+		ret = vfio_info_add_capability(&info, &caps, cap_type_id,
+						cap_type);
+		if (ret)
+			return ret;
+
+		if (info.cap_offset) {
+			if (copy_to_user((void __user *)arg + info.cap_offset,
+						caps.buf, caps.size)) {
+				kfree(caps.buf);
+				return -EFAULT;
+			}
+			kfree(caps.buf);
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	{
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if ((info.argsz < minsz) ||
+		    (info.index >= mdev_state->dev_info.num_irqs))
+			return -EINVAL;
+
+		ret = mtty_get_irq_info(mdev, &info);
+		if (ret)
+			return ret;
+
+		if (info.count == -1)
+			return -EINVAL;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_DEVICE_SET_IRQS:
+	{
+		struct vfio_irq_set hdr;
+		u8 *data = NULL, *ptr = NULL;
+		int data_size = 0;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		ret = vfio_set_irqs_validate_and_prepare(&hdr,
+				mdev_state->dev_info.num_irqs,
+				&data_size);
+		if (ret)
+			return ret;
+
+		if (data_size) {
+			ptr = data = memdup_user((void __user *)(arg + minsz),
+					data_size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+		}
+
+		ret = mtty_set_irqs(mdev, hdr.flags, hdr.index, hdr.start,
+				hdr.count, data);
+
+		kfree(ptr);
+		return ret;
+	}
+	case VFIO_DEVICE_RESET:
+		return mtty_reset(mdev);
+	}
+	return -ENOTTY;
+}
+
+int mtty_open(struct mdev_device *mdev)
+{
+	pr_info("%s\n", __func__);
+	return 0;
+}
+
+void mtty_close(struct mdev_device *mdev)
+{
+	pr_info("%s\n", __func__);
+}
+
+static ssize_t
+sample_mtty_dev_show(struct device *dev, struct device_attribute *attr,
+		     char *buf)
+{
+	return sprintf(buf, "This is phy device\n");
+}
+
+static DEVICE_ATTR_RO(sample_mtty_dev);
+
+static struct attribute *mtty_dev_attrs[] = {
+	&dev_attr_sample_mtty_dev.attr,
+	NULL,
+};
+
+static const struct attribute_group mtty_dev_group = {
+	.name  = "mtty_dev",
+	.attrs = mtty_dev_attrs,
+};
+
+const struct attribute_group *mtty_dev_groups[] = {
+	&mtty_dev_group,
+	NULL,
+};
+
+static ssize_t
+sample_mdev_dev_show(struct device *dev, struct device_attribute *attr,
+		     char *buf)
+{
+	struct mdev_device *mdev = to_mdev_device(dev);
+
+	if (mdev)
+		return sprintf(buf, "This is MDEV %s\n", dev_name(&mdev->dev));
+
+	return sprintf(buf, "\n");
+}
+
+static DEVICE_ATTR_RO(sample_mdev_dev);
+
+static struct attribute *mdev_dev_attrs[] = {
+	&dev_attr_sample_mdev_dev.attr,
+	NULL,
+};
+
+static const struct attribute_group mdev_dev_group = {
+	.name  = "vendor",
+	.attrs = mdev_dev_attrs,
+};
+
+const struct attribute_group *mdev_dev_groups[] = {
+	&mdev_dev_group,
+	NULL,
+};
+
+static ssize_t
+name_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	return sprintf(buf, "Dual-port-serial\n");
+}
+
+MDEV_TYPE_ATTR_RO(name);
+
+static ssize_t
+available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+	return sprintf(buf, "1\n");
+}
+
+MDEV_TYPE_ATTR_RO(available_instances);
+
+static struct attribute *mdev_types_attrs[] = {
+	&mdev_type_attr_name.attr,
+	&mdev_type_attr_available_instances.attr,
+	NULL,
+};
+
+static struct attribute_group mdev_type_group = {
+	.name  = "mtty1",
+	.attrs = mdev_types_attrs,
+};
+
+struct attribute_group *mdev_type_groups[] = {
+	&mdev_type_group,
+	NULL,
+};
+
+struct parent_ops mdev_fops = {
+	.owner                  = THIS_MODULE,
+	.dev_attr_groups        = mtty_dev_groups,
+	.mdev_attr_groups       = mdev_dev_groups,
+	.supported_type_groups  = mdev_type_groups,
+	.create                 = mtty_create,
+	.remove			= mtty_remove,
+	.open                   = mtty_open,
+	.release                = mtty_close,
+	.read                   = mtty_read,
+	.write                  = mtty_write,
+	.ioctl		        = mtty_ioctl,
+};
+
+static void mtty_device_release(struct device *dev)
+{
+	dev_dbg(dev, "mtty: released\n");
+}
+
+static int __init mtty_dev_init(void)
+{
+	int ret = 0;
+
+	pr_info("mtty_dev: %s\n", __func__);
+
+	memset(&mtty_dev, 0, sizeof(mtty_dev));
+
+	idr_init(&mtty_dev.vd_idr);
+
+	ret = alloc_chrdev_region(&mtty_dev.vd_devt, 0, MINORMASK, MTTY_NAME);
+
+	if (ret < 0) {
+		pr_err("Error: failed to register mtty_dev, err:%d\n", ret);
+		return ret;
+	}
+
+	cdev_init(&mtty_dev.vd_cdev, &vd_fops);
+	cdev_add(&mtty_dev.vd_cdev, mtty_dev.vd_devt, MINORMASK);
+
+	pr_info("major_number:%d\n", MAJOR(mtty_dev.vd_devt));
+
+	mtty_dev.vd_class = class_create(THIS_MODULE, MTTY_CLASS_NAME);
+
+	if (IS_ERR(mtty_dev.vd_class)) {
+		pr_err("Error: failed to register mtty_dev class\n");
+		goto failed1;
+	}
+
+	mtty_dev.dev.release = mtty_device_release;
+	dev_set_name(&mtty_dev.dev, "%s", MTTY_NAME);
+
+	ret = device_register(&mtty_dev.dev);
+	if (ret)
+		goto failed2;
+
+	if (mdev_register_device(&mtty_dev.dev, &mdev_fops) != 0)
+		goto failed3;
+
+	mutex_init(&mdev_list_lock);
+	INIT_LIST_HEAD(&mdev_devices_list);
+
+	goto all_done;
+
+failed3:
+
+	device_unregister(&mtty_dev.dev);
+failed2:
+	class_destroy(mtty_dev.vd_class);
+
+failed1:
+	cdev_del(&mtty_dev.vd_cdev);
+	unregister_chrdev_region(mtty_dev.vd_devt, MINORMASK);
+
+all_done:
+	return ret;
+}
+
+static void __exit mtty_dev_exit(void)
+{
+	mtty_dev.dev.bus = NULL;
+	mdev_unregister_device(&mtty_dev.dev);
+
+	device_unregister(&mtty_dev.dev);
+	idr_destroy(&mtty_dev.vd_idr);
+	cdev_del(&mtty_dev.vd_cdev);
+	unregister_chrdev_region(mtty_dev.vd_devt, MINORMASK);
+	class_destroy(mtty_dev.vd_class);
+	mtty_dev.vd_class = NULL;
+	pr_info("mtty_dev: Unloaded!\n");
+}
+
+module_init(mtty_dev_init)
+module_exit(mtty_dev_exit)
+
+MODULE_LICENSE("GPL");
+MODULE_INFO(supported, "Test driver that simulates a serial port over PCI");
+MODULE_VERSION(VERSION_STRING);
+MODULE_AUTHOR(DRIVER_AUTHOR);
diff --git a/Documentation/vfio-mdev/vfio-mediated-device.txt b/Documentation/vfio-mdev/vfio-mediated-device.txt
index c1eacb83807b..2b23b4d431a3 100644
--- a/Documentation/vfio-mdev/vfio-mediated-device.txt
+++ b/Documentation/vfio-mdev/vfio-mediated-device.txt
@@ -209,6 +209,69 @@ supported in TYPE1 IOMMU module. To enable the same for other IOMMU backend
 modules, such as PPC64 sPAPR module, they need to provide these two callback
 functions.
 
+Sample code
+-----------
+The file mtty.c in this directory is sample code that demonstrates how to use
+the mediated device framework.
+
+The sample driver creates an mdev device that simulates a serial port over a
+PCI card.
+
+Build and load the mtty.ko module. This creates a dummy device, /sys/devices/mtty.
+Files in this device directory in sysfs look like:
+
+# ls /sys/devices/mtty/ -l
+total 0
+drwxr-xr-x 2 root root    0 Sep 29 12:34 mdev_supported_types
+drwxr-xr-x 2 root root    0 Sep 29 12:34 mtty_dev
+drwxr-xr-x 2 root root    0 Sep 29 12:34 power
+-rw-r--r-- 1 root root 4096 Sep 29 12:34 uevent
+
+Create a mediated device using this device:
+# echo "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001" >	\
+		 /sys/devices/mtty/mdev_supported_types/mtty1/create
+
+Add parameters to qemu-kvm:
+-device vfio-pci,\
+ sysfsdev=/sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
+
+Boot the VM. In a Linux guest (with no hardware in the host), the device is
+seen as below:
+
+# lspci -s 00:05.0 -xxvv
+00:05.0 Serial controller: Device 4348:3253 (rev 10) (prog-if 02 [16550])
+        Subsystem: Device 4348:3253
+        Physical Slot: 5
+        Control: I/O+ Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
+Stepping- SERR- FastB2B- DisINTx-
+        Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-
+<TAbort- <MAbort- >SERR- <PERR- INTx-
+        Interrupt: pin A routed to IRQ 10
+        Region 0: I/O ports at c150 [size=8]
+        Region 1: I/O ports at c158 [size=8]
+        Kernel driver in use: serial
+00: 48 43 53 32 01 00 00 02 10 02 00 07 00 00 00 00
+10: 51 c1 00 00 59 c1 00 00 00 00 00 00 00 00 00 00
+20: 00 00 00 00 00 00 00 00 00 00 00 00 48 43 53 32
+30: 00 00 00 00 00 00 00 00 00 00 00 00 0a 01 00 00
+
+In guest dmesg:
+serial 0000:00:05.0: PCI INT A -> Link[LNKA] -> GSI 10 (level, high) -> IRQ 10
+0000:00:05.0: ttyS1 at I/O 0xc150 (irq = 10) is a 16550A
+0000:00:05.0: ttyS2 at I/O 0xc158 (irq = 10) is a 16550A
+
+Check the serial ports in guest:
+# setserial -g /dev/ttyS*
+/dev/ttyS0, UART: 16550A, Port: 0x03f8, IRQ: 4
+/dev/ttyS1, UART: 16550A, Port: 0xc150, IRQ: 10
+/dev/ttyS2, UART: 16550A, Port: 0xc158, IRQ: 10
+
+Using minicom or any other terminal emulation program, open port /dev/ttyS1 or
+/dev/ttyS2 with hardware flow control disabled. Type data in the minicom
+terminal, or send data to the terminal emulation program, and read the data.
+Data is looped back from the host's mtty driver.
+
+Destroy the mediated device created above with:
+# echo 1 > /sys/bus/mdev/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001/remove
+
 References
 ----------
 
-- 
2.7.0


* [PATCH v8 6/6] Add common functions for SET_IRQS and GET_REGION_INFO ioctls
From: Kirti Wankhede @ 2016-10-10 20:28 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi, Kirti Wankhede

Add common functions for the SET_IRQS ioctl and for adding a capability
buffer for the GET_REGION_INFO ioctl

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: Id9e976a2c08b9b2b37da77dac4365ae8f6024b4a
---
 drivers/vfio/pci/vfio_pci.c | 103 +++++++++++++++------------------------
 drivers/vfio/vfio.c         | 116 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h        |   7 +++
 3 files changed, 162 insertions(+), 64 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 188b1ff03f5f..f312cbb0eebc 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -478,12 +478,12 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
 }
 
 static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
+				struct vfio_region_info *info,
 				struct vfio_info_cap *caps)
 {
-	struct vfio_info_cap_header *header;
 	struct vfio_region_info_cap_sparse_mmap *sparse;
 	size_t end, size;
-	int nr_areas = 2, i = 0;
+	int nr_areas = 2, i = 0, ret;
 
 	end = pci_resource_len(vdev->pdev, vdev->msix_bar);
 
@@ -494,13 +494,10 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
 
 	size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
 
-	header = vfio_info_cap_add(caps, size,
-				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
-	if (IS_ERR(header))
-		return PTR_ERR(header);
+	sparse = kzalloc(size, GFP_KERNEL);
+	if (!sparse)
+		return -ENOMEM;
 
-	sparse = container_of(header,
-			      struct vfio_region_info_cap_sparse_mmap, header);
 	sparse->nr_areas = nr_areas;
 
 	if (vdev->msix_offset & PAGE_MASK) {
@@ -516,24 +513,14 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
 		i++;
 	}
 
-	return 0;
-}
-
-static int region_type_cap(struct vfio_pci_device *vdev,
-			   struct vfio_info_cap *caps,
-			   unsigned int type, unsigned int subtype)
-{
-	struct vfio_info_cap_header *header;
-	struct vfio_region_info_cap_type *cap;
+	info->flags |= VFIO_REGION_INFO_FLAG_CAPS;
 
-	header = vfio_info_cap_add(caps, sizeof(*cap),
-				   VFIO_REGION_INFO_CAP_TYPE, 1);
-	if (IS_ERR(header))
-		return PTR_ERR(header);
+	ret = vfio_info_add_capability(info, caps,
+				      VFIO_REGION_INFO_CAP_SPARSE_MMAP, sparse);
+	kfree(sparse);
 
-	cap = container_of(header, struct vfio_region_info_cap_type, header);
-	cap->type = type;
-	cap->subtype = subtype;
+	if (ret)
+		return ret;
 
 	return 0;
 }
@@ -628,7 +615,8 @@ static long vfio_pci_ioctl(void *device_data,
 			    IORESOURCE_MEM && info.size >= PAGE_SIZE) {
 				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
 				if (info.index == vdev->msix_bar) {
-					ret = msix_sparse_mmap_cap(vdev, &caps);
+					ret = msix_sparse_mmap_cap(vdev, &info,
+								   &caps);
 					if (ret)
 						return ret;
 				}
@@ -676,6 +664,9 @@ static long vfio_pci_ioctl(void *device_data,
 
 			break;
 		default:
+		{
+			struct vfio_region_info_cap_type cap_type;
+
 			if (info.index >=
 			    VFIO_PCI_NUM_REGIONS + vdev->num_regions)
 				return -EINVAL;
@@ -684,29 +675,26 @@ static long vfio_pci_ioctl(void *device_data,
 
 			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
 			info.size = vdev->region[i].size;
-			info.flags = vdev->region[i].flags;
+			info.flags = vdev->region[i].flags |
+				     VFIO_REGION_INFO_FLAG_CAPS;
 
-			ret = region_type_cap(vdev, &caps,
-					      vdev->region[i].type,
-					      vdev->region[i].subtype);
+			cap_type.type = vdev->region[i].type;
+			cap_type.subtype = vdev->region[i].subtype;
+
+			ret = vfio_info_add_capability(&info, &caps,
+						      VFIO_REGION_INFO_CAP_TYPE,
+						      &cap_type);
 			if (ret)
 				return ret;
+
+		}
 		}
 
-		if (caps.size) {
-			info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
-			if (info.argsz < sizeof(info) + caps.size) {
-				info.argsz = sizeof(info) + caps.size;
-				info.cap_offset = 0;
-			} else {
-				vfio_info_cap_shift(&caps, sizeof(info));
-				if (copy_to_user((void __user *)arg +
-						  sizeof(info), caps.buf,
-						  caps.size)) {
-					kfree(caps.buf);
-					return -EFAULT;
-				}
-				info.cap_offset = sizeof(info);
+		if (info.cap_offset) {
+			if (copy_to_user((void __user *)arg + info.cap_offset,
+					 caps.buf, caps.size)) {
+				kfree(caps.buf);
+				return -EFAULT;
 			}
 
 			kfree(caps.buf);
@@ -754,35 +742,22 @@ static long vfio_pci_ioctl(void *device_data,
 	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
 		struct vfio_irq_set hdr;
 		u8 *data = NULL;
-		int ret = 0;
+		int max, ret = 0, data_size = 0;
 
 		minsz = offsetofend(struct vfio_irq_set, count);
 
 		if (copy_from_user(&hdr, (void __user *)arg, minsz))
 			return -EFAULT;
 
-		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
-		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
-				  VFIO_IRQ_SET_ACTION_TYPE_MASK))
-			return -EINVAL;
-
-		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
-			size_t size;
-			int max = vfio_pci_get_irq_count(vdev, hdr.index);
+		max = vfio_pci_get_irq_count(vdev, hdr.index);
 
-			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
-				size = sizeof(uint8_t);
-			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
-				size = sizeof(int32_t);
-			else
-				return -EINVAL;
-
-			if (hdr.argsz - minsz < hdr.count * size ||
-			    hdr.start >= max || hdr.start + hdr.count > max)
-				return -EINVAL;
+		ret = vfio_set_irqs_validate_and_prepare(&hdr, max, &data_size);
+		if (ret)
+			return ret;
 
+		if (data_size) {
 			data = memdup_user((void __user *)(arg + minsz),
-					   hdr.count * size);
+					    data_size);
 			if (IS_ERR(data))
 				return PTR_ERR(data);
 		}
@@ -790,7 +765,7 @@ static long vfio_pci_ioctl(void *device_data,
 		mutex_lock(&vdev->igate);
 
 		ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
-					      hdr.start, hdr.count, data);
+				hdr.start, hdr.count, data);
 
 		mutex_unlock(&vdev->igate);
 		kfree(data);
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index e3e342861e04..0185d5fb2c85 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1782,6 +1782,122 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
 }
 EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
 
+static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
+{
+	struct vfio_info_cap_header *header;
+	struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
+	size_t size;
+
+	size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
+	header = vfio_info_cap_add(caps, size,
+				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
+	if (IS_ERR(header))
+		return PTR_ERR(header);
+
+	sparse_cap = container_of(header,
+			struct vfio_region_info_cap_sparse_mmap, header);
+	sparse_cap->nr_areas = sparse->nr_areas;
+	memcpy(sparse_cap->areas, sparse->areas,
+	       sparse->nr_areas * sizeof(*sparse->areas));
+	return 0;
+}
+
+static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
+{
+	struct vfio_info_cap_header *header;
+	struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
+
+	header = vfio_info_cap_add(caps, sizeof(*cap),
+				   VFIO_REGION_INFO_CAP_TYPE, 1);
+	if (IS_ERR(header))
+		return PTR_ERR(header);
+
+	type_cap = container_of(header, struct vfio_region_info_cap_type,
+				header);
+	type_cap->type = cap->type;
+	type_cap->subtype = cap->subtype;
+	return 0;
+}
+
+int vfio_info_add_capability(struct vfio_region_info *info,
+			     struct vfio_info_cap *caps,
+			     int cap_type_id,
+			     void *cap_type)
+{
+	int ret;
+
+	if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS) || !cap_type)
+		return 0;
+
+	switch (cap_type_id) {
+	case VFIO_REGION_INFO_CAP_SPARSE_MMAP:
+		ret = sparse_mmap_cap(caps, cap_type);
+		if (ret)
+			return ret;
+		break;
+
+	case VFIO_REGION_INFO_CAP_TYPE:
+		ret = region_type_cap(caps, cap_type);
+		if (ret)
+			return ret;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (caps->size) {
+		if (info->argsz < sizeof(*info) + caps->size) {
+			info->argsz = sizeof(*info) + caps->size;
+			info->cap_offset = 0;
+		} else {
+			vfio_info_cap_shift(caps, sizeof(*info));
+			info->cap_offset = sizeof(*info);
+		}
+	}
+	return 0;
+}
+EXPORT_SYMBOL(vfio_info_add_capability);
+
+int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int num_irqs,
+				       int *data_size)
+{
+	unsigned long minsz;
+
+	minsz = offsetofend(struct vfio_irq_set, count);
+
+	if ((hdr->argsz < minsz) || (hdr->index >= VFIO_PCI_NUM_IRQS) ||
+	    (hdr->flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+				VFIO_IRQ_SET_ACTION_TYPE_MASK)))
+		return -EINVAL;
+
+	if (data_size)
+		*data_size = 0;
+
+	if (!(hdr->flags & VFIO_IRQ_SET_DATA_NONE)) {
+		size_t size;
+
+		if (hdr->flags & VFIO_IRQ_SET_DATA_BOOL)
+			size = sizeof(uint8_t);
+		else if (hdr->flags & VFIO_IRQ_SET_DATA_EVENTFD)
+			size = sizeof(int32_t);
+		else
+			return -EINVAL;
+
+		if ((hdr->argsz - minsz < hdr->count * size) ||
+		    (hdr->start >= num_irqs) ||
+		    (hdr->start + hdr->count > num_irqs))
+			return -EINVAL;
+
+		if (!data_size)
+			return -EINVAL;
+
+		*data_size = hdr->count * size;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
+
 static struct vfio_group *vfio_group_from_dev(struct device *dev)
 {
 	struct vfio_device *device;
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0bd25ba6223d..5641dab72ded 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -108,6 +108,13 @@ extern struct vfio_info_cap_header *vfio_info_cap_add(
 		struct vfio_info_cap *caps, size_t size, u16 id, u16 version);
 extern void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset);
 
+extern int vfio_info_add_capability(struct vfio_region_info *info,
+				    struct vfio_info_cap *caps,
+				    int cap_type_id, void *cap_type);
+
+extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
+					      int num_irqs, int *data_size);
+
 struct pci_dev;
 #ifdef CONFIG_EEH
 extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);
-- 
2.7.0

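For reference, the sizing and bounds rules that the new vfio_set_irqs_validate_and_prepare() helper enforces can be sketched as a standalone userspace program. This is an illustration only: the VFIO flag values and the vfio_irq_set layout are stubbed with assumed values, and the kernel helper's NULL-data_size handling is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag values, mirroring the uapi definitions */
#define VFIO_IRQ_SET_DATA_NONE		(1 << 0)
#define VFIO_IRQ_SET_DATA_BOOL		(1 << 1)
#define VFIO_IRQ_SET_DATA_EVENTFD	(1 << 2)
#define VFIO_IRQ_SET_DATA_TYPE_MASK	(VFIO_IRQ_SET_DATA_NONE | \
					 VFIO_IRQ_SET_DATA_BOOL | \
					 VFIO_IRQ_SET_DATA_EVENTFD)
#define VFIO_IRQ_SET_ACTION_TYPE_MASK	(7 << 3)
#define VFIO_PCI_NUM_IRQS		5

#define offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Header only; the variable-length data[] payload follows it in the ioctl */
struct vfio_irq_set {
	uint32_t argsz, flags, index, start, count;
};

static int validate_and_prepare(const struct vfio_irq_set *hdr, int num_irqs,
				int *data_size)
{
	size_t minsz = offsetofend(struct vfio_irq_set, count);

	if (hdr->argsz < minsz || hdr->index >= VFIO_PCI_NUM_IRQS ||
	    (hdr->flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
			    VFIO_IRQ_SET_ACTION_TYPE_MASK)))
		return -EINVAL;

	*data_size = 0;

	if (!(hdr->flags & VFIO_IRQ_SET_DATA_NONE)) {
		size_t size;

		if (hdr->flags & VFIO_IRQ_SET_DATA_BOOL)
			size = sizeof(uint8_t);		/* one bool per IRQ */
		else if (hdr->flags & VFIO_IRQ_SET_DATA_EVENTFD)
			size = sizeof(int32_t);		/* one eventfd per IRQ */
		else
			return -EINVAL;

		/* payload must fit in argsz and stay inside the IRQ range */
		if (hdr->argsz - minsz < hdr->count * size ||
		    hdr->start >= (uint32_t)num_irqs ||
		    hdr->start + hdr->count > (uint32_t)num_irqs)
			return -EINVAL;

		*data_size = hdr->count * size;
	}
	return 0;
}
```

The same rules apply whether the caller is vfio-pci or the new vfio_mdev shim: DATA_BOOL payloads are one byte per IRQ, DATA_EVENTFD payloads are one int32_t per IRQ, and either payload must fit inside argsz.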

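Similarly, the new sparse_mmap_cap() and region_type_cap() helpers both lean on the convention that struct vfio_info_cap_header is the first member of each capability struct, so container_of() on the header returned by vfio_info_cap_add() recovers the enclosing capability. A minimal userspace sketch of that pattern (struct layouts simplified, capability id assumed, and the allocation stands in for the real capability-chain buffer):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct vfio_info_cap_header {
	uint16_t id;
	uint16_t version;
	uint32_t next;		/* offset of the next capability in the chain */
};

struct vfio_region_info_cap_type {
	struct vfio_info_cap_header header;	/* must be the first member */
	uint32_t type;
	uint32_t subtype;
};

/* Stand-in for vfio_info_cap_add(): allocate a capability of the requested
 * size and hand back a pointer to its embedded header. */
static struct vfio_info_cap_header *cap_add(size_t size, uint16_t id,
					    uint16_t version)
{
	struct vfio_info_cap_header *header = calloc(1, size);

	if (header) {
		header->id = id;
		header->version = version;
	}
	return header;
}

static struct vfio_region_info_cap_type *add_type_cap(uint32_t type,
						      uint32_t subtype)
{
	struct vfio_info_cap_header *header;
	struct vfio_region_info_cap_type *cap;

	header = cap_add(sizeof(*cap), 2 /* assumed capability id */, 1);
	if (!header)
		return NULL;

	/* recover the enclosing capability struct from its first member */
	cap = container_of(header, struct vfio_region_info_cap_type, header);
	cap->type = type;
	cap->subtype = subtype;
	return cap;
}
```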
^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v8 1/6] vfio: Mediated device Core driver
  2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-10 21:00     ` Eric Blake
  -1 siblings, 0 replies; 73+ messages in thread
From: Eric Blake @ 2016-10-10 21:00 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: jike.song, kvm, kevin.tian, qemu-devel, bjsdjshi



On 10/10/2016 03:28 PM, Kirti Wankhede wrote:
> Design for Mediated Device Driver:
> Main purpose of this driver is to provide a common interface for mediated
> device management that can be used by different drivers of different
> devices.
> 

> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I73a5084574270b14541c529461ea2f03c292d510
> ---

> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,12 @@
> +
> +config VFIO_MDEV
> +    tristate "Mediated device driver framework"
> +    depends on VFIO
> +    default n
> +    help
> +        Provides a framework to virtualize device.

Feels like a missing word or two here, maybe 'virtualize a _____ device' ?

> +	See Documentation/vfio-mdev/vfio-mediated-device.txt for more details.
> +
> +        If you don't know what to do here, say N.
> +

> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -0,0 +1,363 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.

Umm - "All rights reserved" is incompatible with GPLv2.  Either you
reserved the rights (and it is therefore not GPL), or it is GPL (and you
are specifically granting rights, and therefore not all rights are
reserved).

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org




* Re: [PATCH v8 0/6] Add Mediated device support
  2016-10-10 20:28 ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-11  2:23   ` Jike Song
  -1 siblings, 0 replies; 73+ messages in thread
From: Jike Song @ 2016-10-11  2:23 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: kevin.tian, cjia, kvm, qemu-devel, alex.williamson, kraxel,
	pbonzini, bjsdjshi

On 10/11/2016 04:28 AM, Kirti Wankhede wrote:
> This series adds Mediated device support to Linux host kernel. Purpose
> of this series is to provide a common interface for mediated device
> management that can be used by different devices. This series introduces
> Mdev core module that creates and manages mediated devices, VFIO based
> driver for mediated devices that are created by mdev core module and
> update VFIO type1 IOMMU module to support pinning & unpinning for mediated
> devices.
> 
> This change uses uuid_le_to_bin() to parse UUID string and convert to bin.
> This requires following commits from linux master branch:
> * commit bc9dc9d5eec908806f1b15c9ec2253d44dcf7835 :
>         lib/uuid.c: use correct offset in uuid parser
> * commit 2b1b0d66704a8cafe83be7114ec4c15ab3a314ad :
>         lib/uuid.c: introduce a few more generic helpers
> 
> Requires below commits from linux master branch for mmap region fault
> handler that uses remap_pfn_range() to setup EPT properly.
> * commit add6a0cd1c5ba51b201e1361b05a5df817083618
>         KVM: MMU: try to fix up page faults before giving up
> * commit 92176a8ede577d0ff78ab3298e06701f67ad5f51 :
>         KVM: MMU: prepare to support mapping of VM_IO and VM_PFNMAP frames
> 
> What changed in v8?
> mdev-core:
> - Removed start/stop or online/offline interfaces.
> - Added open() and close() interfaces that should be used to commit
>   resources for mdev devices from vendor driver.
> - Removed supported_config callback function. Introduced sysfs interface
>   for 'mdev_supported_types' as discussed earlier. It is mandatory to
>   provide supported types by vendor driver.
> - Removed 'mdev_create' and 'mdev_destroy' sysfs files from device's
>   directory. Added 'create' file in each supported type group that vendor
>   driver would define. Added 'remove' file in mdev device directory to
>   destroy mdev device.
> 
> vfio_mdev:
> - Added ioctl() callback. All ioctls should be handled in the vendor driver.
> - Common functions for SET_IRQS and GET_REGION_INFO ioctls are added to
>   reduce code duplication in vendor drivers.
> - This forms a shim layer that passes VFIO device operations through to the
>   vendor driver for mediated devices.

Hi Kirti,

While I have not yet looked at the v8 details, I would say that this is
definitely the right way to go, as I have been proposing for quite a
long while :)

--
Thanks,
Jike

> 
> vfio_iommu_type1:
> - Handled the case where all devices attached to the normal IOMMU API
>   domain go away while an mdev device still exists in the domain. Updated
>   page accounting for the local domain.
> - Similarly, if a device is attached to the normal IOMMU API domain,
>   mappings are established and page accounting is updated accordingly.
> - Tested hot-plug and hot-unplug of vGPU and GPU pass through device with
>   Linux VM.
> 
> Documentation:
> - Updated vfio-mediated-device.txt with current interface.
> - Added sample driver that simulates serial port over PCI card for a VM.
>   This driver is added as an example for how to use mediated device
>   framework.
> - Moved updated document and example driver to 'vfio-mdev' directory in
>   Documentation.
> 
> 
> Kirti Wankhede (6):
>   vfio: Mediated device Core driver
>   vfio: VFIO based driver for Mediated devices
>   vfio iommu: Add support for mediated devices
>   docs: Add Documentation for Mediated devices
>   Add simple sample driver for mediated device framework
>   Add common functions for SET_IRQS and GET_REGION_INFO ioctls
> 
>  Documentation/vfio-mdev/Makefile                 |   14 +
>  Documentation/vfio-mdev/mtty.c                   | 1353 ++++++++++++++++++++++
>  Documentation/vfio-mdev/vfio-mediated-device.txt |  282 +++++
>  drivers/vfio/Kconfig                             |    1 +
>  drivers/vfio/Makefile                            |    1 +
>  drivers/vfio/mdev/Kconfig                        |   18 +
>  drivers/vfio/mdev/Makefile                       |    6 +
>  drivers/vfio/mdev/mdev_core.c                    |  363 ++++++
>  drivers/vfio/mdev/mdev_driver.c                  |  131 +++
>  drivers/vfio/mdev/mdev_private.h                 |   41 +
>  drivers/vfio/mdev/mdev_sysfs.c                   |  295 +++++
>  drivers/vfio/mdev/vfio_mdev.c                    |  171 +++
>  drivers/vfio/pci/vfio_pci.c                      |  103 +-
>  drivers/vfio/pci/vfio_pci_private.h              |    6 +-
>  drivers/vfio/vfio.c                              |  233 ++++
>  drivers/vfio/vfio_iommu_type1.c                  |  685 +++++++++--
>  include/linux/mdev.h                             |  178 +++
>  include/linux/vfio.h                             |   20 +-
>  18 files changed, 3743 insertions(+), 158 deletions(-)
>  create mode 100644 Documentation/vfio-mdev/Makefile
>  create mode 100644 Documentation/vfio-mdev/mtty.c
>  create mode 100644 Documentation/vfio-mdev/vfio-mediated-device.txt
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev_core.c
>  create mode 100644 drivers/vfio/mdev/mdev_driver.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>  create mode 100644 drivers/vfio/mdev/vfio_mdev.c
>  create mode 100644 include/linux/mdev.h
> 


* Re: [PATCH v8 1/6] vfio: Mediated device Core driver
  2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-11  3:51     ` Alex Williamson
  -1 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-11  3:51 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel, pbonzini, bjsdjshi

On Tue, 11 Oct 2016 01:58:32 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/Makefile            |   1 +
>  drivers/vfio/mdev/Kconfig        |  12 ++
>  drivers/vfio/mdev/Makefile       |   5 +
>  drivers/vfio/mdev/mdev_core.c    | 363 +++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev_driver.c  | 131 ++++++++++++++
>  drivers/vfio/mdev/mdev_private.h |  41 +++++
>  drivers/vfio/mdev/mdev_sysfs.c   | 295 +++++++++++++++++++++++++++++++
>  include/linux/mdev.h             | 178 +++++++++++++++++++
>  9 files changed, 1027 insertions(+)
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev_core.c
>  create mode 100644 drivers/vfio/mdev/mdev_driver.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>  create mode 100644 include/linux/mdev.h


Overall this is heading in a good direction.  What kernel is this
series against?  I could only apply it to v4.7, yet some of the
dependencies claimed in the cover letter are only in v4.8.  linux-next
or v4.8 are both good baselines right now, as we move to v4.9-rc
releases, linux-next probably becomes a better target.

A few initial comments below, I'll likely have more as I wrap my head
around it.  Thanks,

Alex

> diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
> new file mode 100644
> index 000000000000..019c196e62d5
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_core.c
> @@ -0,0 +1,363 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/sched.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>

I don't see any vfio interfaces used here; is vfio.h necessary?

> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
[snip]
> +int mdev_device_create(struct kobject *kobj, struct device *dev, uuid_le uuid)
> +{
> +	int ret;
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	struct mdev_type *type = to_mdev_type(kobj);
> +
> +	parent = mdev_get_parent(type->parent);
> +	if (!parent)
> +		return -EINVAL;
> +
> +	/* Check for duplicate */
> +	mdev = __find_mdev_device(parent, uuid);
> +	if (mdev) {
> +		ret = -EEXIST;
> +		goto create_err;
> +	}
> +
> +	mdev = kzalloc(sizeof(*mdev), GFP_KERNEL);
> +	if (!mdev) {
> +		ret = -ENOMEM;
> +		goto create_err;
> +	}
> +
> +	memcpy(&mdev->uuid, &uuid, sizeof(uuid_le));
> +	mdev->parent = parent;
> +	kref_init(&mdev->ref);
> +
> +	mdev->dev.parent  = dev;
> +	mdev->dev.bus     = &mdev_bus_type;
> +	mdev->dev.release = mdev_device_release;
> +	dev_set_name(&mdev->dev, "%pUl", uuid.b);
> +
> +	ret = device_register(&mdev->dev);
> +	if (ret) {
> +		put_device(&mdev->dev);
> +		goto create_err;
> +	}
> +
> +	ret = mdev_device_create_ops(kobj, mdev);
> +	if (ret)
> +		goto create_failed;
> +
> +	ret = mdev_create_sysfs_files(&mdev->dev, type);
> +	if (ret) {
> +		mdev_device_remove_ops(mdev, true);
> +		goto create_failed;
> +	}
> +
> +	mdev->type_kobj = kobj;
> +	dev_dbg(&mdev->dev, "MDEV: created\n");
> +
> +	return ret;
> +
> +create_failed:
> +	device_unregister(&mdev->dev);
> +
> +create_err:
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
> +int mdev_device_remove(struct device *dev, void *data)

I understand this void* is to be able to call this from
device_for_each_child(), but let's create a callback wrapper for that
path that converts data to bool, we really don't want to use void args
except where necessary.  IOW,

static int mdev_device_remove_cb(struct device *dev, void *data)
{
	return mdev_device_remove(dev, data ? *(bool *)data : true);
}

int mdev_device_remove(struct device *dev, bool force_remove)
> +{
> +	struct mdev_device *mdev;
> +	struct parent_device *parent;
> +	struct mdev_type *type;
> +	bool force_remove = true;
> +	int ret = 0;
> +
> +	if (!dev_is_mdev(dev))
> +		return 0;
> +
> +	mdev = to_mdev_device(dev);
> +	parent = mdev->parent;
> +	type = to_mdev_type(mdev->type_kobj);
> +
> +	if (data)
> +		force_remove = *(bool *)data;
> +
> +	ret = mdev_device_remove_ops(mdev, force_remove);
> +	if (ret)
> +		return ret;
> +
> +	mdev_remove_sysfs_files(dev, type);
> +	device_unregister(dev);
> +	mdev_put_parent(parent);
> +	return ret;
> +}
> +
[snip]
> diff --git a/drivers/vfio/mdev/mdev_sysfs.c b/drivers/vfio/mdev/mdev_sysfs.c
> new file mode 100644
> index 000000000000..228698f46234
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_sysfs.c
[snip]
> +
> +static ssize_t create_store(struct kobject *kobj, struct device *dev,
> +			    const char *buf, size_t count)
> +{
> +	char *str;
> +	uuid_le uuid;
> +	int ret;
> +
> +	str = kstrndup(buf, count, GFP_KERNEL);

We can sanity test @count against the buffer formats that uuid_le_to_bin()
is able to parse, to make this safer.

> +	if (!str)
> +		return -ENOMEM;
> +
> +	ret = uuid_le_to_bin(str, &uuid);
> +	if (!ret) {
> +
> +		ret = mdev_device_create(kobj, dev, uuid);
> +		if (ret)
> +			pr_err("mdev_create: Failed to create mdev device\n");
> +		else
> +			ret = count;
> +	}
> +
> +	kfree(str);
> +	return ret;
> +}
> +
[snip]
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..93c177609efe
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,178 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +#include <uapi/linux/vfio.h>
> +
> +struct parent_device;
> +
> +/* Mediated device */
> +struct mdev_device {
> +	struct device		dev;
> +	struct parent_device	*parent;
> +	struct iommu_group	*group;

There's already an iommu_group pointer on struct device, isn't this a
duplicate?

> +	uuid_le			uuid;
> +	void			*driver_data;
> +
> +	/* internal only */
> +	struct kref		ref;
> +	struct list_head	next;
> +	struct kobject		*type_kobj;
> +};
> +
> +
> +/**
> + * struct parent_ops - Structure to be registered for each parent device to
> + * register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Attributes of the parent device.
> + * @mdev_attr_groups:	Attributes of the mediated device.
> + * @supported_type_groups: Attributes to define supported types. It is mandatory
> + *			to provide supported types.
> + * @create:		Called to allocate basic resources in parent device's
> + *			driver for a particular mediated device. It is
> + *			mandatory to provide create ops.
> + *			@kobj: kobject of type for which 'create' is called.
> + *			@mdev: mdev_device structure of the mediated
> + *			       device that is being created
> + *			Returns integer: success (0) or error (< 0)
> + * @remove:		Called to free resources in parent device's driver for
> + *			a mediated device. It is mandatory to provide 'remove'
> + *			ops.
> + *			@mdev: mdev_device device structure which is being
> + *			       destroyed
> + *			Returns integer: success (0) or error (< 0)
> + * @open:		Open mediated device.
> + *			@mdev: mediated device.
> + *			Returns integer: success (0) or error (< 0)
> + * @release:		release mediated device
> + *			@mdev: mediated device.
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number of bytes to read
> + *			@pos: address.
> + *			Returns number of bytes read on success or error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number of bytes to be written
> + *			@pos: address.
> + *			Returns number of bytes written on success or error.
> + * @ioctl:		IOCTL callback
> + *			@mdev: mediated device structure
> + *			@cmd: ioctl command
> + *			@arg: arguments to ioctl
> + * @mmap:		mmap callback
> + * A parent device that supports mediated devices should be registered with
> + * the mdev module along with its parent_ops structure.
> + */
> +
> +struct parent_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +	struct attribute_group **supported_type_groups;
> +
> +	int     (*create)(struct kobject *kobj, struct mdev_device *mdev);
> +	int     (*remove)(struct mdev_device *mdev);
> +	int     (*open)(struct mdev_device *mdev);
> +	void    (*release)(struct mdev_device *mdev);
> +	ssize_t (*read)(struct mdev_device *mdev, char *buf, size_t count,
> +			loff_t pos);

char __user *buf

> +	ssize_t (*write)(struct mdev_device *mdev, char *buf, size_t count,
> +			 loff_t pos);

const char __user *buf

> +	ssize_t (*ioctl)(struct mdev_device *mdev, unsigned int cmd,
> +			 unsigned long arg);
> +	int	(*mmap)(struct mdev_device *mdev, struct vm_area_struct *vma);
> +};
> +
> +/* Parent Device */
> +struct parent_device {
> +	struct device		*dev;
> +	const struct parent_ops	*ops;
> +
> +	/* internal */
> +	struct kref		ref;
> +	struct list_head	next;
> +	struct kset *mdev_types_kset;
> +	struct list_head	type_list;
> +};
> +
> +/* interface for exporting mdev supported type attributes */
> +struct mdev_type_attribute {
> +	struct attribute attr;
> +	ssize_t (*show)(struct kobject *kobj, struct device *dev, char *buf);
> +	ssize_t (*store)(struct kobject *kobj, struct device *dev,
> +			 const char *buf, size_t count);
> +};
> +
> +#define MDEV_TYPE_ATTR(_name, _mode, _show, _store)		\
> +struct mdev_type_attribute mdev_type_attr_##_name =		\
> +	__ATTR(_name, _mode, _show, _store)
> +#define MDEV_TYPE_ATTR_RW(_name) \
> +	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_RW(_name)
> +#define MDEV_TYPE_ATTR_RO(_name) \
> +	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_RO(_name)
> +#define MDEV_TYPE_ATTR_WO(_name) \
> +	struct mdev_type_attribute mdev_type_attr_##_name = __ATTR_WO(_name)
> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}
> +
> +static inline void *mdev_get_drvdata(struct mdev_device *mdev)
> +{
> +	return mdev->driver_data;
> +}
> +
> +static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
> +{
> +	mdev->driver_data = data;
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct parent_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +#endif /* MDEV_H */

^ permalink raw reply	[flat|nested] 73+ messages in thread


* Re: [PATCH v8 2/6] vfio: VFIO based driver for Mediated devices
  2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-11  3:55     ` Alex Williamson
  -1 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-11  3:55 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Tue, 11 Oct 2016 01:58:33 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> vfio_mdev driver registers with mdev core driver.
> MDEV core driver creates mediated device and calls probe routine of
> vfio_mdev driver for each device.
> Probe routine of vfio_mdev driver adds mediated device to VFIO core module.
> 
> This driver forms a shim layer that passes VFIO device operations
> through to the vendor driver for mediated devices.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
> ---
>  drivers/vfio/mdev/Kconfig           |   6 ++
>  drivers/vfio/mdev/Makefile          |   1 +
>  drivers/vfio/mdev/vfio_mdev.c       | 171 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/vfio_pci_private.h |   6 +-
>  4 files changed, 181 insertions(+), 3 deletions(-)
>  create mode 100644 drivers/vfio/mdev/vfio_mdev.c

Looking pretty good so far, a few preliminary comments below.  Thanks,

Alex

> 
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> index 23d5b9d08a5c..e1b23697261d 100644
> --- a/drivers/vfio/mdev/Kconfig
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -9,4 +9,10 @@ config VFIO_MDEV
>  
>          If you don't know what to do here, say N.
>  
> +config VFIO_MDEV_DEVICE
> +    tristate "VFIO support for Mediated devices"
> +    depends on VFIO && VFIO_MDEV
> +    default n
> +    help
> +        VFIO based driver for mediated devices.
>  
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> index 56a75e689582..e5087ed83a34 100644
> --- a/drivers/vfio/mdev/Makefile
> +++ b/drivers/vfio/mdev/Makefile
> @@ -2,4 +2,5 @@
>  mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
>  
>  obj-$(CONFIG_VFIO_MDEV) += mdev.o
> +obj-$(CONFIG_VFIO_MDEV_DEVICE) += vfio_mdev.o
>  
> diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
> new file mode 100644
> index 000000000000..1efc3f309510
> --- /dev/null
> +++ b/drivers/vfio/mdev/vfio_mdev.c
> @@ -0,0 +1,171 @@
> +/*
> + * VFIO based driver for Mediated device
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "NVIDIA Corporation"
> +#define DRIVER_DESC     "VFIO based driver for Mediated device"
> +
> +struct vfio_mdev {
> +	struct iommu_group *group;

a) this is never used, b) it's redundant, there are other ways to get to it.

> +	struct mdev_device *mdev;

Which leaves us with just *mdev, so do we even need a struct vfio_mdev?

> +};
> +
> +static int vfio_mdev_open(void *device_data)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +	int ret;
> +
> +	if (unlikely(!parent->ops->open))
> +		return -EINVAL;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +
> +	ret = parent->ops->open(vmdev->mdev);
> +	if (ret)
> +		module_put(THIS_MODULE);
> +
> +	return ret;
> +}
> +
> +static void vfio_mdev_release(void *device_data)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +
> +	if (parent->ops->release)
> +		parent->ops->release(vmdev->mdev);
> +
> +	module_put(THIS_MODULE);
> +}
> +
> +static long vfio_mdev_unlocked_ioctl(void *device_data,
> +				     unsigned int cmd, unsigned long arg)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +
> +	if (unlikely(!parent->ops->ioctl))
> +		return -EINVAL;
> +
> +	return parent->ops->ioctl(vmdev->mdev, cmd, arg);
> +}
> +
> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
> +			      size_t count, loff_t *ppos)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +
> +	if (unlikely(!parent->ops->read))
> +		return -EINVAL;
> +
> +	return parent->ops->read(vmdev->mdev, buf, count, *ppos);
> +}
> +
> +static ssize_t vfio_mdev_write(void *device_data, const char __user *buf,
> +			       size_t count, loff_t *ppos)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +
> +	if (unlikely(!parent->ops->write))
> +		return -EINVAL;
> +
> +	return parent->ops->write(vmdev->mdev, (char *)buf, count, *ppos);

We should not be losing the attributes on buf here, struct parent_ops
should be updated so the vendor driver needs to also use the correct
attributes.

> +}
> +
> +static int vfio_mdev_mmap(void *device_data, struct vm_area_struct *vma)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +
> +	if (unlikely(!parent->ops->mmap))
> +		return -EINVAL;
> +
> +	return parent->ops->mmap(vmdev->mdev, vma);
> +}
> +
> +static const struct vfio_device_ops vfio_mdev_dev_ops = {
> +	.name		= "vfio-mdev",
> +	.open		= vfio_mdev_open,
> +	.release	= vfio_mdev_release,
> +	.ioctl		= vfio_mdev_unlocked_ioctl,
> +	.read		= vfio_mdev_read,
> +	.write		= vfio_mdev_write,
> +	.mmap		= vfio_mdev_mmap,
> +};
> +
> +int vfio_mdev_probe(struct device *dev)
> +{
> +	struct vfio_mdev *vmdev;
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
> +	if (!vmdev)
> +		return -ENOMEM;
> +
> +	vmdev->mdev = mdev;
> +	vmdev->group = mdev->group;
> +
> +	ret = vfio_add_group_dev(dev, &vfio_mdev_dev_ops, vmdev);
> +	if (ret)
> +		kfree(vmdev);
> +
> +	return ret;
> +}
> +
> +void vfio_mdev_remove(struct device *dev)
> +{
> +	struct vfio_mdev *vmdev;
> +
> +	vmdev = vfio_del_group_dev(dev);
> +	kfree(vmdev);
> +}
> +
> +struct mdev_driver vfio_mdev_driver = {
> +	.name	= "vfio_mdev",
> +	.probe	= vfio_mdev_probe,
> +	.remove	= vfio_mdev_remove,
> +};
> +
> +static int __init vfio_mdev_init(void)
> +{
> +	return mdev_register_driver(&vfio_mdev_driver, THIS_MODULE);
> +}
> +
> +static void __exit vfio_mdev_exit(void)
> +{
> +	mdev_unregister_driver(&vfio_mdev_driver);
> +}
> +
> +module_init(vfio_mdev_init)
> +module_exit(vfio_mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 016c14a1b454..776cc2b063d4 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -21,9 +21,9 @@
>  
>  #define VFIO_PCI_OFFSET_SHIFT   40
>  
> -#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
> -#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> -#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)

Gratuitous white space changes, drop this chunk/file

>  
>  /* Special capability IDs predefined access */
>  #define PCI_CAP_ID_INVALID		0xFF	/* default raw access */


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v8 2/6] vfio: VFIO based driver for Mediated devices
@ 2016-10-11  3:55     ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-11  3:55 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Tue, 11 Oct 2016 01:58:33 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> vfio_mdev driver registers with mdev core driver.
> MDEV core driver creates mediated device and calls probe routine of
> vfio_mdev driver for each device.
> Probe routine of vfio_mdev driver adds mediated device to VFIO core module
> 
> This driver forms a shim layer that pass through VFIO devices operations
> to vendor driver for mediated devices.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
> ---
>  drivers/vfio/mdev/Kconfig           |   6 ++
>  drivers/vfio/mdev/Makefile          |   1 +
>  drivers/vfio/mdev/vfio_mdev.c       | 171 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/vfio_pci_private.h |   6 +-
>  4 files changed, 181 insertions(+), 3 deletions(-)
>  create mode 100644 drivers/vfio/mdev/vfio_mdev.c

Looking pretty good so far, a few preliminary comments below.  Thanks,

Alex

> 
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> index 23d5b9d08a5c..e1b23697261d 100644
> --- a/drivers/vfio/mdev/Kconfig
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -9,4 +9,10 @@ config VFIO_MDEV
>  
>          If you don't know what do here, say N.
>  
> +config VFIO_MDEV_DEVICE
> +    tristate "VFIO support for Mediated devices"
> +    depends on VFIO && VFIO_MDEV
> +    default n
> +    help
> +        VFIO based driver for mediated devices.
>  
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> index 56a75e689582..e5087ed83a34 100644
> --- a/drivers/vfio/mdev/Makefile
> +++ b/drivers/vfio/mdev/Makefile
> @@ -2,4 +2,5 @@
>  mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
>  
>  obj-$(CONFIG_VFIO_MDEV) += mdev.o
> +obj-$(CONFIG_VFIO_MDEV_DEVICE) += vfio_mdev.o
>  
> diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
> new file mode 100644
> index 000000000000..1efc3f309510
> --- /dev/null
> +++ b/drivers/vfio/mdev/vfio_mdev.c
> @@ -0,0 +1,171 @@
> +/*
> + * VFIO based driver for Mediated device
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "NVIDIA Corporation"
> +#define DRIVER_DESC     "VFIO based driver for Mediated device"
> +
> +struct vfio_mdev {
> +	struct iommu_group *group;

a) this is never used, b) it's redundant, there are other ways to get to it.

> +	struct mdev_device *mdev;

Which leaves us with just *mdev, so do we even need a struct vfio_mdev?

> +};
> +
> +static int vfio_mdev_open(void *device_data)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +	int ret;
> +
> +	if (unlikely(!parent->ops->open))
> +		return -EINVAL;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +
> +	ret = parent->ops->open(vmdev->mdev);
> +	if (ret)
> +		module_put(THIS_MODULE);
> +
> +	return ret;
> +}
> +
> +static void vfio_mdev_release(void *device_data)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +
> +	if (parent->ops->release)
> +		parent->ops->release(vmdev->mdev);
> +
> +	module_put(THIS_MODULE);
> +}
> +
> +static long vfio_mdev_unlocked_ioctl(void *device_data,
> +				     unsigned int cmd, unsigned long arg)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +
> +	if (unlikely(!parent->ops->ioctl))
> +		return -EINVAL;
> +
> +	return parent->ops->ioctl(vmdev->mdev, cmd, arg);
> +}
> +
> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
> +			      size_t count, loff_t *ppos)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +
> +	if (unlikely(!parent->ops->read))
> +		return -EINVAL;
> +
> +	return parent->ops->read(vmdev->mdev, buf, count, *ppos);
> +}
> +
> +static ssize_t vfio_mdev_write(void *device_data, const char __user *buf,
> +			       size_t count, loff_t *ppos)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +
> +	if (unlikely(!parent->ops->write))
> +		return -EINVAL;
> +
> +	return parent->ops->write(vmdev->mdev, (char *)buf, count, *ppos);

We should not be losing the attributes on buf here; struct parent_ops
should be updated so that the vendor driver also uses the correct
attributes.

> +}
> +
> +static int vfio_mdev_mmap(void *device_data, struct vm_area_struct *vma)
> +{
> +	struct vfio_mdev *vmdev = device_data;
> +	struct parent_device *parent = vmdev->mdev->parent;
> +
> +	if (unlikely(!parent->ops->mmap))
> +		return -EINVAL;
> +
> +	return parent->ops->mmap(vmdev->mdev, vma);
> +}
> +
> +static const struct vfio_device_ops vfio_mdev_dev_ops = {
> +	.name		= "vfio-mdev",
> +	.open		= vfio_mdev_open,
> +	.release	= vfio_mdev_release,
> +	.ioctl		= vfio_mdev_unlocked_ioctl,
> +	.read		= vfio_mdev_read,
> +	.write		= vfio_mdev_write,
> +	.mmap		= vfio_mdev_mmap,
> +};
> +
> +int vfio_mdev_probe(struct device *dev)
> +{
> +	struct vfio_mdev *vmdev;
> +	struct mdev_device *mdev = to_mdev_device(dev);
> +	int ret;
> +
> +	vmdev = kzalloc(sizeof(*vmdev), GFP_KERNEL);
> +	if (!vmdev)
> +		return -ENOMEM;
> +
> +	vmdev->mdev = mdev;
> +	vmdev->group = mdev->group;
> +
> +	ret = vfio_add_group_dev(dev, &vfio_mdev_dev_ops, vmdev);
> +	if (ret)
> +		kfree(vmdev);
> +
> +	return ret;
> +}
> +
> +void vfio_mdev_remove(struct device *dev)
> +{
> +	struct vfio_mdev *vmdev;
> +
> +	vmdev = vfio_del_group_dev(dev);
> +	kfree(vmdev);
> +}
> +
> +struct mdev_driver vfio_mdev_driver = {
> +	.name	= "vfio_mdev",
> +	.probe	= vfio_mdev_probe,
> +	.remove	= vfio_mdev_remove,
> +};
> +
> +static int __init vfio_mdev_init(void)
> +{
> +	return mdev_register_driver(&vfio_mdev_driver, THIS_MODULE);
> +}
> +
> +static void __exit vfio_mdev_exit(void)
> +{
> +	mdev_unregister_driver(&vfio_mdev_driver);
> +}
> +
> +module_init(vfio_mdev_init)
> +module_exit(vfio_mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 016c14a1b454..776cc2b063d4 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -21,9 +21,9 @@
>  
>  #define VFIO_PCI_OFFSET_SHIFT   40
>  
> -#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
> -#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> -#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)

Gratuitous white space changes, drop this chunk/file

>  
>  /* Special capability IDs predefined access */
>  #define PCI_CAP_ID_INVALID		0xFF	/* default raw access */

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
  (?)
@ 2016-10-11 14:14   ` Daniel P. Berrange
  2016-10-11 20:44       ` Kirti Wankhede
  -1 siblings, 1 reply; 73+ messages in thread
From: Daniel P. Berrange @ 2016-10-11 14:14 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, jike.song, kvm,
	kevin.tian, qemu-devel, bjsdjshi

On Tue, Oct 11, 2016 at 01:58:35AM +0530, Kirti Wankhede wrote:
> Add file Documentation/vfio-mediated-device.txt that includes details of
> the mediated device framework.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
> ---
>  Documentation/vfio-mdev/vfio-mediated-device.txt | 219 +++++++++++++++++++++++
>  1 file changed, 219 insertions(+)
>  create mode 100644 Documentation/vfio-mdev/vfio-mediated-device.txt
> 
> diff --git a/Documentation/vfio-mdev/vfio-mediated-device.txt b/Documentation/vfio-mdev/vfio-mediated-device.txt
> new file mode 100644
> index 000000000000..c1eacb83807b
> --- /dev/null
> +++ b/Documentation/vfio-mdev/vfio-mediated-device.txt
> @@ -0,0 +1,219 @@
> +/*
> + * VFIO Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.

Adding "All rights reserved" is bogus since you're providing it under
the GPL, but I see countless kernel source files have this, so meh.

> + *     Author: Neo Jia <cjia@nvidia.com>
> + *             Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +

> +Mediated device management interface via sysfs
> +----------------------------------------------
> +The management interface via sysfs allows user space software, such as
> +libvirt, to query and configure mediated devices in a hardware-agnostic
> +fashion. This management interface gives the underlying physical device's
> +driver the flexibility to support mediated device hotplug, multiple
> +mediated devices per virtual machine, multiple mediated devices from
> +different physical devices, etc.
> +
> +Under per-physical device sysfs:
> +--------------------------------
> +
> +* mdev_supported_types:
> +    The list of currently supported mediated device types and their
> +details is added in this directory in the following format:
> +
> +|- <parent phy device>
> +|--- Vendor-specific-attributes [optional]
> +|--- mdev_supported_types
> +|     |--- <type id>
> +|     |   |--- create
> +|     |   |--- name
> +|     |   |--- available_instances
> +|     |   |--- description /class
> +|     |   |--- [devices]
> +|     |--- <type id>
> +|     |   |--- create
> +|     |   |--- name
> +|     |   |--- available_instances
> +|     |   |--- description /class
> +|     |   |--- [devices]
> +|     |--- <type id>
> +|          |--- create
> +|          |--- name
> +|          |--- available_instances
> +|          |--- description /class
> +|          |--- [devices]
> +
> +[TBD : description or class is yet to be decided. This will change.]

I thought that in previous discussions we had agreed to drop
the <type id> concept and use the name as the unique identifier.
When reporting these types in libvirt we won't want to report
the type id values - we'll want the name strings to be unique.

Based on this sysfs spec, the only fields we would report in
libvirt would be name + available_instances.

> +Under per mdev device:
> +----------------------
> +
> +|- <parent phy device>
> +|--- $MDEV_UUID
> +         |--- remove
> +         |--- {link to its type}
> +         |--- vendor-specific-attributes [optional]

Again, I thought we'd agreed to not have arbitrary vendor
specific attributes ?

That said, I don't mind them existing in kernel sysfs, just
be aware that we'll *not* expose any vendor specific attributes
in libvirt, so your functional implementation must not rely on
these attributes being used in any way by libvirt.



> +
> +* remove: (write only)
> +	Writing '1' to the 'remove' file destroys the mdev device. The vendor
> +	driver can fail the remove() callback if the device is active and the
> +	vendor driver doesn't support hot-unplug.
> +	Example:
> +	# echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove

> +Mediated device Hotplug:
> +------------------------
> +
> +Mediated devices can be created and assigned at runtime. The procedure to
> +hot-plug a mediated device is the same as for hot-plugging a PCI device.

Generally this looks much saner now all the grouping stuff has gone.
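For concreteness, the create/remove flow in the quoted document can be sketched as a shell session. The UUID, parent device address, and type id below are illustrative stand-ins, not values from the patch; real type ids are vendor-defined, and the writes themselves (commented out) would need root and a loaded vendor driver.

```shell
# Illustrative values only -- substitute a real parent device and type id.
UUID=83b8f4f2-509f-382f-3c1e-e6bfe0fa1001
PARENT=/sys/class/mdev_bus/0000:00:02.0
TYPE=type1-small

CREATE="$PARENT/mdev_supported_types/$TYPE/create"
REMOVE="/sys/bus/mdev/devices/$UUID/remove"

# On a real system one would run (as root):
#   echo "$UUID" > "$CREATE"   # create the mdev instance of this type
#   echo 1 > "$REMOVE"         # destroy it (may fail if device is active)
echo "create via $CREATE"
echo "remove via $REMOVE"
```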



Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://entangle-photo.org       -o-    http://search.cpan.org/~danberr/ :|

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 1/6] vfio: Mediated device Core driver
  2016-10-11  3:51     ` [Qemu-devel] " Alex Williamson
@ 2016-10-11 20:13       ` Kirti Wankhede
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-11 20:13 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi



On 10/11/2016 9:21 AM, Alex Williamson wrote:
> On Tue, 11 Oct 2016 01:58:32 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>> ---
>>  drivers/vfio/Kconfig             |   1 +
>>  drivers/vfio/Makefile            |   1 +
>>  drivers/vfio/mdev/Kconfig        |  12 ++
>>  drivers/vfio/mdev/Makefile       |   5 +
>>  drivers/vfio/mdev/mdev_core.c    | 363 +++++++++++++++++++++++++++++++++++++++
>>  drivers/vfio/mdev/mdev_driver.c  | 131 ++++++++++++++
>>  drivers/vfio/mdev/mdev_private.h |  41 +++++
>>  drivers/vfio/mdev/mdev_sysfs.c   | 295 +++++++++++++++++++++++++++++++
>>  include/linux/mdev.h             | 178 +++++++++++++++++++
>>  9 files changed, 1027 insertions(+)
>>  create mode 100644 drivers/vfio/mdev/Kconfig
>>  create mode 100644 drivers/vfio/mdev/Makefile
>>  create mode 100644 drivers/vfio/mdev/mdev_core.c
>>  create mode 100644 drivers/vfio/mdev/mdev_driver.c
>>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>>  create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
>>  create mode 100644 include/linux/mdev.h
> 
> 
> Overall this is heading in a good direction.  What kernel is this
> series against?  I could only apply it to v4.7, yet some of the
> dependencies claimed in the cover letter are only in v4.8.  linux-next
> or v4.8 are both good baselines right now, as we move to v4.9-rc
> releases, linux-next probably becomes a better target.
> 

Thanks Alex.

Yes, this series is against kernel v4.7. Patches 1-5 apply cleanly to
linux-next; patch 6/6 has conflicts against linux-next.

I'm preparing next version of this patch set against linux-next.

Thanks,
Kirti.


> A few initial comments below, I'll likely have more as I wrap my head
> around it.  Thanks,
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread


* Re: [PATCH v8 2/6] vfio: VFIO based driver for Mediated devices
  2016-10-11  3:55     ` [Qemu-devel] " Alex Williamson
@ 2016-10-11 20:24       ` Kirti Wankhede
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-11 20:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi



On 10/11/2016 9:25 AM, Alex Williamson wrote:
> On Tue, 11 Oct 2016 01:58:33 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> The vfio_mdev driver registers with the mdev core driver.
>> The mdev core driver creates mediated devices and calls the probe routine
>> of the vfio_mdev driver for each device.
>> The probe routine of the vfio_mdev driver adds the mediated device to the
>> VFIO core module.
>>
>> This driver forms a shim layer that passes VFIO device operations through
>> to the vendor driver for mediated devices.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Signed-off-by: Neo Jia <cjia@nvidia.com>
>> Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
>> ---
>>  drivers/vfio/mdev/Kconfig           |   6 ++
>>  drivers/vfio/mdev/Makefile          |   1 +
>>  drivers/vfio/mdev/vfio_mdev.c       | 171 ++++++++++++++++++++++++++++++++++++
>>  drivers/vfio/pci/vfio_pci_private.h |   6 +-
>>  4 files changed, 181 insertions(+), 3 deletions(-)
>>  create mode 100644 drivers/vfio/mdev/vfio_mdev.c
> 
> Looking pretty good so far, a few preliminary comments below.  Thanks,
> 
> Alex
> 

Thanks Alex.

I'm preparing next patch with your suggestions here. Also let us know if
you have any more comments.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread


* Re: [Qemu-devel] [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-11 14:14   ` Daniel P. Berrange
@ 2016-10-11 20:44       ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-11 20:44 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: alex.williamson, pbonzini, kraxel, cjia, jike.song, kvm,
	kevin.tian, qemu-devel, bjsdjshi



On 10/11/2016 7:44 PM, Daniel P. Berrange wrote:
> On Tue, Oct 11, 2016 at 01:58:35AM +0530, Kirti Wankhede wrote:
>> Add file Documentation/vfio-mediated-device.txt that includes details of
>> the mediated device framework.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Signed-off-by: Neo Jia <cjia@nvidia.com>
>> Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
>> ---
>>  Documentation/vfio-mdev/vfio-mediated-device.txt | 219 +++++++++++++++++++++++
>>  1 file changed, 219 insertions(+)
>>  create mode 100644 Documentation/vfio-mdev/vfio-mediated-device.txt
>>
>> diff --git a/Documentation/vfio-mdev/vfio-mediated-device.txt b/Documentation/vfio-mdev/vfio-mediated-device.txt
>> new file mode 100644
>> index 000000000000..c1eacb83807b
>> --- /dev/null
>> +++ b/Documentation/vfio-mdev/vfio-mediated-device.txt
>> @@ -0,0 +1,219 @@
>> +/*
>> + * VFIO Mediated devices
>> + *
>> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> 
> Adding "All rights reserved" is bogus since you're providing it under
> the GPL, but I see countless kernel source files have this, so meh.
> 
>> + *     Author: Neo Jia <cjia@nvidia.com>
>> + *             Kirti Wankhede <kwankhede@nvidia.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + */
>> +
> 
>> +Mediated device management interface via sysfs
>> +----------------------------------------------
>> +The management interface via sysfs allows user space software, such as
>> +libvirt, to query and configure mediated devices in a hardware-agnostic
>> +fashion. This management interface gives the underlying physical device's
>> +driver the flexibility to support mediated device hotplug, multiple
>> +mediated devices per virtual machine, multiple mediated devices from
>> +different physical devices, etc.
>> +
>> +Under per-physical device sysfs:
>> +--------------------------------
>> +
>> +* mdev_supported_types:
>> +    The list of currently supported mediated device types and their
>> +details is added in this directory in the following format:
>> +
>> +|- <parent phy device>
>> +|--- Vendor-specific-attributes [optional]
>> +|--- mdev_supported_types
>> +|     |--- <type id>
>> +|     |   |--- create
>> +|     |   |--- name
>> +|     |   |--- available_instances
>> +|     |   |--- description /class
>> +|     |   |--- [devices]
>> +|     |--- <type id>
>> +|     |   |--- create
>> +|     |   |--- name
>> +|     |   |--- available_instances
>> +|     |   |--- description /class
>> +|     |   |--- [devices]
>> +|     |--- <type id>
>> +|          |--- create
>> +|          |--- name
>> +|          |--- available_instances
>> +|          |--- description /class
>> +|          |--- [devices]
>> +
>> +[TBD : description or class is yet to be decided. This will change.]
> 
> I thought that in previous discussions we had agreed to drop
> the <type id> concept and use the name as the unique identifier.
> When reporting these types in libvirt we won't want to report
> the type id values - we'll want the name strings to be unique.
> 

The 'name' might not be unique, but type_id will be. For example, as Neo
pointed out in an earlier discussion, virtual devices can come from two
different physical devices; the end user would be presented with what they
had selected, but there would be internal implementation differences. In
that case 'type_id' will be unique.

> Based on this sysfs spec, the only fields we would report in
> libvirt would be name + available_instances.
> 
>> +Under per mdev device:
>> +----------------------
>> +
>> +|- <parent phy device>
>> +|--- $MDEV_UUID
>> +         |--- remove
>> +         |--- {link to its type}
>> +         |--- vendor-specific-attributes [optional]
> 
> Again, I thought we'd agreed to not have arbitrary vendor
> specific attributes ?
> 
> That said, I don't mind them existing in kernel sysfs, just
> be aware that we'll *not* expose any vendor specific attributes
> in libvirt, so your functional implementation must not rely on
> these attributes being used in any way by libvirt.
> 
> 

Right, libvirt would not use vendor-specific attributes, but an admin can
use them to get/set extra information for a particular device. They are
optional, so it's up to the vendor whether to provide such attributes.

Thanks,
Kirti

> 
>> +
>> +* remove: (write only)
>> +	Writing '1' to the 'remove' file destroys the mdev device. The vendor
>> +	driver can fail the remove() callback if the device is active and the
>> +	vendor driver doesn't support hot-unplug.
>> +	Example:
>> +	# echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove
> 
>> +Mediated device Hotplug:
>> +------------------------
>> +
>> +Mediated devices can be created and assigned at runtime. The procedure to
>> +hot-plug a mediated device is the same as for hot-plugging a PCI device.
> 
> Generally this looks much saner now all the grouping stuff has gone.
> 
> 
> 
> Regards,
> Daniel
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread


* Re: [PATCH v8 3/6] vfio iommu: Add support for mediated devices
  2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-11 22:06     ` Alex Williamson
  -1 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-11 22:06 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Tue, 11 Oct 2016 01:58:34 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VFIO IOMMU drivers are designed for devices which are IOMMU capable.
> Mediated devices only use IOMMU APIs; the underlying hardware can be
> managed by an IOMMU domain.
> 
> Aim of this change is:
> - To use most of the code of TYPE1 IOMMU driver for mediated devices
> - To support direct assigned device and mediated device in single module
> 
> Added two new callback functions to struct vfio_iommu_driver_ops. A backend
> IOMMU module that supports pinning and unpinning pages for mdev devices
> should provide these functions.
> Added APIs for pinning and unpinning pages to the VFIO module. These call
> back into the backend iommu module to actually pin and unpin pages.
> 
> This change adds pin and unpin support for mediated devices to the TYPE1
> IOMMU backend module. More details:
> - When the iommu_group of a mediated device is attached, the task structure
>   is cached; it is used later to pin pages and for page accounting.
> - It keeps track of pinned pages for the mediated domain. This data is used
>   to verify unpinning requests and to unpin any remaining pages while
>   detaching.
> - Uses the existing mechanism for page accounting. If an iommu capable
>   domain exists in the container then all pages are already pinned and
>   accounted. Accounting for an mdev device is only done if there is no
>   iommu capable domain in the container.
> - Page accounting is updated on hot plug and unplug of mdev and pass
>   through devices.
> 
> Tested by assigning below combinations of devices to a single VM:
> - GPU pass through only
> - vGPU device only
> - One GPU pass through and one vGPU device
> - Linux VM hot plug and unplug vGPU device while GPU pass through device
>   exists
> - Linux VM hot plug and unplug GPU pass through device while vGPU device
>   exists
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
> ---
>  drivers/vfio/vfio.c             | 117 +++++++
>  drivers/vfio/vfio_iommu_type1.c | 685 ++++++++++++++++++++++++++++++++++------
>  include/linux/vfio.h            |  13 +-
>  3 files changed, 724 insertions(+), 91 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 6fd6fa5469de..e3e342861e04 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1782,6 +1782,123 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
>  }
>  EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
>  
> +static struct vfio_group *vfio_group_from_dev(struct device *dev)
> +{
> +	struct vfio_device *device;
> +	struct vfio_group *group;
> +	int ret;
> +
> +	device = vfio_device_get_from_dev(dev);

Note how this does dev->iommu_group->vfio_group->vfio_device and then
we back out one level to get the vfio_group; it's not a terribly
lightweight path.  Perhaps we should have:

struct vfio_group *vfio_group_get_from_dev(struct device *dev)
{
        struct iommu_group *iommu_group;
        struct vfio_group *group;

        iommu_group = iommu_group_get(dev);
        if (!iommu_group)
                return NULL;

        group = vfio_group_get_from_iommu(iommu_group);
	iommu_group_put(iommu_group);

	return group;
}

vfio_device_get_from_dev() would make use of this.

Then create a separate:

static int vfio_group_add_container_user(struct vfio_group *group)
{

> +	if (!atomic_inc_not_zero(&group->container_users)) {
		return -EINVAL;
> +	}
> +
> +	if (group->noiommu) {
> +		atomic_dec(&group->container_users);
		return -EPERM;
> +	}
> +
> +	if (!group->container->iommu_driver ||
> +	    !vfio_group_viable(group)) {
> +		atomic_dec(&group->container_users);
		return -EINVAL;
> +	}
> +
	return 0;
}

vfio_group_get_external_user() would be updated to use this.  In fact,
creating these two functions and updating the existing code to use
these should be a separate patch.

Note that your version did not hold a group reference while doing the
pin/unpin operations below, which seems like a bug.

> +
> +err_ret:
> +	vfio_device_put(device);
> +	return ERR_PTR(ret);
> +}
> +
> +/*
> + * Pin a set of guest PFNs and return their associated host PFNs for local
> + * domain only.
> + * @dev [in] : device
> + * @user_pfn [in]: array of user/guest PFNs
> + * @npage [in]: count of array elements
> + * @prot [in] : protection flags
> + * @phys_pfn[out] : array of host PFNs
> + */
> +long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
> +		    long npage, int prot, unsigned long *phys_pfn)
> +{
> +	struct vfio_container *container;
> +	struct vfio_group *group;
> +	struct vfio_iommu_driver *driver;
> +	ssize_t ret = -EINVAL;
> +
> +	if (!dev || !user_pfn || !phys_pfn)
> +		return -EINVAL;
> +
> +	group = vfio_group_from_dev(dev);
> +	if (IS_ERR(group))
> +		return PTR_ERR(group);

As suggested above:

	group = vfio_group_get_from_dev(dev);
	if (!group)
		return -ENODEV;

	ret = vfio_group_add_container_user(group);
	if (ret) {
		vfio_group_put(group);
		return ret;
	}

> +
> +	container = group->container;
> +	if (IS_ERR(container))
> +		return PTR_ERR(container);
> +
> +	down_read(&container->group_lock);
> +
> +	driver = container->iommu_driver;
> +	if (likely(driver && driver->ops->pin_pages))
> +		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
> +					     npage, prot, phys_pfn);
> +
> +	up_read(&container->group_lock);
> +	vfio_group_try_dissolve_container(group);

Even if you're considering that the container_user reference holds the
driver, I think we need a group reference throughout this, and it
should end with a vfio_group_put(group);

> +
> +	return ret;
> +
> +}
> +EXPORT_SYMBOL(vfio_pin_pages);
> +
> +/*
> + * Unpin set of host PFNs for local domain only.
> + * @dev [in] : device
> + * @pfn [in] : array of host PFNs to be unpinned.
> + * @npage [in] :count of elements in array, that is number of pages.
> + */
> +long vfio_unpin_pages(struct device *dev, unsigned long *pfn, long npage)
> +{
> +	struct vfio_container *container;
> +	struct vfio_group *group;
> +	struct vfio_iommu_driver *driver;
> +	ssize_t ret = -EINVAL;
> +
> +	if (!dev || !pfn)
> +		return -EINVAL;
> +
> +	group = vfio_group_from_dev(dev);
> +	if (IS_ERR(group))
> +		return PTR_ERR(group);
> +
> +	container = group->container;
> +	if (IS_ERR(container))
> +		return PTR_ERR(container);
> +
> +	down_read(&container->group_lock);
> +
> +	driver = container->iommu_driver;
> +	if (likely(driver && driver->ops->unpin_pages))
> +		ret = driver->ops->unpin_pages(container->iommu_data, pfn,
> +					       npage);
> +
> +	up_read(&container->group_lock);
> +	vfio_group_try_dissolve_container(group);


Same as above on all counts.

> +	return ret;
> +}
> +EXPORT_SYMBOL(vfio_unpin_pages);
> +
>  /**
>   * Module/class support
>   */
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 2ba19424e4a1..ce6d6dcbd9a8 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -55,18 +55,26 @@ MODULE_PARM_DESC(disable_hugepages,
>  
>  struct vfio_iommu {
>  	struct list_head	domain_list;
> +	struct vfio_domain	*local_domain;
>  	struct mutex		lock;
>  	struct rb_root		dma_list;
>  	bool			v2;
>  	bool			nesting;
>  };
>  
> +struct local_addr_space {
> +	struct task_struct	*task;
> +	struct rb_root		pfn_list;	/* pinned Host pfn list */
> +	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
> +};
> +
>  struct vfio_domain {
>  	struct iommu_domain	*domain;
>  	struct list_head	next;
>  	struct list_head	group_list;
>  	int			prot;		/* IOMMU_CACHE */
>  	bool			fgsp;		/* Fine-grained super pages */
> +	struct local_addr_space	*local_addr_space;
>  };

Consider structure internal alignment, this should be placed below
group_list.

>  
>  struct vfio_dma {
> @@ -75,6 +83,7 @@ struct vfio_dma {
>  	unsigned long		vaddr;		/* Process virtual addr */
>  	size_t			size;		/* Map size (bytes) */
>  	int			prot;		/* IOMMU_READ/WRITE */
> +	bool			iommu_mapped;
>  };
>  
>  struct vfio_group {
> @@ -83,6 +92,22 @@ struct vfio_group {
>  };
>  
>  /*
> + * Guest RAM pinning working set or DMA target
> + */
> +struct vfio_pfn {
> +	struct rb_node		node;
> +	unsigned long		vaddr;		/* virtual addr */
> +	dma_addr_t		iova;		/* IOVA */
> +	unsigned long		pfn;		/* Host pfn */
> +	size_t			prot;

size_t?  Shouldn't this be an int?

> +	atomic_t		ref_count;
> +};
> +
> +
> +#define IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)	\
> +			 (list_empty(&iommu->domain_list) ? false : true)

(!list_empty(...))

> +
> +/*
>   * This code handles mapping and unmapping of user data buffers
>   * into DMA'ble space using the IOMMU
>   */
> @@ -130,6 +155,84 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>  	rb_erase(&old->node, &iommu->dma_list);
>  }
>  
> +/*
> + * Helper Functions for host pfn list
> + */
> +
> +static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
> +				      unsigned long pfn)
> +{
> +	struct rb_node *node;
> +	struct vfio_pfn *vpfn, *ret = NULL;
> +
> +	node = domain->local_addr_space->pfn_list.rb_node;
> +
> +	while (node) {
> +		vpfn = rb_entry(node, struct vfio_pfn, node);
> +
> +		if (pfn < vpfn->pfn)
> +			node = node->rb_left;
> +		else if (pfn > vpfn->pfn)
> +			node = node->rb_right;
> +		else {
> +			ret = vpfn;
> +			break;
> +		}
> +	}
> +
> +	return ret;
> +}

Some unnecessary style differences from vfio_find_dma()

> +
> +static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
> +{
> +	struct rb_node **link, *parent = NULL;
> +	struct vfio_pfn *vpfn;
> +
> +	link = &domain->local_addr_space->pfn_list.rb_node;
> +	while (*link) {
> +		parent = *link;
> +		vpfn = rb_entry(parent, struct vfio_pfn, node);
> +
> +		if (new->pfn < vpfn->pfn)
> +			link = &(*link)->rb_left;
> +		else
> +			link = &(*link)->rb_right;
> +	}
> +
> +	rb_link_node(&new->node, parent, link);
> +	rb_insert_color(&new->node, &domain->local_addr_space->pfn_list);
> +}
> +
> +static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
> +{
> +	rb_erase(&old->node, &domain->local_addr_space->pfn_list);
> +}
> +
> +static int vfio_add_to_pfn_list(struct vfio_domain *domain, unsigned long vaddr,
> +				dma_addr_t iova, unsigned long pfn, size_t prot)

size_t?

> +{
> +	struct vfio_pfn *vpfn;
> +
> +	vpfn = kzalloc(sizeof(*vpfn), GFP_KERNEL);
> +	if (!vpfn)
> +		return -ENOMEM;
> +
> +	vpfn->vaddr = vaddr;
> +	vpfn->iova = iova;
> +	vpfn->pfn = pfn;
> +	vpfn->prot = prot;
> +	atomic_set(&vpfn->ref_count, 1);
> +	vfio_link_pfn(domain, vpfn);
> +	return 0;
> +}
> +
> +static void vfio_remove_from_pfn_list(struct vfio_domain *domain,
> +				      struct vfio_pfn *vpfn)
> +{
> +	vfio_unlink_pfn(domain, vpfn);
> +	kfree(vpfn);
> +}
> +
>  struct vwork {
>  	struct mm_struct	*mm;
>  	long			npage;
> @@ -150,17 +253,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
>  	kfree(vwork);
>  }
>  
> -static void vfio_lock_acct(long npage)
> +static void vfio_lock_acct(struct task_struct *task, long npage)
>  {
>  	struct vwork *vwork;
>  	struct mm_struct *mm;
>  
> -	if (!current->mm || !npage)
> +	if (!task->mm || !npage)
>  		return; /* process exited or nothing to do */
>  
> -	if (down_write_trylock(&current->mm->mmap_sem)) {
> -		current->mm->locked_vm += npage;
> -		up_write(&current->mm->mmap_sem);
> +	if (down_write_trylock(&task->mm->mmap_sem)) {
> +		task->mm->locked_vm += npage;
> +		up_write(&task->mm->mmap_sem);
>  		return;
>  	}
>  
> @@ -172,7 +275,7 @@ static void vfio_lock_acct(long npage)
>  	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
>  	if (!vwork)
>  		return;
> -	mm = get_task_mm(current);
> +	mm = get_task_mm(task);
>  	if (!mm) {
>  		kfree(vwork);
>  		return;
> @@ -228,20 +331,31 @@ static int put_pfn(unsigned long pfn, int prot)
>  	return 0;
>  }
>  
> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> +			 int prot, unsigned long *pfn)
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct mm_struct *local_mm = (mm ? mm : current->mm);
>  	int ret = -EFAULT;
>  
> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> +	if (mm) {
> +		down_read(&local_mm->mmap_sem);
> +		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
> +					!!(prot & IOMMU_WRITE), 0, page, NULL);
> +		up_read(&local_mm->mmap_sem);
> +	} else
> +		ret = get_user_pages_fast(vaddr, 1,
> +					  !!(prot & IOMMU_WRITE), page);
> +
> +	if (ret == 1) {
>  		*pfn = page_to_pfn(page[0]);
>  		return 0;
>  	}
>  
> -	down_read(&current->mm->mmap_sem);
> +	down_read(&local_mm->mmap_sem);
>  
> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
>  
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> @@ -249,7 +363,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>  			ret = 0;
>  	}
>  
> -	up_read(&current->mm->mmap_sem);
> +	up_read(&local_mm->mmap_sem);
>  
>  	return ret;
>  }
> @@ -259,8 +373,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>   * first page and all consecutive pages with the same locking.
>   */
> -static long vfio_pin_pages(unsigned long vaddr, long npage,
> -			   int prot, unsigned long *pfn_base)
> +static long __vfio_pin_pages_remote(unsigned long vaddr, long npage,
> +				    int prot, unsigned long *pfn_base)
>  {
>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  	bool lock_cap = capable(CAP_IPC_LOCK);
> @@ -270,7 +384,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	if (!current->mm)
>  		return -ENODEV;
>  
> -	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> +	ret = vaddr_get_pfn(NULL, vaddr, prot, pfn_base);
>  	if (ret)
>  		return ret;
>  
> @@ -285,7 +399,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  
>  	if (unlikely(disable_hugepages)) {
>  		if (!rsvd)
> -			vfio_lock_acct(1);
> +			vfio_lock_acct(current, 1);
>  		return 1;
>  	}
>  
> @@ -293,7 +407,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
>  		unsigned long pfn = 0;
>  
> -		ret = vaddr_get_pfn(vaddr, prot, &pfn);
> +		ret = vaddr_get_pfn(NULL, vaddr, prot, &pfn);
>  		if (ret)
>  			break;
>  
> @@ -313,13 +427,13 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	}
>  
>  	if (!rsvd)
> -		vfio_lock_acct(i);
> +		vfio_lock_acct(current, i);
>  
>  	return i;
>  }
>  
> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> -			     int prot, bool do_accounting)
> +static long __vfio_unpin_pages_remote(unsigned long pfn, long npage, int prot,
> +				      bool do_accounting)
>  {
>  	unsigned long unlocked = 0;
>  	long i;
> @@ -328,7 +442,188 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
>  		unlocked += put_pfn(pfn++, prot);
>  
>  	if (do_accounting)
> -		vfio_lock_acct(-unlocked);
> +		vfio_lock_acct(current, -unlocked);
> +	return unlocked;
> +}
> +
> +static long __vfio_pin_pages_local(struct vfio_domain *domain,
> +				   unsigned long vaddr, int prot,
> +				   unsigned long *pfn_base,
> +				   bool do_accounting)
> +{
> +	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	bool lock_cap = capable(CAP_IPC_LOCK);
> +	long ret;
> +	bool rsvd;
> +	struct task_struct *task = domain->local_addr_space->task;
> +
> +	if (!task->mm)
> +		return -ENODEV;
> +
> +	ret = vaddr_get_pfn(task->mm, vaddr, prot, pfn_base);
> +	if (ret)
> +		return ret;
> +
> +	rsvd = is_invalid_reserved_pfn(*pfn_base);
> +
> +	if (!rsvd && !lock_cap && task->mm->locked_vm + 1 > limit) {
> +		put_pfn(*pfn_base, prot);
> +		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
> +			limit << PAGE_SHIFT);
> +		return -ENOMEM;
> +	}
> +
> +	if (!rsvd && do_accounting)
> +		vfio_lock_acct(task, 1);
> +
> +	return 1;
> +}
> +
> +static void __vfio_unpin_pages_local(struct vfio_domain *domain,
> +				     unsigned long pfn, int prot,
> +				     bool do_accounting)
> +{
> +	put_pfn(pfn, prot);
> +
> +	if (do_accounting)
> +		vfio_lock_acct(domain->local_addr_space->task, -1);
> +}
> +
> +static int vfio_unpin_pfn(struct vfio_domain *domain,
> +			  struct vfio_pfn *vpfn, bool do_accounting)
> +{
> +	__vfio_unpin_pages_local(domain, vpfn->pfn, vpfn->prot,
> +				 do_accounting);
> +
> +	if (atomic_dec_and_test(&vpfn->ref_count))
> +		vfio_remove_from_pfn_list(domain, vpfn);
> +
> +	return 1;
> +}
> +
> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
> +				       unsigned long *user_pfn,
> +				       long npage, int prot,
> +				       unsigned long *phys_pfn)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain;
> +	int i, j, ret;
> +	long retpage;
> +	unsigned long remote_vaddr;
> +	unsigned long *pfn = phys_pfn;
> +	struct vfio_dma *dma;
> +	bool do_accounting = false;
> +
> +	if (!iommu || !user_pfn || !phys_pfn)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	if (!iommu->local_domain) {
> +		ret = -EINVAL;
> +		goto pin_done;
> +	}
> +
> +	domain = iommu->local_domain;
> +
> +	/*
> +	 * If iommu capable domain exist in the container then all pages are
> +	 * already pinned and accounted. Accounting should be done if there is no
> +	 * iommu capable domain in the container.
> +	 */
> +	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +		dma_addr_t iova;
> +
> +		iova = user_pfn[i] << PAGE_SHIFT;
> +
> +		dma = vfio_find_dma(iommu, iova, 0);
> +		if (!dma) {
> +			ret = -EINVAL;
> +			goto pin_unwind;
> +		}
> +
> +		remote_vaddr = dma->vaddr + iova - dma->iova;
> +
> +		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
> +						 &pfn[i], do_accounting);
> +		if (retpage <= 0) {
> +			WARN_ON(!retpage);
> +			ret = (int)retpage;
> +			goto pin_unwind;
> +		}
> +
> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +		/* search if pfn exist */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p) {
> +			atomic_inc(&p->ref_count);
> +			mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +			continue;
> +		}
> +
> +		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
> +					   pfn[i], prot);
> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +
> +		if (ret) {
> +			__vfio_unpin_pages_local(domain, pfn[i], prot,
> +						 do_accounting);
> +			goto pin_unwind;
> +		}
> +	}
> +
> +	ret = i;
> +	goto pin_done;
> +
> +pin_unwind:
> +	pfn[i] = 0;
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +	for (j = 0; j < i; j++) {
> +		struct vfio_pfn *p;
> +
> +		p = vfio_find_pfn(domain, pfn[j]);
> +		if (p)
> +			vfio_unpin_pfn(domain, p, do_accounting);
> +
> +		pfn[j] = 0;
> +	}
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +
> +pin_done:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
> +					 long npage)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL;
> +	long unlocked = 0;
> +	int i;
> +
> +	if (!iommu || !pfn)
> +		return -EINVAL;
> +

We need iommu->lock here, right?

> +	domain = iommu->local_domain;
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +
> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +		/* verify if pfn exist in pfn_list */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p)
> +			unlocked += vfio_unpin_pfn(domain, p, true);
> +
> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);

We hold this mutex outside the loop in the pin unwind case; why is it
different here?

> +	}
>  
>  	return unlocked;
>  }
> @@ -341,6 +636,12 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  
>  	if (!dma->size)
>  		return;
> +
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		return;
> +
> +	if (!dma->iommu_mapped)
> +		return;
>  	/*
>  	 * We use the IOMMU to track the physical addresses, otherwise we'd
>  	 * need a much more complicated tracking system.  Unfortunately that
> @@ -382,15 +683,16 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  		if (WARN_ON(!unmapped))
>  			break;
>  
> -		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
> -					     unmapped >> PAGE_SHIFT,
> -					     dma->prot, false);
> +		unlocked += __vfio_unpin_pages_remote(phys >> PAGE_SHIFT,
> +						      unmapped >> PAGE_SHIFT,
> +						      dma->prot, false);
>  		iova += unmapped;
>  
>  		cond_resched();
>  	}
>  
> -	vfio_lock_acct(-unlocked);
> +	dma->iommu_mapped = false;
> +	vfio_lock_acct(current, -unlocked);
>  }
>  
>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> @@ -558,17 +860,85 @@ unwind:
>  	return ret;
>  }
>  
> +void vfio_update_accounting(struct vfio_iommu *iommu, struct vfio_dma *dma)
> +{
> +	struct vfio_domain *domain = iommu->local_domain;
> +	struct rb_node *n;
> +	long locked = 0;
> +
> +	if (!iommu->local_domain)
> +		return;
> +
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +	n = rb_first(&domain->local_addr_space->pfn_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_pfn *vpfn;
> +
> +		vpfn = rb_entry(n, struct vfio_pfn, node);
> +
> +		if ((vpfn->iova >= dma->iova) &&
> +		    (vpfn->iova < dma->iova + dma->size))
> +			locked++;
> +	}
> +	vfio_lock_acct(current, -locked);
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +}
> +
> +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> +			    size_t map_size)
> +{
> +	dma_addr_t iova = dma->iova;
> +	unsigned long vaddr = dma->vaddr;
> +	size_t size = map_size, dma_size = 0;
> +	long npage;
> +	unsigned long pfn;
> +	int ret = 0;
> +
> +	while (size) {
> +		/* Pin a contiguous chunk of memory */
> +		npage = __vfio_pin_pages_remote(vaddr + dma_size,
> +						size >> PAGE_SHIFT, dma->prot,
> +						&pfn);
> +		if (npage <= 0) {
> +			WARN_ON(!npage);
> +			ret = (int)npage;
> +			break;
> +		}
> +
> +		/* Map it! */
> +		ret = vfio_iommu_map(iommu, iova + dma_size, pfn, npage,
> +				     dma->prot);
> +		if (ret) {
> +			__vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
> +			break;
> +		}
> +
> +		size -= npage << PAGE_SHIFT;
> +		dma_size += npage << PAGE_SHIFT;
> +	}
> +
> +	if (ret)
> +		vfio_remove_dma(iommu, dma);


There's a bug introduced here: vfio_remove_dma() needs dma->size to be
accurate to the point of failure, but it's not updated until the success
branch below, so it's never going to unmap/unpin anything.

> +	else {
> +		dma->size = dma_size;
> +		dma->iommu_mapped = true;
> +		vfio_update_accounting(iommu, dma);

I'm confused how this works: when called from vfio_dma_do_map() we're
populating a vfio_dma, that is, we're populating part of the iova space
of the device.  How could we have pinned pfns in the local address
space that overlap that?  It would be invalid to have such pinned pfns
since that part of the iova space was not previously mapped.

Another issue is that if there were existing overlaps, userspace would
need to have locked memory limits sufficient for this temporary double
accounting.  I'm not sure how they'd come up with heuristics to handle
that since we're potentially looking at the bulk of VM memory in a
single vfio_dma entry.

> +	}
> +
> +	return ret;
> +}
> +
>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  			   struct vfio_iommu_type1_dma_map *map)
>  {
>  	dma_addr_t iova = map->iova;
>  	unsigned long vaddr = map->vaddr;
>  	size_t size = map->size;
> -	long npage;
>  	int ret = 0, prot = 0;
>  	uint64_t mask;
>  	struct vfio_dma *dma;
> -	unsigned long pfn;
>  
>  	/* Verify that none of our __u64 fields overflow */
>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> @@ -611,29 +981,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	/* Insert zero-sized and grow as we map chunks of it */
>  	vfio_link_dma(iommu, dma);
>  
> -	while (size) {
> -		/* Pin a contiguous chunk of memory */
> -		npage = vfio_pin_pages(vaddr + dma->size,
> -				       size >> PAGE_SHIFT, prot, &pfn);
> -		if (npage <= 0) {
> -			WARN_ON(!npage);
> -			ret = (int)npage;
> -			break;
> -		}
> -
> -		/* Map it! */
> -		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
> -		if (ret) {
> -			vfio_unpin_pages(pfn, npage, prot, true);
> -			break;
> -		}
> -
> -		size -= npage << PAGE_SHIFT;
> -		dma->size += npage << PAGE_SHIFT;
> -	}
> -
> -	if (ret)
> -		vfio_remove_dma(iommu, dma);
> +	/* Don't pin and map if container doesn't contain IOMMU capable domain*/
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		dma->size = size;
> +	else
> +		ret = vfio_pin_map_dma(iommu, dma, size);
>  
>  	mutex_unlock(&iommu->lock);
>  	return ret;
> @@ -662,10 +1014,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
>  	n = rb_first(&iommu->dma_list);
>  
> -	/* If there's not a domain, there better not be any mappings */
> -	if (WARN_ON(n && !d))
> -		return -EINVAL;
> -
>  	for (; n; n = rb_next(n)) {
>  		struct vfio_dma *dma;
>  		dma_addr_t iova;
> @@ -674,20 +1022,43 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  		iova = dma->iova;
>  
>  		while (iova < dma->iova + dma->size) {
> -			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
> +			phys_addr_t phys;
>  			size_t size;
>  
> -			if (WARN_ON(!phys)) {
> -				iova += PAGE_SIZE;
> -				continue;
> -			}
> +			if (dma->iommu_mapped) {
> +				phys = iommu_iova_to_phys(d->domain, iova);
> +
> +				if (WARN_ON(!phys)) {
> +					iova += PAGE_SIZE;
> +					continue;
> +				}
>  
> -			size = PAGE_SIZE;
> +				size = PAGE_SIZE;
>  
> -			while (iova + size < dma->iova + dma->size &&
> -			       phys + size == iommu_iova_to_phys(d->domain,
> +				while (iova + size < dma->iova + dma->size &&
> +				    phys + size == iommu_iova_to_phys(d->domain,
>  								 iova + size))
> -				size += PAGE_SIZE;
> +					size += PAGE_SIZE;
> +			} else {
> +				unsigned long pfn;
> +				unsigned long vaddr = dma->vaddr +
> +						     (iova - dma->iova);
> +				size_t n = dma->iova + dma->size - iova;
> +				long npage;
> +
> +				npage = __vfio_pin_pages_remote(vaddr,
> +								n >> PAGE_SHIFT,
> +								dma->prot,
> +								&pfn);
> +				if (npage <= 0) {
> +					WARN_ON(!npage);
> +					ret = (int)npage;
> +					return ret;
> +				}
> +
> +				phys = pfn << PAGE_SHIFT;
> +				size = npage << PAGE_SHIFT;
> +			}
>  
>  			ret = iommu_map(domain->domain, iova, phys,
>  					size, dma->prot | domain->prot);
> @@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  
>  			iova += size;
>  		}
> +
> +		if (!dma->iommu_mapped) {
> +			dma->iommu_mapped = true;
> +			vfio_update_accounting(iommu, dma);
> +		}

This is the case where we potentially have pinned pfns and we've added
an iommu mapped device and need to adjust accounting.  But we've fully
pinned and accounted the entire iommu mapped space while still holding
the accounting for any pfn mapped space.  So for a time, assuming some
pfn pinned pages, we have duplicate accounting.  How does userspace
deal with that?  For instance, if I'm using an mdev device where the
vendor driver has pinned 512MB of guest memory, then I hot-add an
assigned NIC and the entire VM address space gets pinned, that pinning
will fail unless my locked memory limits are at least 512MB in excess
of my VM size.  Additionally, the user doesn't know how much memory the
vendor driver is going to pin, it might be the whole VM address space,
so the user would need 2x the locked memory limits.

>  	}
>  
>  	return 0;
> @@ -734,11 +1110,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>  	__free_pages(pages, order);
>  }
>  
> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
> +				   struct iommu_group *iommu_group)
> +{
> +	struct vfio_group *g;
> +
> +	list_for_each_entry(g, &domain->group_list, next) {
> +		if (g->iommu_group == iommu_group)
> +			return g;
> +	}
> +
> +	return NULL;
> +}

It would make review easier if changes like splitting this into a
separate function with no functional change on the calling path could
be a separate patch.

> +
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					 struct iommu_group *iommu_group)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
> -	struct vfio_group *group, *g;
> +	struct vfio_group *group;
>  	struct vfio_domain *domain, *d;
>  	struct bus_type *bus = NULL;
>  	int ret;
> @@ -746,10 +1135,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	mutex_lock(&iommu->lock);
>  
>  	list_for_each_entry(d, &iommu->domain_list, next) {
> -		list_for_each_entry(g, &d->group_list, next) {
> -			if (g->iommu_group != iommu_group)
> -				continue;
> +		if (find_iommu_group(d, iommu_group)) {
> +			mutex_unlock(&iommu->lock);
> +			return -EINVAL;
> +		}
> +	}
>  
> +	if (iommu->local_domain) {
> +		if (find_iommu_group(iommu->local_domain, iommu_group)) {
>  			mutex_unlock(&iommu->lock);
>  			return -EINVAL;
>  		}
> @@ -769,6 +1162,34 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_free;
>  
> +	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
> +	    (bus == &mdev_bus_type)) {
> +		if (iommu->local_domain) {
> +			list_add(&group->next,
> +				 &iommu->local_domain->group_list);
> +			kfree(domain);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +		}
> +
> +		domain->local_addr_space =
> +				      kzalloc(sizeof(*domain->local_addr_space),
> +					      GFP_KERNEL);
> +		if (!domain->local_addr_space) {
> +			ret = -ENOMEM;
> +			goto out_free;
> +		}
> +
> +		domain->local_addr_space->task = current;
> +		INIT_LIST_HEAD(&domain->group_list);
> +		list_add(&group->next, &domain->group_list);
> +		domain->local_addr_space->pfn_list = RB_ROOT;
> +		mutex_init(&domain->local_addr_space->pfn_list_lock);
> +		iommu->local_domain = domain;
> +		mutex_unlock(&iommu->lock);
> +		return 0;


This could have been

		if (!iommu->local_domain) {
			domain->local_addr_space =
				kzalloc(sizeof(*domain->local_addr_space),
					GFP_KERNEL);
			if (!domain->local_addr_space) {
				ret = -ENOMEM;
				goto out_free;
			}

			domain->local_addr_space->task = current;
			domain->local_addr_space->pfn_list = RB_ROOT;
			mutex_init(&domain->local_addr_space->pfn_list_lock);
			INIT_LIST_HEAD(&domain->group_list);
			iommu->local_domain = domain;
		} else {
			kfree(domain);
		}

		list_add(&group->next, &iommu->local_domain->group_list);

		mutex_unlock(&iommu->lock);
		return 0;

> +	}
> +
>  	domain->domain = iommu_domain_alloc(bus);
>  	if (!domain->domain) {
>  		ret = -EIO;
> @@ -859,6 +1280,41 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
>  		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
>  }
>  
> +static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
> +{
> +	struct vfio_domain *domain = iommu->local_domain;
> +	struct vfio_dma *dma, *tdma;
> +	struct rb_node *n;
> +	long locked = 0;
> +
> +	rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
> +					     node) {
> +		vfio_unmap_unpin(iommu, dma);
> +	}
> +
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +	n = rb_first(&domain->local_addr_space->pfn_list);
> +
> +	for (; n; n = rb_next(n))
> +		locked++;
> +
> +	vfio_lock_acct(domain->local_addr_space->task, locked);
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +}
> +
> +static void vfio_local_unpin_all(struct vfio_domain *domain)
> +{
> +	struct rb_node *node;
> +
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +	while ((node = rb_first(&domain->local_addr_space->pfn_list)))
> +		vfio_unpin_pfn(domain,
> +				rb_entry(node, struct vfio_pfn, node), false);
> +
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +}
> +
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
>  					  struct iommu_group *iommu_group)
>  {
> @@ -868,31 +1324,57 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  
>  	mutex_lock(&iommu->lock);
>  
> -	list_for_each_entry(domain, &iommu->domain_list, next) {
> -		list_for_each_entry(group, &domain->group_list, next) {
> -			if (group->iommu_group != iommu_group)
> -				continue;
> -
> -			iommu_detach_group(domain->domain, iommu_group);
> +	if (iommu->local_domain) {
> +		domain = iommu->local_domain;
> +		group = find_iommu_group(domain, iommu_group);
> +		if (group) {
>  			list_del(&group->next);
>  			kfree(group);
> -			/*
> -			 * Group ownership provides privilege, if the group
> -			 * list is empty, the domain goes away.  If it's the
> -			 * last domain, then all the mappings go away too.
> -			 */
> +
>  			if (list_empty(&domain->group_list)) {
> -				if (list_is_singular(&iommu->domain_list))
> +				vfio_local_unpin_all(domain);
> +				if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
>  					vfio_iommu_unmap_unpin_all(iommu);
> -				iommu_domain_free(domain->domain);
> -				list_del(&domain->next);
>  				kfree(domain);
> +				iommu->local_domain = NULL;
> +			}
> +			goto detach_group_done;
> +		}
> +	}
> +
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		goto detach_group_done;
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		group = find_iommu_group(domain, iommu_group);
> +		if (!group)
> +			continue;
> +
> +		iommu_detach_group(domain->domain, iommu_group);
> +		list_del(&group->next);
> +		kfree(group);
> +		/*
> +		 * Group ownership provides privilege, if the group list is
> +		 * empty, the domain goes away. If it's the last domain with
> +		 * iommu and local domain doesn't exist, then all the mappings
> +		 * go away too. If it's the last domain with iommu and local
> +		 * domain exist, update accounting
> +		 */
> +		if (list_empty(&domain->group_list)) {
> +			if (list_is_singular(&iommu->domain_list)) {
> +				if (!iommu->local_domain)
> +					vfio_iommu_unmap_unpin_all(iommu);
> +				else
> +					vfio_iommu_unmap_unpin_reaccount(iommu);
>  			}
> -			goto done;
> +			iommu_domain_free(domain->domain);
> +			list_del(&domain->next);
> +			kfree(domain);
>  		}
> +		break;
>  	}
>  
> -done:
> +detach_group_done:
>  	mutex_unlock(&iommu->lock);
>  }
>  
> @@ -924,27 +1406,48 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>  	return iommu;
>  }
>  
> +static void vfio_release_domain(struct vfio_domain *domain)
> +{
> +	struct vfio_group *group, *group_tmp;
> +
> +	list_for_each_entry_safe(group, group_tmp,
> +				 &domain->group_list, next) {
> +		if (!domain->local_addr_space)
> +			iommu_detach_group(domain->domain, group->iommu_group);
> +		list_del(&group->next);
> +		kfree(group);
> +	}
> +
> +	if (domain->local_addr_space)
> +		vfio_local_unpin_all(domain);
> +	else
> +		iommu_domain_free(domain->domain);
> +}
> +
>  static void vfio_iommu_type1_release(void *iommu_data)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
>  	struct vfio_domain *domain, *domain_tmp;
> -	struct vfio_group *group, *group_tmp;
> +
> +	if (iommu->local_domain) {
> +		vfio_release_domain(iommu->local_domain);
> +		kfree(iommu->local_domain);
> +		iommu->local_domain = NULL;
> +	}
>  
>  	vfio_iommu_unmap_unpin_all(iommu);
>  
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		goto release_exit;
> +
>  	list_for_each_entry_safe(domain, domain_tmp,
>  				 &iommu->domain_list, next) {
> -		list_for_each_entry_safe(group, group_tmp,
> -					 &domain->group_list, next) {
> -			iommu_detach_group(domain->domain, group->iommu_group);
> -			list_del(&group->next);
> -			kfree(group);
> -		}
> -		iommu_domain_free(domain->domain);
> +		vfio_release_domain(domain);
>  		list_del(&domain->next);
>  		kfree(domain);
>  	}
>  
> +release_exit:
>  	kfree(iommu);
>  }
>  
> @@ -1048,6 +1551,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>  	.ioctl		= vfio_iommu_type1_ioctl,
>  	.attach_group	= vfio_iommu_type1_attach_group,
>  	.detach_group	= vfio_iommu_type1_detach_group,
> +	.pin_pages	= vfio_iommu_type1_pin_pages,
> +	.unpin_pages	= vfio_iommu_type1_unpin_pages,
>  };
>  
>  static int __init vfio_iommu_type1_init(void)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 0ecae0b1cd34..0bd25ba6223d 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -17,6 +17,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/poll.h>
>  #include <uapi/linux/vfio.h>
> +#include <linux/mdev.h>
>  
>  /**
>   * struct vfio_device_ops - VFIO bus driver device callbacks
> @@ -75,7 +76,11 @@ struct vfio_iommu_driver_ops {
>  					struct iommu_group *group);
>  	void		(*detach_group)(void *iommu_data,
>  					struct iommu_group *group);
> -
> +	long		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
> +				     long npage, int prot,
> +				     unsigned long *phys_pfn);
> +	long		(*unpin_pages)(void *iommu_data, unsigned long *pfn,
> +				       long npage);
>  };
>  
>  extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
> @@ -127,6 +132,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>  }
>  #endif /* CONFIG_EEH */
>  
> +extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
> +			   long npage, int prot, unsigned long *phys_pfn);
> +
> +extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
> +			     long npage);
> +
>  /*
>   * IRQfd - generic
>   */



* Re: [Qemu-devel] [PATCH v8 3/6] vfio iommu: Add support for mediated devices
@ 2016-10-11 22:06     ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-11 22:06 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Tue, 11 Oct 2016 01:58:34 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VFIO IOMMU drivers are designed for devices that are IOMMU capable.
> A mediated device only uses the IOMMU APIs; the underlying hardware
> can be managed by an IOMMU domain.
> 
> The aim of this change is:
> - To use most of the code of the TYPE1 IOMMU driver for mediated devices
> - To support directly assigned devices and mediated devices in a single
>   module
> 
> Added two new callback functions to struct vfio_iommu_driver_ops. A
> backend IOMMU module that supports pinning and unpinning pages for mdev
> devices should provide these functions.
> Added APIs for pinning and unpinning pages to the VFIO module. These call
> back into the backend IOMMU module to actually pin and unpin pages.
> 
> This change adds pin and unpin support for mediated devices to the TYPE1
> IOMMU backend module. More details:
> - When the iommu_group of a mediated device is attached, the task
>   structure is cached and used later for page pinning and accounting.
> - It keeps track of pinned pages for the mediated domain. This data is
>   used to verify unpin requests and to unpin any remaining pages at
>   detach time.
> - It reuses the existing mechanism for page accounting. If an
>   IOMMU-capable domain exists in the container, all pages are already
>   pinned and accounted. Accounting for an mdev device is done only if
>   there is no IOMMU-capable domain in the container.
> - Page accounting is updated on hot plug and unplug of mdev and
>   pass-through devices.
> 
> Tested by assigning below combinations of devices to a single VM:
> - GPU pass through only
> - vGPU device only
> - One GPU pass through and one vGPU device
> - Linux VM hot plug and unplug of a vGPU device while a GPU pass-through
>   device exists
> - Linux VM hot plug and unplug of a GPU pass-through device while a vGPU
>   device exists
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
> ---
>  drivers/vfio/vfio.c             | 117 +++++++
>  drivers/vfio/vfio_iommu_type1.c | 685 ++++++++++++++++++++++++++++++++++------
>  include/linux/vfio.h            |  13 +-
>  3 files changed, 724 insertions(+), 91 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 6fd6fa5469de..e3e342861e04 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1782,6 +1782,123 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
>  }
>  EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
>  
> +static struct vfio_group *vfio_group_from_dev(struct device *dev)
> +{
> +	struct vfio_device *device;
> +	struct vfio_group *group;
> +	int ret;
> +
> +	device = vfio_device_get_from_dev(dev);

Note how this does dev->iommu_group->vfio_group->vfio_device and then
we back out one level to get the vfio_group; it's not a terribly
lightweight path.  Perhaps we should have:

struct vfio_group *vfio_group_get_from_dev(struct device *dev)
{
        struct iommu_group *iommu_group;
        struct vfio_group *group;

        iommu_group = iommu_group_get(dev);
        if (!iommu_group)
                return NULL;

        group = vfio_group_get_from_iommu(iommu_group);
	iommu_group_put(iommu_group);

	return group;
}

vfio_device_get_from_dev() would make use of this.

Then create a separate:

static int vfio_group_add_container_user(struct vfio_group *group)
{

> +	if (!atomic_inc_not_zero(&group->container_users)) {
		return -EINVAL;
> +	}
> +
> +	if (group->noiommu) {
> +		atomic_dec(&group->container_users);
		return -EPERM;
> +	}
> +
> +	if (!group->container->iommu_driver ||
> +	    !vfio_group_viable(group)) {
> +		atomic_dec(&group->container_users);
		return -EINVAL;
> +	}
> +
	return 0;
}

vfio_group_get_external_user() would be updated to use this.  In fact,
creating these two functions and updating the existing code to use
these should be a separate patch.

Note that your version did not hold a group reference while doing the
pin/unpin operations below, which seems like a bug.
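[Editor's note: the container_users pattern discussed above can be sketched in
userspace. The toy model below (all names illustrative, using C11 atomics in
place of the kernel's atomic_t and real errno values only by convention) shows
why atomic_inc_not_zero() gates the add, and why every failure path must drop
the reference it took:]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the container_users pattern: a user may only be added
 * while at least one user already exists (the group is "live"),
 * mirroring atomic_inc_not_zero() in the suggested
 * vfio_group_add_container_user(). Illustrative only. */
struct toy_group {
	atomic_int container_users;
	bool noiommu;
};

/* Returns 0 on success, negative errno-style value on failure. */
static int toy_add_container_user(struct toy_group *group)
{
	int old = atomic_load(&group->container_users);

	/* inc-not-zero: never resurrect a group whose last user is gone */
	do {
		if (old == 0)
			return -22; /* -EINVAL */
	} while (!atomic_compare_exchange_weak(&group->container_users,
					       &old, old + 1));

	if (group->noiommu) {
		/* failure path must drop the reference it just took */
		atomic_fetch_sub(&group->container_users, 1);
		return -1; /* -EPERM */
	}
	return 0;
}

static void toy_del_container_user(struct toy_group *group)
{
	atomic_fetch_sub(&group->container_users, 1);
}
```

The same discipline applies one level up: the caller must hold its own group
reference across the whole pin/unpin operation, not just during the lookup.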

> +
> +err_ret:
> +	vfio_device_put(device);
> +	return ERR_PTR(ret);
> +}
> +
> +/*
> + * Pin a set of guest PFNs and return their associated host PFNs for local
> + * domain only.
> + * @dev [in] : device
> + * @user_pfn [in]: array of user/guest PFNs
> + * @npage [in]: count of array elements
> + * @prot [in] : protection flags
> + * @phys_pfn[out] : array of host PFNs
> + */
> +long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
> +		    long npage, int prot, unsigned long *phys_pfn)
> +{
> +	struct vfio_container *container;
> +	struct vfio_group *group;
> +	struct vfio_iommu_driver *driver;
> +	ssize_t ret = -EINVAL;
> +
> +	if (!dev || !user_pfn || !phys_pfn)
> +		return -EINVAL;
> +
> +	group = vfio_group_from_dev(dev);
> +	if (IS_ERR(group))
> +		return PTR_ERR(group);

As suggested above:

	group = vfio_group_get_from_dev(dev);
	if (!group)
		return -ENODEV;

	ret = vfio_group_add_container_user(group);
	if (ret) {
		vfio_group_put(group);
		return ret;
	}

> +
> +	container = group->container;
> +	if (IS_ERR(container))
> +		return PTR_ERR(container);
> +
> +	down_read(&container->group_lock);
> +
> +	driver = container->iommu_driver;
> +	if (likely(driver && driver->ops->pin_pages))
> +		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
> +					     npage, prot, phys_pfn);
> +
> +	up_read(&container->group_lock);
> +	vfio_group_try_dissolve_container(group);

Even if you're considering that the container_user reference holds the
driver, I think we need a group reference throughout this, and it should
end with a vfio_group_put(group);

> +
> +	return ret;
> +
> +}
> +EXPORT_SYMBOL(vfio_pin_pages);
> +
> +/*
> + * Unpin set of host PFNs for local domain only.
> + * @dev [in] : device
> + * @pfn [in] : array of host PFNs to be unpinned.
> + * @npage [in] :count of elements in array, that is number of pages.
> + */
> +long vfio_unpin_pages(struct device *dev, unsigned long *pfn, long npage)
> +{
> +	struct vfio_container *container;
> +	struct vfio_group *group;
> +	struct vfio_iommu_driver *driver;
> +	ssize_t ret = -EINVAL;
> +
> +	if (!dev || !pfn)
> +		return -EINVAL;
> +
> +	group = vfio_group_from_dev(dev);
> +	if (IS_ERR(group))
> +		return PTR_ERR(group);
> +
> +	container = group->container;
> +	if (IS_ERR(container))
> +		return PTR_ERR(container);
> +
> +	down_read(&container->group_lock);
> +
> +	driver = container->iommu_driver;
> +	if (likely(driver && driver->ops->unpin_pages))
> +		ret = driver->ops->unpin_pages(container->iommu_data, pfn,
> +					       npage);
> +
> +	up_read(&container->group_lock);
> +	vfio_group_try_dissolve_container(group);


Same as above on all counts.

> +	return ret;
> +}
> +EXPORT_SYMBOL(vfio_unpin_pages);
> +
>  /**
>   * Module/class support
>   */
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 2ba19424e4a1..ce6d6dcbd9a8 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -55,18 +55,26 @@ MODULE_PARM_DESC(disable_hugepages,
>  
>  struct vfio_iommu {
>  	struct list_head	domain_list;
> +	struct vfio_domain	*local_domain;
>  	struct mutex		lock;
>  	struct rb_root		dma_list;
>  	bool			v2;
>  	bool			nesting;
>  };
>  
> +struct local_addr_space {
> +	struct task_struct	*task;
> +	struct rb_root		pfn_list;	/* pinned Host pfn list */
> +	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
> +};
> +
>  struct vfio_domain {
>  	struct iommu_domain	*domain;
>  	struct list_head	next;
>  	struct list_head	group_list;
>  	int			prot;		/* IOMMU_CACHE */
>  	bool			fgsp;		/* Fine-grained super pages */
> +	struct local_addr_space	*local_addr_space;
>  };

Consider the structure's internal alignment; this should be placed below
group_list.
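[Editor's note: a minimal userspace sketch of the alignment point. These
structs only mimic the *shape* of struct vfio_domain (plain pointers stand in
for iommu_domain/list_head members); inserting a new pointer after the small
scalar members leaves an interior padding hole, while grouping it with the
other pointers does not:]

```c
#include <assert.h>
#include <stddef.h>

/* New pointer appended after int + bool: the compiler must insert
 * padding before it so the pointer is naturally aligned. */
struct badly_ordered {
	void *domain;
	void *next;
	void *group_list;
	int prot;
	_Bool fgsp;
	void *local_addr_space;	/* interior padding precedes this */
};

/* New pointer grouped with the other pointers: no interior hole,
 * only the usual tail padding after the last small members. */
struct well_ordered {
	void *domain;
	void *next;
	void *group_list;
	void *local_addr_space;
	int prot;
	_Bool fgsp;
};
```

Tools like pahole report the interior hole in the first layout; the second
keeps the structure hole-free up to its tail.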

>  
>  struct vfio_dma {
> @@ -75,6 +83,7 @@ struct vfio_dma {
>  	unsigned long		vaddr;		/* Process virtual addr */
>  	size_t			size;		/* Map size (bytes) */
>  	int			prot;		/* IOMMU_READ/WRITE */
> +	bool			iommu_mapped;
>  };
>  
>  struct vfio_group {
> @@ -83,6 +92,22 @@ struct vfio_group {
>  };
>  
>  /*
> + * Guest RAM pinning working set or DMA target
> + */
> +struct vfio_pfn {
> +	struct rb_node		node;
> +	unsigned long		vaddr;		/* virtual addr */
> +	dma_addr_t		iova;		/* IOVA */
> +	unsigned long		pfn;		/* Host pfn */
> +	size_t			prot;

size_t?  Shouldn't this be an int?

> +	atomic_t		ref_count;
> +};
> +
> +
> +#define IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu)	\
> +			 (list_empty(&iommu->domain_list) ? false : true)

(!list_empty(...))

> +
> +/*
>   * This code handles mapping and unmapping of user data buffers
>   * into DMA'ble space using the IOMMU
>   */
> @@ -130,6 +155,84 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>  	rb_erase(&old->node, &iommu->dma_list);
>  }
>  
> +/*
> + * Helper Functions for host pfn list
> + */
> +
> +static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
> +				      unsigned long pfn)
> +{
> +	struct rb_node *node;
> +	struct vfio_pfn *vpfn, *ret = NULL;
> +
> +	node = domain->local_addr_space->pfn_list.rb_node;
> +
> +	while (node) {
> +		vpfn = rb_entry(node, struct vfio_pfn, node);
> +
> +		if (pfn < vpfn->pfn)
> +			node = node->rb_left;
> +		else if (pfn > vpfn->pfn)
> +			node = node->rb_right;
> +		else {
> +			ret = vpfn;
> +			break;
> +		}
> +	}
> +
> +	return ret;
> +}

Some unnecessary style differences from vfio_find_dma()
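[Editor's note: for comparison, a self-contained userspace sketch of the
lookup written in the vfio_find_dma() style — entry resolved inside the loop
and returned directly, with no carried `ret` variable. Plain child pointers
stand in for struct rb_node, and all names are illustrative:]

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the pfn rb-tree node. */
struct toy_pfn {
	unsigned long pfn;
	struct toy_pfn *left, *right;
};

/* Lookup in the vfio_find_dma() style: return the match directly
 * from the loop instead of breaking out with a result variable. */
static struct toy_pfn *toy_find_pfn(struct toy_pfn *node, unsigned long pfn)
{
	while (node) {
		if (pfn < node->pfn)
			node = node->left;
		else if (pfn > node->pfn)
			node = node->right;
		else
			return node;
	}

	return NULL;
}
```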

> +
> +static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
> +{
> +	struct rb_node **link, *parent = NULL;
> +	struct vfio_pfn *vpfn;
> +
> +	link = &domain->local_addr_space->pfn_list.rb_node;
> +	while (*link) {
> +		parent = *link;
> +		vpfn = rb_entry(parent, struct vfio_pfn, node);
> +
> +		if (new->pfn < vpfn->pfn)
> +			link = &(*link)->rb_left;
> +		else
> +			link = &(*link)->rb_right;
> +	}
> +
> +	rb_link_node(&new->node, parent, link);
> +	rb_insert_color(&new->node, &domain->local_addr_space->pfn_list);
> +}
> +
> +static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
> +{
> +	rb_erase(&old->node, &domain->local_addr_space->pfn_list);
> +}
> +
> +static int vfio_add_to_pfn_list(struct vfio_domain *domain, unsigned long vaddr,
> +				dma_addr_t iova, unsigned long pfn, size_t prot)

size_t?

> +{
> +	struct vfio_pfn *vpfn;
> +
> +	vpfn = kzalloc(sizeof(*vpfn), GFP_KERNEL);
> +	if (!vpfn)
> +		return -ENOMEM;
> +
> +	vpfn->vaddr = vaddr;
> +	vpfn->iova = iova;
> +	vpfn->pfn = pfn;
> +	vpfn->prot = prot;
> +	atomic_set(&vpfn->ref_count, 1);
> +	vfio_link_pfn(domain, vpfn);
> +	return 0;
> +}
> +
> +static void vfio_remove_from_pfn_list(struct vfio_domain *domain,
> +				      struct vfio_pfn *vpfn)
> +{
> +	vfio_unlink_pfn(domain, vpfn);
> +	kfree(vpfn);
> +}
> +
>  struct vwork {
>  	struct mm_struct	*mm;
>  	long			npage;
> @@ -150,17 +253,17 @@ static void vfio_lock_acct_bg(struct work_struct *work)
>  	kfree(vwork);
>  }
>  
> -static void vfio_lock_acct(long npage)
> +static void vfio_lock_acct(struct task_struct *task, long npage)
>  {
>  	struct vwork *vwork;
>  	struct mm_struct *mm;
>  
> -	if (!current->mm || !npage)
> +	if (!task->mm || !npage)
>  		return; /* process exited or nothing to do */
>  
> -	if (down_write_trylock(&current->mm->mmap_sem)) {
> -		current->mm->locked_vm += npage;
> -		up_write(&current->mm->mmap_sem);
> +	if (down_write_trylock(&task->mm->mmap_sem)) {
> +		task->mm->locked_vm += npage;
> +		up_write(&task->mm->mmap_sem);
>  		return;
>  	}
>  
> @@ -172,7 +275,7 @@ static void vfio_lock_acct(long npage)
>  	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
>  	if (!vwork)
>  		return;
> -	mm = get_task_mm(current);
> +	mm = get_task_mm(task);
>  	if (!mm) {
>  		kfree(vwork);
>  		return;
> @@ -228,20 +331,31 @@ static int put_pfn(unsigned long pfn, int prot)
>  	return 0;
>  }
>  
> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> +			 int prot, unsigned long *pfn)
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct mm_struct *local_mm = (mm ? mm : current->mm);
>  	int ret = -EFAULT;
>  
> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> +	if (mm) {
> +		down_read(&local_mm->mmap_sem);
> +		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
> +					!!(prot & IOMMU_WRITE), 0, page, NULL);
> +		up_read(&local_mm->mmap_sem);
> +	} else
> +		ret = get_user_pages_fast(vaddr, 1,
> +					  !!(prot & IOMMU_WRITE), page);
> +
> +	if (ret == 1) {
>  		*pfn = page_to_pfn(page[0]);
>  		return 0;
>  	}
>  
> -	down_read(&current->mm->mmap_sem);
> +	down_read(&local_mm->mmap_sem);
>  
> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
>  
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> @@ -249,7 +363,7 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>  			ret = 0;
>  	}
>  
> -	up_read(&current->mm->mmap_sem);
> +	up_read(&local_mm->mmap_sem);
>  
>  	return ret;
>  }
> @@ -259,8 +373,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>   * first page and all consecutive pages with the same locking.
>   */
> -static long vfio_pin_pages(unsigned long vaddr, long npage,
> -			   int prot, unsigned long *pfn_base)
> +static long __vfio_pin_pages_remote(unsigned long vaddr, long npage,
> +				    int prot, unsigned long *pfn_base)
>  {
>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  	bool lock_cap = capable(CAP_IPC_LOCK);
> @@ -270,7 +384,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	if (!current->mm)
>  		return -ENODEV;
>  
> -	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> +	ret = vaddr_get_pfn(NULL, vaddr, prot, pfn_base);
>  	if (ret)
>  		return ret;
>  
> @@ -285,7 +399,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  
>  	if (unlikely(disable_hugepages)) {
>  		if (!rsvd)
> -			vfio_lock_acct(1);
> +			vfio_lock_acct(current, 1);
>  		return 1;
>  	}
>  
> @@ -293,7 +407,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
>  		unsigned long pfn = 0;
>  
> -		ret = vaddr_get_pfn(vaddr, prot, &pfn);
> +		ret = vaddr_get_pfn(NULL, vaddr, prot, &pfn);
>  		if (ret)
>  			break;
>  
> @@ -313,13 +427,13 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	}
>  
>  	if (!rsvd)
> -		vfio_lock_acct(i);
> +		vfio_lock_acct(current, i);
>  
>  	return i;
>  }
>  
> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> -			     int prot, bool do_accounting)
> +static long __vfio_unpin_pages_remote(unsigned long pfn, long npage, int prot,
> +				      bool do_accounting)
>  {
>  	unsigned long unlocked = 0;
>  	long i;
> @@ -328,7 +442,188 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
>  		unlocked += put_pfn(pfn++, prot);
>  
>  	if (do_accounting)
> -		vfio_lock_acct(-unlocked);
> +		vfio_lock_acct(current, -unlocked);
> +	return unlocked;
> +}
> +
> +static long __vfio_pin_pages_local(struct vfio_domain *domain,
> +				   unsigned long vaddr, int prot,
> +				   unsigned long *pfn_base,
> +				   bool do_accounting)
> +{
> +	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	bool lock_cap = capable(CAP_IPC_LOCK);
> +	long ret;
> +	bool rsvd;
> +	struct task_struct *task = domain->local_addr_space->task;
> +
> +	if (!task->mm)
> +		return -ENODEV;
> +
> +	ret = vaddr_get_pfn(task->mm, vaddr, prot, pfn_base);
> +	if (ret)
> +		return ret;
> +
> +	rsvd = is_invalid_reserved_pfn(*pfn_base);
> +
> +	if (!rsvd && !lock_cap && task->mm->locked_vm + 1 > limit) {
> +		put_pfn(*pfn_base, prot);
> +		pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
> +			limit << PAGE_SHIFT);
> +		return -ENOMEM;
> +	}
> +
> +	if (!rsvd && do_accounting)
> +		vfio_lock_acct(task, 1);
> +
> +	return 1;
> +}
> +
> +static void __vfio_unpin_pages_local(struct vfio_domain *domain,
> +				     unsigned long pfn, int prot,
> +				     bool do_accounting)
> +{
> +	put_pfn(pfn, prot);
> +
> +	if (do_accounting)
> +		vfio_lock_acct(domain->local_addr_space->task, -1);
> +}
> +
> +static int vfio_unpin_pfn(struct vfio_domain *domain,
> +			  struct vfio_pfn *vpfn, bool do_accounting)
> +{
> +	__vfio_unpin_pages_local(domain, vpfn->pfn, vpfn->prot,
> +				 do_accounting);
> +
> +	if (atomic_dec_and_test(&vpfn->ref_count))
> +		vfio_remove_from_pfn_list(domain, vpfn);
> +
> +	return 1;
> +}
> +
> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
> +				       unsigned long *user_pfn,
> +				       long npage, int prot,
> +				       unsigned long *phys_pfn)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain;
> +	int i, j, ret;
> +	long retpage;
> +	unsigned long remote_vaddr;
> +	unsigned long *pfn = phys_pfn;
> +	struct vfio_dma *dma;
> +	bool do_accounting = false;
> +
> +	if (!iommu || !user_pfn || !phys_pfn)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	if (!iommu->local_domain) {
> +		ret = -EINVAL;
> +		goto pin_done;
> +	}
> +
> +	domain = iommu->local_domain;
> +
> +	/*
> +	 * If iommu capable domain exist in the container then all pages are
> +	 * already pinned and accounted. Accouting should be done if there is no
> +	 * iommu capable domain in the container.
> +	 */
> +	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +		dma_addr_t iova;
> +
> +		iova = user_pfn[i] << PAGE_SHIFT;
> +
> +		dma = vfio_find_dma(iommu, iova, 0);
> +		if (!dma) {
> +			ret = -EINVAL;
> +			goto pin_unwind;
> +		}
> +
> +		remote_vaddr = dma->vaddr + iova - dma->iova;
> +
> +		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
> +						 &pfn[i], do_accounting);
> +		if (retpage <= 0) {
> +			WARN_ON(!retpage);
> +			ret = (int)retpage;
> +			goto pin_unwind;
> +		}
> +
> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +		/* search if pfn exist */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p) {
> +			atomic_inc(&p->ref_count);
> +			mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +			continue;
> +		}
> +
> +		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
> +					   pfn[i], prot);
> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +
> +		if (ret) {
> +			__vfio_unpin_pages_local(domain, pfn[i], prot,
> +						 do_accounting);
> +			goto pin_unwind;
> +		}
> +	}
> +
> +	ret = i;
> +	goto pin_done;
> +
> +pin_unwind:
> +	pfn[i] = 0;
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +	for (j = 0; j < i; j++) {
> +		struct vfio_pfn *p;
> +
> +		p = vfio_find_pfn(domain, pfn[j]);
> +		if (p)
> +			vfio_unpin_pfn(domain, p, do_accounting);
> +
> +		pfn[j] = 0;
> +	}
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +
> +pin_done:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
> +					 long npage)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL;
> +	long unlocked = 0;
> +	int i;
> +
> +	if (!iommu || !pfn)
> +		return -EINVAL;
> +

We need iommu->lock here, right?

> +	domain = iommu->local_domain;
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +
> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +		/* verify if pfn exist in pfn_list */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p)
> +			unlocked += vfio_unpin_pfn(domain, p, true);
> +
> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);

We hold this mutex outside the loop in the pin unwind case; why is it
different here?
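[Editor's note: the granularity difference can be made concrete with a
counter-instrumented stand-in for the mutex (illustrative only, not a real
lock): per-page locking takes the pfn_list mutex npage times and lets other
pinners interleave mid-batch, while batch locking, as in the pin-unwind path,
takes it once:]

```c
#include <assert.h>

/* Fake mutex that only counts acquisitions. */
struct counting_mutex {
	int acquisitions;
};

static void cm_lock(struct counting_mutex *m)   { m->acquisitions++; }
static void cm_unlock(struct counting_mutex *m) { (void)m; }

/* Per-iteration locking, as in vfio_iommu_type1_unpin_pages():
 * the list can change between iterations of the batch. */
static int unpin_batch_per_page(struct counting_mutex *m, long npage)
{
	long i;

	for (i = 0; i < npage; i++) {
		cm_lock(m);
		/* ... vfio_find_pfn() + vfio_unpin_pfn() for one page ... */
		cm_unlock(m);
	}
	return m->acquisitions;
}

/* Whole-batch locking, as in the pin_unwind path: one acquisition
 * covers the entire unpin sequence. */
static int unpin_batch_once(struct counting_mutex *m, long npage)
{
	long i;

	cm_lock(m);
	for (i = 0; i < npage; i++) {
		/* ... unpin one page under the held lock ... */
	}
	cm_unlock(m);
	return m->acquisitions;
}
```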

> +	}
>  
>  	return unlocked;
>  }
> @@ -341,6 +636,12 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  
>  	if (!dma->size)
>  		return;
> +
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		return;
> +
> +	if (!dma->iommu_mapped)
> +		return;
>  	/*
>  	 * We use the IOMMU to track the physical addresses, otherwise we'd
>  	 * need a much more complicated tracking system.  Unfortunately that
> @@ -382,15 +683,16 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  		if (WARN_ON(!unmapped))
>  			break;
>  
> -		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
> -					     unmapped >> PAGE_SHIFT,
> -					     dma->prot, false);
> +		unlocked += __vfio_unpin_pages_remote(phys >> PAGE_SHIFT,
> +						      unmapped >> PAGE_SHIFT,
> +						      dma->prot, false);
>  		iova += unmapped;
>  
>  		cond_resched();
>  	}
>  
> -	vfio_lock_acct(-unlocked);
> +	dma->iommu_mapped = false;
> +	vfio_lock_acct(current, -unlocked);
>  }
>  
>  static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
> @@ -558,17 +860,85 @@ unwind:
>  	return ret;
>  }
>  
> +void vfio_update_accounting(struct vfio_iommu *iommu, struct vfio_dma *dma)
> +{
> +	struct vfio_domain *domain = iommu->local_domain;
> +	struct rb_node *n;
> +	long locked = 0;
> +
> +	if (!iommu->local_domain)
> +		return;
> +
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +	n = rb_first(&domain->local_addr_space->pfn_list);
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_pfn *vpfn;
> +
> +		vpfn = rb_entry(n, struct vfio_pfn, node);
> +
> +		if ((vpfn->iova >= dma->iova) &&
> +		    (vpfn->iova < dma->iova + dma->size))
> +			locked++;
> +	}
> +	vfio_lock_acct(current, -locked);
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +}
> +
> +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
> +			    size_t map_size)
> +{
> +	dma_addr_t iova = dma->iova;
> +	unsigned long vaddr = dma->vaddr;
> +	size_t size = map_size, dma_size = 0;
> +	long npage;
> +	unsigned long pfn;
> +	int ret = 0;
> +
> +	while (size) {
> +		/* Pin a contiguous chunk of memory */
> +		npage = __vfio_pin_pages_remote(vaddr + dma_size,
> +						size >> PAGE_SHIFT, dma->prot,
> +						&pfn);
> +		if (npage <= 0) {
> +			WARN_ON(!npage);
> +			ret = (int)npage;
> +			break;
> +		}
> +
> +		/* Map it! */
> +		ret = vfio_iommu_map(iommu, iova + dma_size, pfn, npage,
> +				     dma->prot);
> +		if (ret) {
> +			__vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
> +			break;
> +		}
> +
> +		size -= npage << PAGE_SHIFT;
> +		dma_size += npage << PAGE_SHIFT;
> +	}
> +
> +	if (ret)
> +		vfio_remove_dma(iommu, dma);


There's a bug introduced here: vfio_remove_dma() needs dma->size to be
accurate to the point of failure, but it's not updated until the success
branch below, so it will never unmap/unpin anything.
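[Editor's note: a minimal model of that unwind bug. The cleanup path
(vfio_remove_dma()) trusts dma->size, so the size must grow as each chunk is
mapped rather than only in the success branch. `fail_at` injects a map failure
after that many chunks; all names and values are illustrative:]

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL

struct toy_dma {
	unsigned long size;	/* bytes the cleanup path will unmap */
};

/* Returns the number of bytes cleanup would see after a failure.
 * fixed == 1 models updating dma->size per chunk; fixed == 0 models
 * the patch as posted, which assigns dma->size only on full success. */
static unsigned long mapped_at_failure(long nchunk, long fail_at, int fixed)
{
	struct toy_dma dma = { 0 };
	unsigned long progress = 0;
	long i;

	for (i = 0; i < nchunk; i++) {
		if (i == fail_at)
			break;			/* vfio_iommu_map() failed */
		progress += TOY_PAGE_SIZE;	/* chunk pinned and mapped */
		if (fixed)
			dma.size = progress;	/* keep dma->size accurate */
	}

	return dma.size;	/* what vfio_remove_dma() would act on */
}
```

With the posted code, a failure mid-map leaves dma->size at zero, so the
already-mapped chunks are never unmapped or unpinned.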

> +	else {
> +		dma->size = dma_size;
> +		dma->iommu_mapped = true;
> +		vfio_update_accounting(iommu, dma);

I'm confused how this works, when called from vfio_dma_do_map() we're
populating a vfio_dma, that is we're populating part of the iova space
of the device.  How could we have pinned pfns in the local address
space that overlap that?  It would be invalid to have such pinned pfns
since that part of the iova space was not previously mapped.

Another issue is that if there were existing overlaps, userspace would
need to have locked memory limits sufficient for this temporary double
accounting.  I'm not sure how they'd come up with heuristics to handle
that since we're potentially looking at the bulk of VM memory in a
single vfio_dma entry.

> +	}
> +
> +	return ret;
> +}
> +
>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  			   struct vfio_iommu_type1_dma_map *map)
>  {
>  	dma_addr_t iova = map->iova;
>  	unsigned long vaddr = map->vaddr;
>  	size_t size = map->size;
> -	long npage;
>  	int ret = 0, prot = 0;
>  	uint64_t mask;
>  	struct vfio_dma *dma;
> -	unsigned long pfn;
>  
>  	/* Verify that none of our __u64 fields overflow */
>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> @@ -611,29 +981,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	/* Insert zero-sized and grow as we map chunks of it */
>  	vfio_link_dma(iommu, dma);
>  
> -	while (size) {
> -		/* Pin a contiguous chunk of memory */
> -		npage = vfio_pin_pages(vaddr + dma->size,
> -				       size >> PAGE_SHIFT, prot, &pfn);
> -		if (npage <= 0) {
> -			WARN_ON(!npage);
> -			ret = (int)npage;
> -			break;
> -		}
> -
> -		/* Map it! */
> -		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
> -		if (ret) {
> -			vfio_unpin_pages(pfn, npage, prot, true);
> -			break;
> -		}
> -
> -		size -= npage << PAGE_SHIFT;
> -		dma->size += npage << PAGE_SHIFT;
> -	}
> -
> -	if (ret)
> -		vfio_remove_dma(iommu, dma);
> +	/* Don't pin and map if container doesn't contain IOMMU capable domain*/
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		dma->size = size;
> +	else
> +		ret = vfio_pin_map_dma(iommu, dma, size);
>  
>  	mutex_unlock(&iommu->lock);
>  	return ret;
> @@ -662,10 +1014,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
>  	n = rb_first(&iommu->dma_list);
>  
> -	/* If there's not a domain, there better not be any mappings */
> -	if (WARN_ON(n && !d))
> -		return -EINVAL;
> -
>  	for (; n; n = rb_next(n)) {
>  		struct vfio_dma *dma;
>  		dma_addr_t iova;
> @@ -674,20 +1022,43 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  		iova = dma->iova;
>  
>  		while (iova < dma->iova + dma->size) {
> -			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
> +			phys_addr_t phys;
>  			size_t size;
>  
> -			if (WARN_ON(!phys)) {
> -				iova += PAGE_SIZE;
> -				continue;
> -			}
> +			if (dma->iommu_mapped) {
> +				phys = iommu_iova_to_phys(d->domain, iova);
> +
> +				if (WARN_ON(!phys)) {
> +					iova += PAGE_SIZE;
> +					continue;
> +				}
>  
> -			size = PAGE_SIZE;
> +				size = PAGE_SIZE;
>  
> -			while (iova + size < dma->iova + dma->size &&
> -			       phys + size == iommu_iova_to_phys(d->domain,
> +				while (iova + size < dma->iova + dma->size &&
> +				    phys + size == iommu_iova_to_phys(d->domain,
>  								 iova + size))
> -				size += PAGE_SIZE;
> +					size += PAGE_SIZE;
> +			} else {
> +				unsigned long pfn;
> +				unsigned long vaddr = dma->vaddr +
> +						     (iova - dma->iova);
> +				size_t n = dma->iova + dma->size - iova;
> +				long npage;
> +
> +				npage = __vfio_pin_pages_remote(vaddr,
> +								n >> PAGE_SHIFT,
> +								dma->prot,
> +								&pfn);
> +				if (npage <= 0) {
> +					WARN_ON(!npage);
> +					ret = (int)npage;
> +					return ret;
> +				}
> +
> +				phys = pfn << PAGE_SHIFT;
> +				size = npage << PAGE_SHIFT;
> +			}
>  
>  			ret = iommu_map(domain->domain, iova, phys,
>  					size, dma->prot | domain->prot);
> @@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  
>  			iova += size;
>  		}
> +
> +		if (!dma->iommu_mapped) {
> +			dma->iommu_mapped = true;
> +			vfio_update_accounting(iommu, dma);
> +		}

This is the case where we potentially have pinned pfns and we've added
an iommu-mapped device and need to adjust accounting.  But we've fully
pinned and accounted the entire iommu-mapped space while still holding
the accounting for any pfn-mapped space.  So for a time, assuming some
pfn-pinned pages, we have duplicate accounting.  How does userspace
deal with that?  For instance, if I'm using an mdev device where the
vendor driver has pinned 512MB of guest memory, and I then hot-add an
assigned NIC so that the entire VM address space gets pinned, that
pinning will fail unless my locked-memory limit is at least 512MB in
excess of my VM size.  Additionally, the user doesn't know how much
memory the vendor driver is going to pin; it might be the whole VM
address space, so the user would need 2x the locked-memory limit.

>  	}
>  
>  	return 0;
> @@ -734,11 +1110,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>  	__free_pages(pages, order);
>  }
>  
> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
> +				   struct iommu_group *iommu_group)
> +{
> +	struct vfio_group *g;
> +
> +	list_for_each_entry(g, &domain->group_list, next) {
> +		if (g->iommu_group == iommu_group)
> +			return g;
> +	}
> +
> +	return NULL;
> +}

It would make review easier if refactoring like this, which splits code
into a separate function with no functional change on the calling path,
were done as a separate patch.

> +
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					 struct iommu_group *iommu_group)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
> -	struct vfio_group *group, *g;
> +	struct vfio_group *group;
>  	struct vfio_domain *domain, *d;
>  	struct bus_type *bus = NULL;
>  	int ret;
> @@ -746,10 +1135,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	mutex_lock(&iommu->lock);
>  
>  	list_for_each_entry(d, &iommu->domain_list, next) {
> -		list_for_each_entry(g, &d->group_list, next) {
> -			if (g->iommu_group != iommu_group)
> -				continue;
> +		if (find_iommu_group(d, iommu_group)) {
> +			mutex_unlock(&iommu->lock);
> +			return -EINVAL;
> +		}
> +	}
>  
> +	if (iommu->local_domain) {
> +		if (find_iommu_group(iommu->local_domain, iommu_group)) {
>  			mutex_unlock(&iommu->lock);
>  			return -EINVAL;
>  		}
> @@ -769,6 +1162,34 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_free;
>  
> +	if (IS_ENABLED(CONFIG_VFIO_MDEV) && !iommu_present(bus) &&
> +	    (bus == &mdev_bus_type)) {
> +		if (iommu->local_domain) {
> +			list_add(&group->next,
> +				 &iommu->local_domain->group_list);
> +			kfree(domain);
> +			mutex_unlock(&iommu->lock);
> +			return 0;
> +		}
> +
> +		domain->local_addr_space =
> +				      kzalloc(sizeof(*domain->local_addr_space),
> +					      GFP_KERNEL);
> +		if (!domain->local_addr_space) {
> +			ret = -ENOMEM;
> +			goto out_free;
> +		}
> +
> +		domain->local_addr_space->task = current;
> +		INIT_LIST_HEAD(&domain->group_list);
> +		list_add(&group->next, &domain->group_list);
> +		domain->local_addr_space->pfn_list = RB_ROOT;
> +		mutex_init(&domain->local_addr_space->pfn_list_lock);
> +		iommu->local_domain = domain;
> +		mutex_unlock(&iommu->lock);
> +		return 0;


This could have been

		if (!iommu->local_domain) {
			domain->local_addr_space =
				kzalloc(sizeof(*domain->local_addr_space),
					GFP_KERNEL);
			if (!domain->local_addr_space) {
				ret = -ENOMEM;
				goto out_free;
			}

			domain->local_addr_space->task = current;
			domain->local_addr_space->pfn_list = RB_ROOT;
			mutex_init(&domain->local_addr_space->pfn_list_lock);
			INIT_LIST_HEAD(&domain->group_list);
			iommu->local_domain = domain;
		} else {
			kfree(domain);
		}

		list_add(&group->next, &iommu->local_domain->group_list);

		mutex_unlock(&iommu->lock);
		return 0;

> +	}
> +
>  	domain->domain = iommu_domain_alloc(bus);
>  	if (!domain->domain) {
>  		ret = -EIO;
> @@ -859,6 +1280,41 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
>  		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
>  }
>  
> +static void vfio_iommu_unmap_unpin_reaccount(struct vfio_iommu *iommu)
> +{
> +	struct vfio_domain *domain = iommu->local_domain;
> +	struct vfio_dma *dma, *tdma;
> +	struct rb_node *n;
> +	long locked = 0;
> +
> +	rbtree_postorder_for_each_entry_safe(dma, tdma, &iommu->dma_list,
> +					     node) {
> +		vfio_unmap_unpin(iommu, dma);
> +	}
> +
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +	n = rb_first(&domain->local_addr_space->pfn_list);
> +
> +	for (; n; n = rb_next(n))
> +		locked++;
> +
> +	vfio_lock_acct(domain->local_addr_space->task, locked);
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +}
> +
> +static void vfio_local_unpin_all(struct vfio_domain *domain)
> +{
> +	struct rb_node *node;
> +
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +	while ((node = rb_first(&domain->local_addr_space->pfn_list)))
> +		vfio_unpin_pfn(domain,
> +				rb_entry(node, struct vfio_pfn, node), false);
> +
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +}
> +
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
>  					  struct iommu_group *iommu_group)
>  {
> @@ -868,31 +1324,57 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  
>  	mutex_lock(&iommu->lock);
>  
> -	list_for_each_entry(domain, &iommu->domain_list, next) {
> -		list_for_each_entry(group, &domain->group_list, next) {
> -			if (group->iommu_group != iommu_group)
> -				continue;
> -
> -			iommu_detach_group(domain->domain, iommu_group);
> +	if (iommu->local_domain) {
> +		domain = iommu->local_domain;
> +		group = find_iommu_group(domain, iommu_group);
> +		if (group) {
>  			list_del(&group->next);
>  			kfree(group);
> -			/*
> -			 * Group ownership provides privilege, if the group
> -			 * list is empty, the domain goes away.  If it's the
> -			 * last domain, then all the mappings go away too.
> -			 */
> +
>  			if (list_empty(&domain->group_list)) {
> -				if (list_is_singular(&iommu->domain_list))
> +				vfio_local_unpin_all(domain);
> +				if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
>  					vfio_iommu_unmap_unpin_all(iommu);
> -				iommu_domain_free(domain->domain);
> -				list_del(&domain->next);
>  				kfree(domain);
> +				iommu->local_domain = NULL;
> +			}
> +			goto detach_group_done;
> +		}
> +	}
> +
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		goto detach_group_done;
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		group = find_iommu_group(domain, iommu_group);
> +		if (!group)
> +			continue;
> +
> +		iommu_detach_group(domain->domain, iommu_group);
> +		list_del(&group->next);
> +		kfree(group);
> +		/*
> +		 * Group ownership provides privilege, if the group list is
> +		 * empty, the domain goes away. If it's the last domain with
> +		 * iommu and local domain doesn't exist, then all the mappings
> +		 * go away too. If it's the last domain with iommu and local
> +		 * domain exist, update accounting
> +		 */
> +		if (list_empty(&domain->group_list)) {
> +			if (list_is_singular(&iommu->domain_list)) {
> +				if (!iommu->local_domain)
> +					vfio_iommu_unmap_unpin_all(iommu);
> +				else
> +					vfio_iommu_unmap_unpin_reaccount(iommu);
>  			}
> -			goto done;
> +			iommu_domain_free(domain->domain);
> +			list_del(&domain->next);
> +			kfree(domain);
>  		}
> +		break;
>  	}
>  
> -done:
> +detach_group_done:
>  	mutex_unlock(&iommu->lock);
>  }
>  
> @@ -924,27 +1406,48 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>  	return iommu;
>  }
>  
> +static void vfio_release_domain(struct vfio_domain *domain)
> +{
> +	struct vfio_group *group, *group_tmp;
> +
> +	list_for_each_entry_safe(group, group_tmp,
> +				 &domain->group_list, next) {
> +		if (!domain->local_addr_space)
> +			iommu_detach_group(domain->domain, group->iommu_group);
> +		list_del(&group->next);
> +		kfree(group);
> +	}
> +
> +	if (domain->local_addr_space)
> +		vfio_local_unpin_all(domain);
> +	else
> +		iommu_domain_free(domain->domain);
> +}
> +
>  static void vfio_iommu_type1_release(void *iommu_data)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
>  	struct vfio_domain *domain, *domain_tmp;
> -	struct vfio_group *group, *group_tmp;
> +
> +	if (iommu->local_domain) {
> +		vfio_release_domain(iommu->local_domain);
> +		kfree(iommu->local_domain);
> +		iommu->local_domain = NULL;
> +	}
>  
>  	vfio_iommu_unmap_unpin_all(iommu);
>  
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		goto release_exit;
> +
>  	list_for_each_entry_safe(domain, domain_tmp,
>  				 &iommu->domain_list, next) {
> -		list_for_each_entry_safe(group, group_tmp,
> -					 &domain->group_list, next) {
> -			iommu_detach_group(domain->domain, group->iommu_group);
> -			list_del(&group->next);
> -			kfree(group);
> -		}
> -		iommu_domain_free(domain->domain);
> +		vfio_release_domain(domain);
>  		list_del(&domain->next);
>  		kfree(domain);
>  	}
>  
> +release_exit:
>  	kfree(iommu);
>  }
>  
> @@ -1048,6 +1551,8 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>  	.ioctl		= vfio_iommu_type1_ioctl,
>  	.attach_group	= vfio_iommu_type1_attach_group,
>  	.detach_group	= vfio_iommu_type1_detach_group,
> +	.pin_pages	= vfio_iommu_type1_pin_pages,
> +	.unpin_pages	= vfio_iommu_type1_unpin_pages,
>  };
>  
>  static int __init vfio_iommu_type1_init(void)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 0ecae0b1cd34..0bd25ba6223d 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -17,6 +17,7 @@
>  #include <linux/workqueue.h>
>  #include <linux/poll.h>
>  #include <uapi/linux/vfio.h>
> +#include <linux/mdev.h>
>  
>  /**
>   * struct vfio_device_ops - VFIO bus driver device callbacks
> @@ -75,7 +76,11 @@ struct vfio_iommu_driver_ops {
>  					struct iommu_group *group);
>  	void		(*detach_group)(void *iommu_data,
>  					struct iommu_group *group);
> -
> +	long		(*pin_pages)(void *iommu_data, unsigned long *user_pfn,
> +				     long npage, int prot,
> +				     unsigned long *phys_pfn);
> +	long		(*unpin_pages)(void *iommu_data, unsigned long *pfn,
> +				       long npage);
>  };
>  
>  extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
> @@ -127,6 +132,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>  }
>  #endif /* CONFIG_EEH */
>  
> +extern long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
> +			   long npage, int prot, unsigned long *phys_pfn);
> +
> +extern long vfio_unpin_pages(struct device *dev, unsigned long *pfn,
> +			     long npage);
> +
>  /*
>   * IRQfd - generic
>   */

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 6/6] Add common functions for SET_IRQS and GET_REGION_INFO ioctls
  2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-11 23:18     ` Alex Williamson
  -1 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-11 23:18 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Tue, 11 Oct 2016 01:58:37 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Add common functions for SET_IRQS and to add capability buffer for
> GET_REGION_INFO ioctls

Clearly should be two (or more) separate patches since SET_IRQS and
REGION_INFO are unrelated changes.  Each of the two capabilities handled
could possibly be separate patches as well.

 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: Id9e976a2c08b9b2b37da77dac4365ae8f6024b4a
> ---
>  drivers/vfio/pci/vfio_pci.c | 103 +++++++++++++++------------------------
>  drivers/vfio/vfio.c         | 116 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/vfio.h        |   7 +++
>  3 files changed, 162 insertions(+), 64 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 188b1ff03f5f..f312cbb0eebc 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -478,12 +478,12 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev *pdev,
>  }
>  
>  static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
> +				struct vfio_region_info *info,
>  				struct vfio_info_cap *caps)
>  {
> -	struct vfio_info_cap_header *header;
>  	struct vfio_region_info_cap_sparse_mmap *sparse;
>  	size_t end, size;
> -	int nr_areas = 2, i = 0;
> +	int nr_areas = 2, i = 0, ret;
>  
>  	end = pci_resource_len(vdev->pdev, vdev->msix_bar);
>  
> @@ -494,13 +494,10 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
>  
>  	size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
>  
> -	header = vfio_info_cap_add(caps, size,
> -				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
> -	if (IS_ERR(header))
> -		return PTR_ERR(header);
> +	sparse = kzalloc(size, GFP_KERNEL);
> +	if (!sparse)
> +		return -ENOMEM;
>  
> -	sparse = container_of(header,
> -			      struct vfio_region_info_cap_sparse_mmap, header);
>  	sparse->nr_areas = nr_areas;
>  
>  	if (vdev->msix_offset & PAGE_MASK) {
> @@ -516,24 +513,14 @@ static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev,
>  		i++;
>  	}
>  
> -	return 0;
> -}
> -
> -static int region_type_cap(struct vfio_pci_device *vdev,
> -			   struct vfio_info_cap *caps,
> -			   unsigned int type, unsigned int subtype)
> -{
> -	struct vfio_info_cap_header *header;
> -	struct vfio_region_info_cap_type *cap;
> +	info->flags |= VFIO_REGION_INFO_FLAG_CAPS;
>  
> -	header = vfio_info_cap_add(caps, sizeof(*cap),
> -				   VFIO_REGION_INFO_CAP_TYPE, 1);
> -	if (IS_ERR(header))
> -		return PTR_ERR(header);
> +	ret = vfio_info_add_capability(info, caps,
> +				      VFIO_REGION_INFO_CAP_SPARSE_MMAP, sparse);
> +	kfree(sparse);
>  
> -	cap = container_of(header, struct vfio_region_info_cap_type, header);
> -	cap->type = type;
> -	cap->subtype = subtype;
> +	if (ret)
> +		return ret;
>  
>  	return 0;

Just: return ret;

>  }
> @@ -628,7 +615,8 @@ static long vfio_pci_ioctl(void *device_data,
>  			    IORESOURCE_MEM && info.size >= PAGE_SIZE) {
>  				info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
>  				if (info.index == vdev->msix_bar) {
> -					ret = msix_sparse_mmap_cap(vdev, &caps);
> +					ret = msix_sparse_mmap_cap(vdev, &info,
> +								   &caps);
>  					if (ret)
>  						return ret;
>  				}
> @@ -676,6 +664,9 @@ static long vfio_pci_ioctl(void *device_data,
>  
>  			break;
>  		default:
> +		{
> +			struct vfio_region_info_cap_type cap_type;
> +
>  			if (info.index >=
>  			    VFIO_PCI_NUM_REGIONS + vdev->num_regions)
>  				return -EINVAL;
> @@ -684,29 +675,26 @@ static long vfio_pci_ioctl(void *device_data,
>  
>  			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
>  			info.size = vdev->region[i].size;
> -			info.flags = vdev->region[i].flags;
> +			info.flags = vdev->region[i].flags |
> +				     VFIO_REGION_INFO_FLAG_CAPS;
>  
> -			ret = region_type_cap(vdev, &caps,
> -					      vdev->region[i].type,
> -					      vdev->region[i].subtype);
> +			cap_type.type = vdev->region[i].type;
> +			cap_type.subtype = vdev->region[i].subtype;
> +
> +			ret = vfio_info_add_capability(&info, &caps,
> +						      VFIO_REGION_INFO_CAP_TYPE,
> +						      &cap_type);
>  			if (ret)
>  				return ret;
> +
> +		}
>  		}
>  
> -		if (caps.size) {
> -			info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
> -			if (info.argsz < sizeof(info) + caps.size) {
> -				info.argsz = sizeof(info) + caps.size;
> -				info.cap_offset = 0;
> -			} else {
> -				vfio_info_cap_shift(&caps, sizeof(info));
> -				if (copy_to_user((void __user *)arg +
> -						  sizeof(info), caps.buf,
> -						  caps.size)) {
> -					kfree(caps.buf);
> -					return -EFAULT;
> -				}
> -				info.cap_offset = sizeof(info);
> +		if (info.cap_offset) {
> +			if (copy_to_user((void __user *)arg + info.cap_offset,
> +					 caps.buf, caps.size)) {
> +				kfree(caps.buf);
> +				return -EFAULT;
>  			}
>  
>  			kfree(caps.buf);
> @@ -754,35 +742,22 @@ static long vfio_pci_ioctl(void *device_data,
>  	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
>  		struct vfio_irq_set hdr;
>  		u8 *data = NULL;
> -		int ret = 0;
> +		int max, ret = 0, data_size = 0;
>  
>  		minsz = offsetofend(struct vfio_irq_set, count);
>  
>  		if (copy_from_user(&hdr, (void __user *)arg, minsz))
>  			return -EFAULT;
>  
> -		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
> -		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
> -				  VFIO_IRQ_SET_ACTION_TYPE_MASK))
> -			return -EINVAL;
> -
> -		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
> -			size_t size;
> -			int max = vfio_pci_get_irq_count(vdev, hdr.index);
> +		max = vfio_pci_get_irq_count(vdev, hdr.index);
>  
> -			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
> -				size = sizeof(uint8_t);
> -			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
> -				size = sizeof(int32_t);
> -			else
> -				return -EINVAL;
> -
> -			if (hdr.argsz - minsz < hdr.count * size ||
> -			    hdr.start >= max || hdr.start + hdr.count > max)
> -				return -EINVAL;


vfio_platform has very similar code that would also need to be updated.

> +		ret = vfio_set_irqs_validate_and_prepare(&hdr, max, &data_size);
> +		if (ret)
> +			return ret;
>  
> +		if (data_size) {
>  			data = memdup_user((void __user *)(arg + minsz),
> -					   hdr.count * size);
> +					    data_size);
>  			if (IS_ERR(data))
>  				return PTR_ERR(data);
>  		}
> @@ -790,7 +765,7 @@ static long vfio_pci_ioctl(void *device_data,
>  		mutex_lock(&vdev->igate);
>  
>  		ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
> -					      hdr.start, hdr.count, data);
> +				hdr.start, hdr.count, data);

White space bogosity.

>  
>  		mutex_unlock(&vdev->igate);
>  		kfree(data);
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index e3e342861e04..0185d5fb2c85 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -1782,6 +1782,122 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
>  }
>  EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
>  
> +static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
> +{
> +	struct vfio_info_cap_header *header;
> +	struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
> +	size_t size;
> +
> +	size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
> +	header = vfio_info_cap_add(caps, size,
> +				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
> +	if (IS_ERR(header))
> +		return PTR_ERR(header);
> +
> +	sparse_cap = container_of(header,
> +			struct vfio_region_info_cap_sparse_mmap, header);
> +	sparse_cap->nr_areas = sparse->nr_areas;
> +	memcpy(sparse_cap->areas, sparse->areas,
> +	       sparse->nr_areas * sizeof(*sparse->areas));
> +	return 0;
> +}
> +
> +static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
> +{
> +	struct vfio_info_cap_header *header;
> +	struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
> +
> +	header = vfio_info_cap_add(caps, sizeof(*cap),
> +				   VFIO_REGION_INFO_CAP_TYPE, 1);
> +	if (IS_ERR(header))
> +		return PTR_ERR(header);
> +
> +	type_cap = container_of(header, struct vfio_region_info_cap_type,
> +				header);
> +	type_cap->type = cap->type;
> +	type_cap->subtype = cap->subtype;
> +	return 0;
> +}

Why can't we just do a memcpy of all the data past the header?  Do we
need separate functions for these?

vfio_info_cap_add() should now be static and unexported, right?

> +
> +int vfio_info_add_capability(struct vfio_region_info *info,
> +			     struct vfio_info_cap *caps,
> +			     int cap_type_id,
> +			     void *cap_type)
> +{
> +	int ret;
> +
> +	if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS) || !cap_type)

Why make the caller set flags?  That seems rather arbitrary, since this
function controls cap_offset and whether we actually end up copying the
data.

> +		return 0;
> +
> +	switch (cap_type_id) {
> +	case VFIO_REGION_INFO_CAP_SPARSE_MMAP:
> +		ret = sparse_mmap_cap(caps, cap_type);
> +		if (ret)
> +			return ret;
> +		break;
> +
> +	case VFIO_REGION_INFO_CAP_TYPE:
> +		ret = region_type_cap(caps, cap_type);
> +		if (ret)
> +			return ret;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}
> +
> +	if (caps->size) {
> +		if (info->argsz < sizeof(*info) + caps->size) {
> +			info->argsz = sizeof(*info) + caps->size;
> +			info->cap_offset = 0;
> +		} else {
> +			vfio_info_cap_shift(caps, sizeof(*info));
> +			info->cap_offset = sizeof(*info);
> +		}
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL(vfio_info_add_capability);
> +
> +int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr, int num_irqs,
> +				       int *data_size)
> +{
> +	unsigned long minsz;
> +
> +	minsz = offsetofend(struct vfio_irq_set, count);
> +
> +	if ((hdr->argsz < minsz) || (hdr->index >= VFIO_PCI_NUM_IRQS) ||
> +	    (hdr->flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
> +				VFIO_IRQ_SET_ACTION_TYPE_MASK)))
> +		return -EINVAL;
> +
> +	if (data_size)
> +		*data_size = 0;
> +
> +	if (!(hdr->flags & VFIO_IRQ_SET_DATA_NONE)) {
> +		size_t size;
> +
> +		if (hdr->flags & VFIO_IRQ_SET_DATA_BOOL)
> +			size = sizeof(uint8_t);
> +		else if (hdr->flags & VFIO_IRQ_SET_DATA_EVENTFD)
> +			size = sizeof(int32_t);
> +		else
> +			return -EINVAL;
> +
> +		if ((hdr->argsz - minsz < hdr->count * size) ||
> +		    (hdr->start >= num_irqs) ||
> +		    (hdr->start + hdr->count > num_irqs))
> +			return -EINVAL;
> +
> +		if (!data_size)
> +			return -EINVAL;
> +
> +		*data_size = hdr->count * size;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(vfio_set_irqs_validate_and_prepare);
> +
>  static struct vfio_group *vfio_group_from_dev(struct device *dev)
>  {
>  	struct vfio_device *device;
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 0bd25ba6223d..5641dab72ded 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -108,6 +108,13 @@ extern struct vfio_info_cap_header *vfio_info_cap_add(
>  		struct vfio_info_cap *caps, size_t size, u16 id, u16 version);
>  extern void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset);
>  
> +extern int vfio_info_add_capability(struct vfio_region_info *info,
> +				    struct vfio_info_cap *caps,
> +				    int cap_type_id, void *cap_type);
> +
> +extern int vfio_set_irqs_validate_and_prepare(struct vfio_irq_set *hdr,
> +					      int num_irqs, int *data_size);
> +
>  struct pci_dev;
>  #ifdef CONFIG_EEH
>  extern void vfio_spapr_pci_eeh_open(struct pci_dev *pdev);


* Re: [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-11 20:44       ` Kirti Wankhede
@ 2016-10-12  1:52         ` Tian, Kevin
  -1 siblings, 0 replies; 73+ messages in thread
From: Tian, Kevin @ 2016-10-12  1:52 UTC (permalink / raw)
  To: Kirti Wankhede, Daniel P. Berrange
  Cc: Song, Jike, cjia, kvm, qemu-devel, alex.williamson, kraxel,
	pbonzini, bjsdjshi

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Wednesday, October 12, 2016 4:45 AM
> >> +* mdev_supported_types:
> >> +    List of current supported mediated device types and its details are added
> >> +in this directory in following format:
> >> +
> >> +|- <parent phy device>
> >> +|--- Vendor-specific-attributes [optional]
> >> +|--- mdev_supported_types
> >> +|     |--- <type id>
> >> +|     |   |--- create
> >> +|     |   |--- name
> >> +|     |   |--- available_instances
> >> +|     |   |--- description /class
> >> +|     |   |--- [devices]
> >> +|     |--- <type id>
> >> +|     |   |--- create
> >> +|     |   |--- name
> >> +|     |   |--- available_instances
> >> +|     |   |--- description /class
> >> +|     |   |--- [devices]
> >> +|     |--- <type id>
> >> +|          |--- create
> >> +|          |--- name
> >> +|          |--- available_instances
> >> +|          |--- description /class
> >> +|          |--- [devices]
> >> +
> >> +[TBD : description or class is yet to be decided. This will change.]
> >
> > I thought that in previous discussions we had agreed to drop
> > the <type id> concept and use the name as the unique identifier.
> > When reporting these types in libvirt we won't want to report
> > the type id values - we'll want the name strings to be unique.
> >
> 
> The 'name' might not be unique, but type_id will be. As Neo pointed
> out in an earlier discussion, virtual devices can come from two
> different physical devices; the end user would be presented with what
> they had selected, but there will be internal implementation
> differences. In that case 'type_id' will be unique.
> 

Hi, Kirti, my understanding is that Neo agreed to use a unique type
string (even if you still call it <type id>), so there is no need for an
additional 'name' field; it can be folded into the 'description' field. See the quote below:

--<from Alex>--
> I think your discovery only means that for your vendor driver, the name
> will be "11" (as a string).  Perhaps you'd like some sort of vendor
> provided description within each type, but I am not in favor of having
> an arbitrary integer value imply something specific within the sysfs
> interface.  IOW, the NVIDIA vendor driver should be able to create:
> 
> 11
> ├── create
> ├── description
> ├── etc
> └── resolution
> 
> While Intel might create:
> 
> Skylake-vGPU
> ├── create
> ├── description
> ├── etc
> └── resolution
> 
> Maybe "description" is optional for vendors that use useful names?
> Thanks,

--<From Neo>--
> I think we should be able to have a unique vendor type string instead of an
> arbitrary integer value there as long as we are allowed to have a description
> field that can be used to show to the end user as "name / label". 
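
Under that proposal, each type group would be keyed by a unique type string, with the human-readable label living in 'description'. A purely illustrative shell mock of the layout, built in a scratch directory (the type names '11' and 'Skylake-vGPU' come from Alex's example above; everything else is hypothetical):

```shell
#!/bin/sh
# Mock the proposed mdev_supported_types layout in a scratch directory.
root=$(mktemp -d)

for type in "11" "Skylake-vGPU"; do
    dir="$root/mdev_supported_types/$type"
    mkdir -p "$dir"
    touch "$dir/create"
    echo "vendor label for $type" > "$dir/description"
    echo 4 > "$dir/available_instances"
done

ls "$root/mdev_supported_types"

# On a real system the device would then be created with something like:
#   echo $UUID > /sys/.../mdev_supported_types/<type>/create
```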

Thanks
Kevin


* Re: [PATCH v8 1/6] vfio: Mediated device Core driver
  2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-12  8:39     ` Tian, Kevin
  -1 siblings, 0 replies; 73+ messages in thread
From: Tian, Kevin @ 2016-10-12  8:39 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: bjsdjshi, Song, Jike, qemu-devel, kvm

> From: Kirti Wankhede
> Sent: Tuesday, October 11, 2016 4:29 AM
> 
[...]

> +
> +/*
> + * mdev_unregister_device : Unregister a parent device
> + * @dev: device structure representing parent device.
> + *
> + * Remove device from list of registered parent devices. Give a chance to free
> + * existing mediated devices for given device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct parent_device *parent;
> +	bool force_remove = true;
> +
> +	mutex_lock(&parent_list_lock);
> +	parent = __find_parent_device(dev);
> +
> +	if (!parent) {
> +		mutex_unlock(&parent_list_lock);
> +		return;
> +	}
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	/*
> +	 * Remove parent from the list and remove "mdev_supported_types"
> +	 * sysfs files so that no new mediated device could be
> +	 * created for this parent
> +	 */
> +	list_del(&parent->next);
> +	parent_remove_sysfs_files(parent);

this could be moved out of mutex.

> +
> +	mutex_unlock(&parent_list_lock);
> +
> +	device_for_each_child(dev, (void *)&force_remove, mdev_device_remove);
> +	mdev_put_parent(parent);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +

Thanks
Kevin


* RE: [PATCH v8 3/6] vfio iommu: Add support for mediated devices
  2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-12 10:31     ` Tian, Kevin
  -1 siblings, 0 replies; 73+ messages in thread
From: Tian, Kevin @ 2016-10-12 10:31 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Song, Jike, bjsdjshi

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Tuesday, October 11, 2016 4:29 AM
> 
[...]
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 2ba19424e4a1..ce6d6dcbd9a8 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -55,18 +55,26 @@ MODULE_PARM_DESC(disable_hugepages,
> 
>  struct vfio_iommu {
>  	struct list_head	domain_list;
> +	struct vfio_domain	*local_domain;

Hi, Kirti, can you help explain the meaning of 'local' here? I have a hard time
understanding its intention... In your later change to vaddr_get_pfn, it's
even more confusing that get_user_pages_remote is used on a 'local_mm':

+	if (mm) {
+		down_read(&local_mm->mmap_sem);
+		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
+					!!(prot & IOMMU_WRITE), 0, page, NULL);
+		up_read(&local_mm->mmap_sem);
+	} else
+		ret = get_user_pages_fast(vaddr, 1,
+					  !!(prot & IOMMU_WRITE), page);


[...]
> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> +			 int prot, unsigned long *pfn)
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
> +	struct mm_struct *local_mm = (mm ? mm : current->mm);

it'd be clearer to call this variable 'mm' and the earlier input parameter
'local_mm'.

>  	int ret = -EFAULT;
> 
> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> +	if (mm) {
> +		down_read(&local_mm->mmap_sem);
> +		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
> +					!!(prot & IOMMU_WRITE), 0, page, NULL);
> +		up_read(&local_mm->mmap_sem);
> +	} else
> +		ret = get_user_pages_fast(vaddr, 1,
> +					  !!(prot & IOMMU_WRITE), page);
> +
> +	if (ret == 1) {
>  		*pfn = page_to_pfn(page[0]);
>  		return 0;
>  	}
> 
> -	down_read(&current->mm->mmap_sem);
> +	down_read(&local_mm->mmap_sem);
> 
> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
> 
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

[...]
> +static long __vfio_pin_pages_local(struct vfio_domain *domain,
> +				   unsigned long vaddr, int prot,
> +				   unsigned long *pfn_base,
> +				   bool do_accounting)

'pages' -> 'page' since only one page is handled here.

[...]
> +
> +static void __vfio_unpin_pages_local(struct vfio_domain *domain,
> +				     unsigned long pfn, int prot,
> +				     bool do_accounting)

ditto

> +{
> +	put_pfn(pfn, prot);
> +
> +	if (do_accounting)
> +		vfio_lock_acct(domain->local_addr_space->task, -1);
> +}
> +
> +static int vfio_unpin_pfn(struct vfio_domain *domain,
> +			  struct vfio_pfn *vpfn, bool do_accounting)
> +{
> +	__vfio_unpin_pages_local(domain, vpfn->pfn, vpfn->prot,
> +				 do_accounting);
> +
> +	if (atomic_dec_and_test(&vpfn->ref_count))
> +		vfio_remove_from_pfn_list(domain, vpfn);
> +
> +	return 1;
> +}
> +
> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
> +				       unsigned long *user_pfn,
> +				       long npage, int prot,
> +				       unsigned long *phys_pfn)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain;
> +	int i, j, ret;
> +	long retpage;
> +	unsigned long remote_vaddr;
> +	unsigned long *pfn = phys_pfn;
> +	struct vfio_dma *dma;
> +	bool do_accounting = false;
> +
> +	if (!iommu || !user_pfn || !phys_pfn)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	if (!iommu->local_domain) {
> +		ret = -EINVAL;
> +		goto pin_done;
> +	}
> +
> +	domain = iommu->local_domain;
> +
> +	/*
> +	 * If iommu capable domain exist in the container then all pages are
> +	 * already pinned and accounted. Accouting should be done if there is no
> +	 * iommu capable domain in the container.
> +	 */
> +	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +		dma_addr_t iova;
> +
> +		iova = user_pfn[i] << PAGE_SHIFT;
> +
> +		dma = vfio_find_dma(iommu, iova, 0);
> +		if (!dma) {
> +			ret = -EINVAL;
> +			goto pin_unwind;
> +		}
> +
> +		remote_vaddr = dma->vaddr + iova - dma->iova;

again, why "remote"_vaddr on a 'local' function?

> +
> +		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
> +						 &pfn[i], do_accounting);
> +		if (retpage <= 0) {
> +			WARN_ON(!retpage);
> +			ret = (int)retpage;
> +			goto pin_unwind;
> +		}
> +
> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +		/* search if pfn exist */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p) {
> +			atomic_inc(&p->ref_count);
> +			mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +			continue;
> +		}
> +
> +		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
> +					   pfn[i], prot);
> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +
> +		if (ret) {
> +			__vfio_unpin_pages_local(domain, pfn[i], prot,
> +						 do_accounting);
> +			goto pin_unwind;
> +		}
> +	}
> +
> +	ret = i;
> +	goto pin_done;
> +
> +pin_unwind:
> +	pfn[i] = 0;
> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +	for (j = 0; j < i; j++) {
> +		struct vfio_pfn *p;
> +
> +		p = vfio_find_pfn(domain, pfn[j]);
> +		if (p)
> +			vfio_unpin_pfn(domain, p, do_accounting);
> +
> +		pfn[j] = 0;
> +	}
> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +
> +pin_done:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
> +					 long npage)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL;
> +	long unlocked = 0;
> +	int i;
> +
> +	if (!iommu || !pfn)
> +		return -EINVAL;
> +

acquire iommu lock...

> +	domain = iommu->local_domain;
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p;
> +
> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
> +
> +		/* verify if pfn exist in pfn_list */
> +		p = vfio_find_pfn(domain, pfn[i]);
> +		if (p)
> +			unlocked += vfio_unpin_pfn(domain, p, true);

Should we force an accounting update here even when there is an iommu-capable
domain? It's not consistent with the earlier pin_pages.

> +
> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
> +	}
> 
>  	return unlocked;
>  }
> @@ -341,6 +636,12 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct
> vfio_dma *dma)
> 
>  	if (!dma->size)
>  		return;
> +
> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> +		return;

Is above check redundant to following dma->iommu_mapped?

> +
> +	if (!dma->iommu_mapped)
> +		return;
>  	/*
>  	 * We use the IOMMU to track the physical addresses, otherwise we'd
>  	 * need a much more complicated tracking system.  Unfortunately that

Thanks
Kevin


* Re: [PATCH v8 3/6] vfio iommu: Add support for mediated devices
  2016-10-11 22:06     ` [Qemu-devel] " Alex Williamson
@ 2016-10-12 10:38       ` Tian, Kevin
  -1 siblings, 0 replies; 73+ messages in thread
From: Tian, Kevin @ 2016-10-12 10:38 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: Song, Jike, cjia, kvm, qemu-devel, kraxel, pbonzini, bjsdjshi

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, October 12, 2016 6:07 AM
> > @@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> >
> >  			iova += size;
> >  		}
> > +
> > +		if (!dma->iommu_mapped) {
> > +			dma->iommu_mapped = true;
> > +			vfio_update_accounting(iommu, dma);
> > +		}
> 
> This is the case where we potentially have pinned pfns and we've added
> an iommu mapped device and need to adjust accounting.  But we've fully
> pinned and accounted the entire iommu mapped space while still holding
> the accounting for any pfn mapped space.  So for a time, assuming some
> pfn pinned pages, we have duplicate accounting.  How does userspace
> deal with that?  For instance, if I'm using an mdev device where the
> vendor driver has pinned 512MB of guest memory, then I hot-add an
> assigned NIC and the entire VM address space gets pinned, that pinning
> will fail unless my locked memory limits are at least 512MB in excess
> of my VM size.  Additionally, the user doesn't know how much memory the
> vendor driver is going to pin, it might be the whole VM address space,
> so the user would need 2x the locked memory limits.
> 

It looks like we have inconsistent policies for local/remote pinning:

- for local pinning, accounting is increased only when the region hasn't
been pinned in the remote path

- in remote pinning, however, accounting is always increased and then
adjusted back if the region was pinned in the local path earlier. This leaves
a window, as you said, where double accounting may occur on some pages.

What about adding a similar check in remote pinning, i.e. increasing the
accounting only when the region hasn't been pinned in the local path? That
way the accounting would always be accurate...

Thanks
Kevin


* Re: [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-12  1:52         ` [Qemu-devel] " Tian, Kevin
@ 2016-10-12 15:13           ` Kirti Wankhede
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-12 15:13 UTC (permalink / raw)
  To: Tian, Kevin, Daniel P. Berrange
  Cc: Song, Jike, cjia, kvm, qemu-devel, alex.williamson, kraxel,
	pbonzini, bjsdjshi



On 10/12/2016 7:22 AM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>> Sent: Wednesday, October 12, 2016 4:45 AM
>>>> +* mdev_supported_types:
>>>> +    List of current supported mediated device types and its details are added
>>>> +in this directory in following format:
>>>> +
>>>> +|- <parent phy device>
>>>> +|--- Vendor-specific-attributes [optional]
>>>> +|--- mdev_supported_types
>>>> +|     |--- <type id>
>>>> +|     |   |--- create
>>>> +|     |   |--- name
>>>> +|     |   |--- available_instances
>>>> +|     |   |--- description /class
>>>> +|     |   |--- [devices]
>>>> +|     |--- <type id>
>>>> +|     |   |--- create
>>>> +|     |   |--- name
>>>> +|     |   |--- available_instances
>>>> +|     |   |--- description /class
>>>> +|     |   |--- [devices]
>>>> +|     |--- <type id>
>>>> +|          |--- create
>>>> +|          |--- name
>>>> +|          |--- available_instances
>>>> +|          |--- description /class
>>>> +|          |--- [devices]
>>>> +
>>>> +[TBD : description or class is yet to be decided. This will change.]
>>>
>>> I thought that in previous discussions we had agreed to drop
>>> the <type id> concept and use the name as the unique identifier.
>>> When reporting these types in libvirt we won't want to report
>>> the type id values - we'll want the name strings to be unique.
>>>
>>
>> The 'name' might not be unique, but type_id will be. As Neo pointed
>> out in an earlier discussion, virtual devices can come from two
>> different physical devices; the end user would be presented with what
>> they had selected, but there will be internal implementation
>> differences. In that case 'type_id' will be unique.
>>
> 
> Hi, Kirti, my understanding is that Neo agreed to use an unique type
> string (if you still called it <type id>), and then no need of additional
> 'name' field which can be put inside 'description' field. See below quote:
> 

We had internal discussions about this within NVIDIA and found that
'name' might not be unique, whereas 'type_id' would be. I'm referring
to Neo's mail after that, where Neo pointed this out.

https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-12 15:13           ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-12 15:59             ` Alex Williamson
  -1 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-12 15:59 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Song, Jike, kvm, Tian, Kevin, qemu-devel, cjia, kraxel, pbonzini,
	bjsdjshi

On Wed, 12 Oct 2016 20:43:48 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/12/2016 7:22 AM, Tian, Kevin wrote:
> >> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> >> Sent: Wednesday, October 12, 2016 4:45 AM  
> >>>> +* mdev_supported_types:
> >>>> +    List of current supported mediated device types and its details are added
> >>>> +in this directory in following format:
> >>>> +
> >>>> +|- <parent phy device>
> >>>> +|--- Vendor-specific-attributes [optional]
> >>>> +|--- mdev_supported_types
> >>>> +|     |--- <type id>
> >>>> +|     |   |--- create
> >>>> +|     |   |--- name
> >>>> +|     |   |--- available_instances
> >>>> +|     |   |--- description /class
> >>>> +|     |   |--- [devices]
> >>>> +|     |--- <type id>
> >>>> +|     |   |--- create
> >>>> +|     |   |--- name
> >>>> +|     |   |--- available_instances
> >>>> +|     |   |--- description /class
> >>>> +|     |   |--- [devices]
> >>>> +|     |--- <type id>
> >>>> +|          |--- create
> >>>> +|          |--- name
> >>>> +|          |--- available_instances
> >>>> +|          |--- description /class
> >>>> +|          |--- [devices]
> >>>> +
> >>>> +[TBD : description or class is yet to be decided. This will change.]  
> >>>
> >>> I thought that in previous discussions we had agreed to drop
> >>> the <type id> concept and use the name as the unique identifier.
> >>> When reporting these types in libvirt we won't want to report
> >>> the type id values - we'll want the name strings to be unique.
> >>>  
> >>
> >> The 'name' might not be unique but type_id will be. For example that Neo
> >> pointed out in earlier discussion, virtual devices can come from two
> >> different physical devices, end user would be presented with what they
> >> had selected but there will be internal implementation differences. In
> >> that case 'type_id' will be unique.
> >>  
> > 
> > Hi, Kirti, my understanding is that Neo agreed to use an unique type
> > string (if you still called it <type id>), and then no need of additional
> > 'name' field which can be put inside 'description' field. See below quote:
> >   
> 
> We had internal discussions about this within NVIDIA and found that
> 'name' might not be unique, whereas 'type_id' would be. I'm referring
> to Neo's mail after that, where Neo pointed this out.
> 
> https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html

Everyone not privy to those internal discussions, including me, seems to
think we dropped type_id and that if a vendor does not have a stable
name, they can compose some sort of stable type description based on the
name+id, or even vendor+id, ex. NVIDIA-11.  So please share why we
haven't managed to kill off type_id yet.  No matter what internal
representation each vendor driver has of "type_id", it seems possible
for it to come up with a stable string to define a given configuration.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-12 15:59             ` [Qemu-devel] " Alex Williamson
@ 2016-10-12 19:02               ` Kirti Wankhede
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-12 19:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Song, Jike, kvm, Tian, Kevin, qemu-devel, cjia, kraxel, pbonzini,
	bjsdjshi



On 10/12/2016 9:29 PM, Alex Williamson wrote:
> On Wed, 12 Oct 2016 20:43:48 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 10/12/2016 7:22 AM, Tian, Kevin wrote:
>>>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>>>> Sent: Wednesday, October 12, 2016 4:45 AM  
>>>>>> +* mdev_supported_types:
>>>>>> +    List of current supported mediated device types and its details are added
>>>>>> +in this directory in following format:
>>>>>> +
>>>>>> +|- <parent phy device>
>>>>>> +|--- Vendor-specific-attributes [optional]
>>>>>> +|--- mdev_supported_types
>>>>>> +|     |--- <type id>
>>>>>> +|     |   |--- create
>>>>>> +|     |   |--- name
>>>>>> +|     |   |--- available_instances
>>>>>> +|     |   |--- description /class
>>>>>> +|     |   |--- [devices]
>>>>>> +|     |--- <type id>
>>>>>> +|     |   |--- create
>>>>>> +|     |   |--- name
>>>>>> +|     |   |--- available_instances
>>>>>> +|     |   |--- description /class
>>>>>> +|     |   |--- [devices]
>>>>>> +|     |--- <type id>
>>>>>> +|          |--- create
>>>>>> +|          |--- name
>>>>>> +|          |--- available_instances
>>>>>> +|          |--- description /class
>>>>>> +|          |--- [devices]
>>>>>> +
>>>>>> +[TBD : description or class is yet to be decided. This will change.]  
>>>>>
>>>>> I thought that in previous discussions we had agreed to drop
>>>>> the <type id> concept and use the name as the unique identifier.
>>>>> When reporting these types in libvirt we won't want to report
>>>>> the type id values - we'll want the name strings to be unique.
>>>>>  
>>>>
>>>> The 'name' might not be unique but type_id will be. For example that Neo
>>>> pointed out in earlier discussion, virtual devices can come from two
>>>> different physical devices, end user would be presented with what they
>>>> had selected but there will be internal implementation differences. In
>>>> that case 'type_id' will be unique.
>>>>  
>>>
>>> Hi, Kirti, my understanding is that Neo agreed to use an unique type
>>> string (if you still called it <type id>), and then no need of additional
>>> 'name' field which can be put inside 'description' field. See below quote:
>>>   
>>
>> We had internal discussions about this within NVIDIA and found that
>> 'name' might not be unique, whereas 'type_id' would be. I'm referring
>> to Neo's mail after that, where Neo pointed this out.
>>
>> https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html
> 
> Everyone not privy to those internal discussions, including me, seems to
> think we dropped type_id and that if a vendor does not have a stable
> name, they can compose some sort of stable type description based on the
> name+id, or even vendor+id, ex. NVIDIA-11.  So please share why we
> haven't managed to kill off type_id yet.  No matter what internal
> representation each vendor driver has of "type_id" it seems possible
> for it to come up with stable string to define a given configuration.


The 'type_id' is unique and the 'name' is not; the name is just a
virtual device name, i.e. a human-readable name. Because at this moment
Intel can't define a proper GPU class, we have to add a 'description'
field there as well to represent the features of this virtual device.
Once we have all agreed on the GPU class and its mandatory attributes,
the 'description' field can be removed. Here is an example:
type_id/type_name = NVIDIA_11,
name=M60-M0Q,
description="2560x1600, 2 displays, 512MB"

Neo's previous comment only applies to the situation where we have
the GPU class or optional attributes defined and recognized by libvirt.
Since that is not going to happen any time soon, we will have to have
the new 'description' field, and we don't want it mixed up with the
'name' field.

We can definitely have something like name+id as Alex recommended to
remove the 'name' field, but it will just require libvirt to have more
logic to parse that string.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 6/6] Add common functions for SET_IRQS and GET_REGION_INFO ioctls
  2016-10-11 23:18     ` [Qemu-devel] " Alex Williamson
@ 2016-10-12 19:37       ` Kirti Wankhede
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-12 19:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel, pbonzini, bjsdjshi



On 10/12/2016 4:48 AM, Alex Williamson wrote:
> On Tue, 11 Oct 2016 01:58:37 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> Add common functions for SET_IRQS and to add capability buffer for
>> GET_REGION_INFO ioctls
> 
> Clearly should be two (or more) separate patches since SET_IRQS and
> REGION_INFO are unrelated changes.  Each of the two capabilities handled
> could possibly be separate patches as well.
> 

Ok. I'll have the two separated.

>  
...

>> @@ -754,35 +742,22 @@ static long vfio_pci_ioctl(void *device_data,
>>  	} else if (cmd == VFIO_DEVICE_SET_IRQS) {
>>  		struct vfio_irq_set hdr;
>>  		u8 *data = NULL;
>> -		int ret = 0;
>> +		int max, ret = 0, data_size = 0;
>>  
>>  		minsz = offsetofend(struct vfio_irq_set, count);
>>  
>>  		if (copy_from_user(&hdr, (void __user *)arg, minsz))
>>  			return -EFAULT;
>>  
>> -		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
>> -		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
>> -				  VFIO_IRQ_SET_ACTION_TYPE_MASK))
>> -			return -EINVAL;
>> -
>> -		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
>> -			size_t size;
>> -			int max = vfio_pci_get_irq_count(vdev, hdr.index);
>> +		max = vfio_pci_get_irq_count(vdev, hdr.index);
>>  
>> -			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
>> -				size = sizeof(uint8_t);
>> -			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
>> -				size = sizeof(int32_t);
>> -			else
>> -				return -EINVAL;
>> -
>> -			if (hdr.argsz - minsz < hdr.count * size ||
>> -			    hdr.start >= max || hdr.start + hdr.count > max)
>> -				return -EINVAL;
> 
> 
> vfio_platform has very similar code that would also need to be updated.
>

Ok. Thanks for pointing that out. I'll update that too.


>> +		ret = vfio_set_irqs_validate_and_prepare(&hdr, max, &data_size);
>> +		if (ret)
>> +			return ret;
>>  
>> +		if (data_size) {
>>  			data = memdup_user((void __user *)(arg + minsz),
>> -					   hdr.count * size);
>> +					    data_size);
>>  			if (IS_ERR(data))
>>  				return PTR_ERR(data);
>>  		}
>> @@ -790,7 +765,7 @@ static long vfio_pci_ioctl(void *device_data,
>>  		mutex_lock(&vdev->igate);
>>  
>>  		ret = vfio_pci_set_irqs_ioctl(vdev, hdr.flags, hdr.index,
>> -					      hdr.start, hdr.count, data);
>> +				hdr.start, hdr.count, data);
> 
> White space bogosity.
> 
>>  
>>  		mutex_unlock(&vdev->igate);
>>  		kfree(data);
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index e3e342861e04..0185d5fb2c85 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -1782,6 +1782,122 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, size_t offset)
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
>>  
>> +static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
>> +{
>> +	struct vfio_info_cap_header *header;
>> +	struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
>> +	size_t size;
>> +
>> +	size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
>> +	header = vfio_info_cap_add(caps, size,
>> +				   VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
>> +	if (IS_ERR(header))
>> +		return PTR_ERR(header);
>> +
>> +	sparse_cap = container_of(header,
>> +			struct vfio_region_info_cap_sparse_mmap, header);
>> +	sparse_cap->nr_areas = sparse->nr_areas;
>> +	memcpy(sparse_cap->areas, sparse->areas,
>> +	       sparse->nr_areas * sizeof(*sparse->areas));
>> +	return 0;
>> +}
>> +
>> +static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
>> +{
>> +	struct vfio_info_cap_header *header;
>> +	struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
>> +
>> +	header = vfio_info_cap_add(caps, sizeof(*cap),
>> +				   VFIO_REGION_INFO_CAP_TYPE, 1);
>> +	if (IS_ERR(header))
>> +		return PTR_ERR(header);
>> +
>> +	type_cap = container_of(header, struct vfio_region_info_cap_type,
>> +				header);
>> +	type_cap->type = cap->type;
>> +	type_cap->subtype = cap->subtype;
>> +	return 0;
>> +}
> 
> Why can't we just do a memcpy of all the data past the header?  Do we
> need separate functions for these?
> 

In case of sparse_cap, the data past the header is variable, depending
on nr_areas. For region_type_cap, the data is fixed. The two
capabilities use different structures and different ids, so I think we
need separate functions.

> vfio_info_cap_add() should now be static and unexported, right?
> 

Yes.

>> +
>> +int vfio_info_add_capability(struct vfio_region_info *info,
>> +			     struct vfio_info_cap *caps,
>> +			     int cap_type_id,
>> +			     void *cap_type)
>> +{
>> +	int ret;
>> +
>> +	if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS) || !cap_type)
> 
> Why make the caller set flags, seems rather arbitrary since this
> function controls the cap_offset and whether we actually end up copying
> the data.
> 

I kept this flag to be set at the caller side so that a caller that
sets the flag must also fill cap_type.
Yes, it could be moved in here; in that case the sanity check would be
on !cap_type only, and the flag would be set based on cap_type.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-12 19:02               ` [Qemu-devel] " Kirti Wankhede
  (?)
@ 2016-10-12 21:44               ` Alex Williamson
  2016-10-13  9:22                 ` Kirti Wankhede
  -1 siblings, 1 reply; 73+ messages in thread
From: Alex Williamson @ 2016-10-12 21:44 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Tian, Kevin, Daniel P. Berrange, pbonzini, kraxel, cjia, Song,
	Jike, kvm, qemu-devel, bjsdjshi, Laine Stump

[-- Attachment #1: Type: text/plain, Size: 11056 bytes --]

On Thu, 13 Oct 2016 00:32:48 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/12/2016 9:29 PM, Alex Williamson wrote:
> > On Wed, 12 Oct 2016 20:43:48 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 10/12/2016 7:22 AM, Tian, Kevin wrote:  
> >>>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> >>>> Sent: Wednesday, October 12, 2016 4:45 AM    
> >>>>>> +* mdev_supported_types:
> >>>>>> +    List of current supported mediated device types and its details are added
> >>>>>> +in this directory in following format:
> >>>>>> +
> >>>>>> +|- <parent phy device>
> >>>>>> +|--- Vendor-specific-attributes [optional]
> >>>>>> +|--- mdev_supported_types
> >>>>>> +|     |--- <type id>
> >>>>>> +|     |   |--- create
> >>>>>> +|     |   |--- name
> >>>>>> +|     |   |--- available_instances
> >>>>>> +|     |   |--- description /class
> >>>>>> +|     |   |--- [devices]
> >>>>>> +|     |--- <type id>
> >>>>>> +|     |   |--- create
> >>>>>> +|     |   |--- name
> >>>>>> +|     |   |--- available_instances
> >>>>>> +|     |   |--- description /class
> >>>>>> +|     |   |--- [devices]
> >>>>>> +|     |--- <type id>
> >>>>>> +|          |--- create
> >>>>>> +|          |--- name
> >>>>>> +|          |--- available_instances
> >>>>>> +|          |--- description /class
> >>>>>> +|          |--- [devices]
> >>>>>> +
> >>>>>> +[TBD : description or class is yet to be decided. This will change.]    
> >>>>>
> >>>>> I thought that in previous discussions we had agreed to drop
> >>>>> the <type id> concept and use the name as the unique identifier.
> >>>>> When reporting these types in libvirt we won't want to report
> >>>>> the type id values - we'll want the name strings to be unique.
> >>>>>    
> >>>>
> >>>> The 'name' might not be unique but type_id will be. For example that Neo
> >>>> pointed out in earlier discussion, virtual devices can come from two
> >>>> different physical devices, end user would be presented with what they
> >>>> had selected but there will be internal implementation differences. In
> >>>> that case 'type_id' will be unique.
> >>>>    
> >>>
> >>> Hi, Kirti, my understanding is that Neo agreed to use an unique type
> >>> string (if you still called it <type id>), and then no need of additional
> >>> 'name' field which can be put inside 'description' field. See below quote:
> >>>     
> >>
> >> We had internal discussions about this within NVIDIA and found that
> >> 'name' might not be unique where as 'type_id' would be unique. I'm
> >> refering to Neo's mail after that, where Neo do pointed that out.
> >>
> >> https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html  
> > 
> > Everyone not privy to those internal discussions, including me, seems to
> > think we dropped type_id and that if a vendor does not have a stable
> > name, they can compose some sort of stable type description based on the
> > name+id, or even vendor+id, ex. NVIDIA-11.  So please share why we
> > haven't managed to kill off type_id yet.  No matter what internal
> > representation each vendor driver has of "type_id" it seems possible
> > for it to come up with stable string to define a given configuration.  
> 
> 
> The 'type_id' is unique and the 'name' are not, the name is just a
> virtual device name/ human readable name. Because at this moment Intel
> can't define a proper GPU class, we have to add a 'description' field
> there as well to represent the features of this virtual device, once we
> have all agreed with the GPU class and its mandatory attributes, the
> 'description' field can be removed. Here is an example,
> type_id/type_name = NVIDIA_11,
> name=M60-M0Q,
> description=2560x1600, 2 displays, 512MB"
> 
> Neo's previous comment only applies to the situation where we will have
> the GPU class or optional attributes defined and recognized by libvirt,
> since that is not going to happen any time soon, we will have to have
> the new 'description' field, and we don't want to have it mixed up with
> 'name' field.
> 
> We can definitely have something like name+id as Alex recommended to
> remove the 'name' field, but it will just require libvirt to have more
> logic to parse that string.

Let's use the mtty example driver provided in patch 5 so we can all
more clearly see how the interfaces work.  I'll start from the
beginning of my experience and work my way to the type/name thing.

(please add a modules_install target to the Makefile)

# modprobe mtty

Now what?  It seems like I need to have prior knowledge that this
driver supports mdev devices and I need to go hunt for them.  We need
to create a class (ex. /sys/class/mdev/) where a user can find all the
devices that participate in this mediated device infrastructure.  That
would point me to /sys/devices/mtty.

# tree /sys/devices/mtty
/sys/devices/mtty
|-- mdev_supported_types
|   `-- mtty1
|       |-- available_instances (1)
|       |-- create
|       |-- devices
|       `-- name ("Dual-port-serial")
|-- mtty_dev
|   `-- sample_mtty_dev ("This is phy device")
|-- power
|   |-- async
|   |-- autosuspend_delay_ms
|   |-- control
|   |-- runtime_active_kids
|   |-- runtime_active_time
|   |-- runtime_enabled
|   |-- runtime_status
|   |-- runtime_suspended_time
|   `-- runtime_usage
`-- uevent

Ok, but that was boring, we really need to have at least 2 supported
types to validate the interface, so without changing the actual device
backing, I pretended to have a single port vs dual port:

/sys/devices/mtty
|-- mdev_supported_types
|   |-- mtty1
|   |   |-- available_instances (24)
|   |   |-- create
|   |   |-- devices
|   |   `-- name (Single-port-serial)
|   `-- mtty2
|       |-- available_instances (12)
|       |-- create
|       |-- devices
|       `-- name (Dual-port-serial)
[snip]

I arbitrarily decided I have 24 ports; each single-port device uses 1
port and each dual-port device uses 2 ports.

Before I start creating devices, what are we going to key the libvirt
XML on?  Can we do anything to prevent vendors from colliding or do we
have any way to suggest meaningful and unique type_ids?  Presumably if
we had a PCI device hosting this, we would be rooted at that parent
device, ex. /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0.  Maybe
the type_id should automatically be prefixed by the vendor module name,
ex. mtty-1, i915-foo, nvidia-bar.  There's something missing for
deterministically creating a "XYZ" device and knowing exactly what that
means and finding a parent device that supports it.

Let's get to mdev creating...

# uuidgen > mdev_supported_types/mtty2/create
# tree /sys/devices/mtty
/sys/devices/mtty
|-- e68189be-700e-41f7-93a3-b5351e79c470
|   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
|   |-- iommu_group -> ../../../kernel/iommu_groups/63
|   |-- mtty2 -> ../mdev_supported_types/mtty2
|   |-- power
|   |   |-- async
|   |   |-- autosuspend_delay_ms
|   |   |-- control
|   |   |-- runtime_active_kids
|   |   |-- runtime_active_time
|   |   |-- runtime_enabled
|   |   |-- runtime_status
|   |   |-- runtime_suspended_time
|   |   `-- runtime_usage
|   |-- remove
|   |-- subsystem -> ../../../bus/mdev
|   |-- uevent
|   `-- vendor
|       `-- sample_mdev_dev ("This is MDEV e68189be-700e-41f7-93a3-b5351e79c470")
|-- mdev_supported_types
|   |-- mtty1
|   |   |-- available_instances (22)
|   |   |-- create
|   |   |-- devices
|   |   `-- name
|   `-- mtty2
|       |-- available_instances (11)
|       |-- create
|       |-- devices
|       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> ../../../e68189be-700e-41f7-93a3-b5351e79c470
|       `-- name

The mdev device was created directly under the parent, which seems like
it's going to get messy to me (ie. imagine dropping a bunch of uuids
into a PCI parent device's sysfs directory, how does a user know what
they are?).

Under the device we have "mtty2", shouldn't that be
"mdev_supported_type", which then links to mtty2?  Otherwise a user
needs to decode from the link what this attribute is.

Also here's an example of those vendor sysfs entries per device.  So
long as the vendor never expects a tool like libvirt to manipulate
attributes there, I can see how that could be pretty powerful.

Moving down to the mdev_supported_types, I've updated mtty so that it
actually adjusts available_instances, and we can now see a link under
the devices for mtty2.

Also worth noting is that a link for the device appears
in /sys/bus/mdev/devices.

BTW, specifying this device for QEMU vfio-pci is where the sysfsdev
option comes into play:

-device
vfio-pci,sysfsdev=/sys/devices/mtty/e68189be-700e-41f7-93a3-b5351e79c470

Which raises another question, we can tell through the vfio interfaces
that this is exposed as a PCI device, by creating a container
(open(/dev/vfio/vfio)), setting an iommu (ioctl(VFIO_SET_IOMMU)),
adding the group to the container (ioctl(VFIO_GROUP_SET_CONTAINER)),
getting the device (ioctl(VFIO_GROUP_GET_DEVICE_FD)), and finally
getting the device info (ioctl(VFIO_DEVICE_GET_INFO)) and checking the
flag bit that says the API is PCI.  That's a long path to go and has
stumbling blocks like the type of iommu that's available for the given
platform.  How do we make that manageable?  I don't think we want to
create some artificial relationship that the type of the parent
necessarily match the type of the child mdev, we've already broken that
with a simple mdev tty driver.

One more:

# uuidgen > mdev_supported_types/mtty1/create
# tree /sys/devices/mtty
/sys/devices/mtty
|-- a7ae17d1-2de4-44c2-ae58-20ae0a0befe8
|   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
|   |-- iommu_group -> ../../../kernel/iommu_groups/64
|   |-- mtty1 -> ../mdev_supported_types/mtty1
|   |-- power
[snip]
|   |-- remove
|   |-- subsystem -> ../../../bus/mdev
|   |-- uevent
|   `-- vendor
|       `-- sample_mdev_dev ("This is MDEV a7ae17d1-2de4-44c2-ae58-20ae0a0befe8")
|-- e68189be-700e-41f7-93a3-b5351e79c470
|   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
|   |-- iommu_group -> ../../../kernel/iommu_groups/63
|   |-- mtty2 -> ../mdev_supported_types/mtty2
|   |-- power
[snip]
|   |-- remove
|   |-- subsystem -> ../../../bus/mdev
|   |-- uevent
|   `-- vendor
|       `-- sample_mdev_dev ("This is MDEV e68189be-700e-41f7-93a3-b5351e79c470")
|-- mdev_supported_types
|   |-- mtty1
|   |   |-- available_instances (21)
|   |   |-- create
|   |   |-- devices
|   |   |   `-- a7ae17d1-2de4-44c2-ae58-20ae0a0befe8 -> ../../../a7ae17d1-2de4-44c2-ae58-20ae0a0befe8
|   |   `-- name
|   `-- mtty2
|       |-- available_instances (10)
|       |-- create
|       |-- devices
|       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> ../../../e68189be-700e-41f7-93a3-b5351e79c470
|       `-- name

Hopefully as expected with the caveats for the first example.

# echo 1 > a7ae17d1-2de4-44c2-ae58-20ae0a0befe8/remove
# echo 1 > e68189be-700e-41f7-93a3-b5351e79c470/remove

These do what they're supposed to, the devices are gone.

Ok, I've identified some issues, let's figure out how to resolve them.
Thanks,

Alex

(hack multi-port mtty patch attached)

[-- Attachment #2: mtty-multi-port.patch --]
[-- Type: text/x-patch, Size: 2716 bytes --]

diff --git a/Documentation/vfio-mdev/Makefile b/Documentation/vfio-mdev/Makefile
index ff6f8a3..721daf0 100644
--- a/Documentation/vfio-mdev/Makefile
+++ b/Documentation/vfio-mdev/Makefile
@@ -8,6 +8,9 @@ obj-m:=mtty.o
 default:
 	$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules
 
+modules_install:
+	$(MAKE) -C $(KDIR) SUBDIRS=$(PWD) modules_install
+
 clean:
 	@rm -rf .*.cmd *.mod.c *.o *.ko .tmp*
 	@rm -rf Module.* Modules.* modules.* .tmp_versions
diff --git a/Documentation/vfio-mdev/mtty.c b/Documentation/vfio-mdev/mtty.c
index 497c90e..05ab40d 100644
--- a/Documentation/vfio-mdev/mtty.c
+++ b/Documentation/vfio-mdev/mtty.c
@@ -141,8 +141,11 @@ struct mdev_state {
 	struct serial_port s[2];
 	struct mutex rxtx_lock;
 	struct vfio_device_info dev_info;
+	int nr_ports;
 };
 
+#define MAX_MTTYS 24
+
 struct mutex mdev_list_lock;
 struct list_head mdev_devices_list;
 
@@ -723,6 +726,11 @@ int mtty_create(struct kobject *kobj, struct mdev_device *mdev)
 	if (mdev_state == NULL)
 		return -ENOMEM;
 
+	if (!strcmp(kobj->name, "mtty1"))
+		mdev_state->nr_ports = 1;
+	else if (!strcmp(kobj->name, "mtty2"))
+		mdev_state->nr_ports = 2;
+
 	mdev_state->irq_index = -1;
 	mdev_state->s[0].max_fifo_size = MAX_FIFO_SIZE;
 	mdev_state->s[1].max_fifo_size = MAX_FIFO_SIZE;
@@ -1224,7 +1232,12 @@ const struct attribute_group *mdev_dev_groups[] = {
 static ssize_t
 name_show(struct kobject *kobj, struct device *dev, char *buf)
 {
-	return sprintf(buf, "Dual-port-serial\n");
+	if (!strcmp(kobj->name, "mtty1"))
+		return sprintf(buf, "Single-port-serial\n");
+	if (!strcmp(kobj->name, "mtty2"))
+		return sprintf(buf, "Dual-port-serial\n");
+
+	return -EINVAL;
 }
 
 MDEV_TYPE_ATTR_RO(name);
@@ -1232,7 +1245,20 @@ MDEV_TYPE_ATTR_RO(name);
 static ssize_t
 available_instances_show(struct kobject *kobj, struct device *dev, char *buf)
 {
-	return sprintf(buf, "1\n");
+	struct mdev_state *mds;
+	int ports, used = 0;
+
+	if (!strcmp(kobj->name, "mtty1"))
+		ports = 1;
+	else if (!strcmp(kobj->name, "mtty2"))
+		ports = 2;
+	else
+		return -EINVAL;
+
+	list_for_each_entry(mds, &mdev_devices_list, next) {
+		used += mds->nr_ports;
+	}
+	return sprintf(buf, "%d\n", (MAX_MTTYS - used)/ports);
 }
 
 MDEV_TYPE_ATTR_RO(available_instances);
@@ -1243,13 +1269,19 @@ static struct attribute *mdev_types_attrs[] = {
 	NULL,
 };
 
-static struct attribute_group mdev_type_group = {
+static struct attribute_group mdev_type_group1 = {
 	.name  = "mtty1",
 	.attrs = mdev_types_attrs,
 };
 
+static struct attribute_group mdev_type_group2 = {
+	.name  = "mtty2",
+	.attrs = mdev_types_attrs,
+};
+
 struct attribute_group *mdev_type_groups[] = {
-	&mdev_type_group,
+	&mdev_type_group1,
+	&mdev_type_group2,
 	NULL,
 };
 

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-12 19:02               ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-13  3:27                 ` Tian, Kevin
  -1 siblings, 0 replies; 73+ messages in thread
From: Tian, Kevin @ 2016-10-13  3:27 UTC (permalink / raw)
  To: Kirti Wankhede, Alex Williamson
  Cc: Song, Jike, cjia, kvm, qemu-devel, kraxel, pbonzini, bjsdjshi

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Thursday, October 13, 2016 3:03 AM
> 
> 
> On 10/12/2016 9:29 PM, Alex Williamson wrote:
> > On Wed, 12 Oct 2016 20:43:48 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >
> >> On 10/12/2016 7:22 AM, Tian, Kevin wrote:
> >>>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> >>>> Sent: Wednesday, October 12, 2016 4:45 AM
> >>>>>> +* mdev_supported_types:
> >>>>>> +    List of current supported mediated device types and its details are added
> >>>>>> +in this directory in following format:
> >>>>>> +
> >>>>>> +|- <parent phy device>
> >>>>>> +|--- Vendor-specific-attributes [optional]
> >>>>>> +|--- mdev_supported_types
> >>>>>> +|     |--- <type id>
> >>>>>> +|     |   |--- create
> >>>>>> +|     |   |--- name
> >>>>>> +|     |   |--- available_instances
> >>>>>> +|     |   |--- description /class
> >>>>>> +|     |   |--- [devices]
> >>>>>> +|     |--- <type id>
> >>>>>> +|     |   |--- create
> >>>>>> +|     |   |--- name
> >>>>>> +|     |   |--- available_instances
> >>>>>> +|     |   |--- description /class
> >>>>>> +|     |   |--- [devices]
> >>>>>> +|     |--- <type id>
> >>>>>> +|          |--- create
> >>>>>> +|          |--- name
> >>>>>> +|          |--- available_instances
> >>>>>> +|          |--- description /class
> >>>>>> +|          |--- [devices]
> >>>>>> +
> >>>>>> +[TBD : description or class is yet to be decided. This will change.]
> >>>>>
> >>>>> I thought that in previous discussions we had agreed to drop
> >>>>> the <type id> concept and use the name as the unique identifier.
> >>>>> When reporting these types in libvirt we won't want to report
> >>>>> the type id values - we'll want the name strings to be unique.
> >>>>>
> >>>>
> >>>> The 'name' might not be unique but type_id will be. For example that Neo
> >>>> pointed out in earlier discussion, virtual devices can come from two
> >>>> different physical devices, end user would be presented with what they
> >>>> had selected but there will be internal implementation differences. In
> >>>> that case 'type_id' will be unique.
> >>>>
> >>>
> >>> Hi, Kirti, my understanding is that Neo agreed to use an unique type
> >>> string (if you still called it <type id>), and then no need of additional
> >>> 'name' field which can be put inside 'description' field. See below quote:
> >>>
> >>
> >> We had internal discussions about this within NVIDIA and found that
> >> 'name' might not be unique where as 'type_id' would be unique. I'm
> >> refering to Neo's mail after that, where Neo do pointed that out.
> >>
> >> https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html
> >
> > Everyone not privy to those internal discussions, including me, seems to
> > think we dropped type_id and that if a vendor does not have a stable
> > name, they can compose some sort of stable type description based on the
> > name+id, or even vendor+id, ex. NVIDIA-11.  So please share why we
> > haven't managed to kill off type_id yet.  No matter what internal
> > representation each vendor driver has of "type_id" it seems possible
> > for it to come up with stable string to define a given configuration.
> 
> 
> The 'type_id' is unique and the 'name' are not, the name is just a
> virtual device name/ human readable name. Because at this moment Intel
> can't define a proper GPU class, we have to add a 'description' field
> there as well to represent the features of this virtual device, once we
> have all agreed with the GPU class and its mandatory attributes, the
> 'description' field can be removed. Here is an example,
> type_id/type_name = NVIDIA_11,
> name=M60-M0Q,
> description=2560x1600, 2 displays, 512MB"

As I commented earlier, I don't see how the above attributes can be
defined as mandatory:

- #displays is a concern only for VDI usage, where a remote user may
care about how many virtual displays can be used. What about using
vGPU in non-VDI usage, e.g. a purely media-transcoding case where
#displays means nothing? Then for media transcoding do we want to
further introduce attributes like H.265?

- framebuffer size (512MB) might make sense for a discrete card like
NVIDIA's. In your case the graphics memory is on-card, so the memory
size is critical to performance and users might want to know it. However,
for an integrated card like Intel's, we just use system memory as 'virtual'
graphics memory through GPU page tables. There is one global GPU page table
(GGTT) partitioned between vGPUs, but the majority of rendering happens
on per-process GPU page tables (PPGTT) which can be fully managed by
each VM. In this sense, the size of the GGTT resource has little performance
implication (mostly an indirect functional one, such as #displays). Users
cannot form clear expectations from it, so we don't plan to expose it.

> 
> Neo's previous comment only applies to the situation where we will have
> the GPU class or optional attributes defined and recognized by libvirt,
> since that is not going to happen any time soon, we will have to have
> the new 'description' field, and we don't want to have it mixed up with
> 'name' field.

As explained above, I don't see the necessity of defining a GPU class now,
at least not based on your examples (except resolution, which might be
one). If required in the future, it must include mandatory attributes
across virtualization usages and hardware vendors. Vendors can choose
to document the relationship between vGPU types and applicable virtualization
usages in their own user manuals.

> 
> We can definitely have something like name+id as Alex recommended to
> remove the 'name' field, but it will just require libvirt to have more
> logic to parse that string.
> 

In this manner I prefer the name+id style. libvirt doesn't need to
parse it, just treat the whole name+id as the unique type identification.
The 'description' field is kept for vendor-specific descriptive purposes,
where you can add the above attributes. libvirt doesn't need to parse it
either, just pass it up.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-12 21:44               ` Alex Williamson
@ 2016-10-13  9:22                 ` Kirti Wankhede
  2016-10-13 14:36                     ` [Qemu-devel] " Alex Williamson
  0 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-13  9:22 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Daniel P. Berrange, pbonzini, kraxel, cjia, Song,
	Jike, kvm, qemu-devel, bjsdjshi, Laine Stump



On 10/13/2016 3:14 AM, Alex Williamson wrote:
> On Thu, 13 Oct 2016 00:32:48 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 10/12/2016 9:29 PM, Alex Williamson wrote:
>>> On Wed, 12 Oct 2016 20:43:48 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 10/12/2016 7:22 AM, Tian, Kevin wrote:  
>>>>>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>>>>>> Sent: Wednesday, October 12, 2016 4:45 AM    
>>>>>>>> +* mdev_supported_types:
>>>>>>>> +    List of current supported mediated device types and its details are added
>>>>>>>> +in this directory in following format:
>>>>>>>> +
>>>>>>>> +|- <parent phy device>
>>>>>>>> +|--- Vendor-specific-attributes [optional]
>>>>>>>> +|--- mdev_supported_types
>>>>>>>> +|     |--- <type id>
>>>>>>>> +|     |   |--- create
>>>>>>>> +|     |   |--- name
>>>>>>>> +|     |   |--- available_instances
>>>>>>>> +|     |   |--- description /class
>>>>>>>> +|     |   |--- [devices]
>>>>>>>> +|     |--- <type id>
>>>>>>>> +|     |   |--- create
>>>>>>>> +|     |   |--- name
>>>>>>>> +|     |   |--- available_instances
>>>>>>>> +|     |   |--- description /class
>>>>>>>> +|     |   |--- [devices]
>>>>>>>> +|     |--- <type id>
>>>>>>>> +|          |--- create
>>>>>>>> +|          |--- name
>>>>>>>> +|          |--- available_instances
>>>>>>>> +|          |--- description /class
>>>>>>>> +|          |--- [devices]
>>>>>>>> +
>>>>>>>> +[TBD : description or class is yet to be decided. This will change.]    
>>>>>>>
>>>>>>> I thought that in previous discussions we had agreed to drop
>>>>>>> the <type id> concept and use the name as the unique identifier.
>>>>>>> When reporting these types in libvirt we won't want to report
>>>>>>> the type id values - we'll want the name strings to be unique.
>>>>>>>    
>>>>>>
>>>>>> The 'name' might not be unique but type_id will be. For example that Neo
>>>>>> pointed out in earlier discussion, virtual devices can come from two
>>>>>> different physical devices, end user would be presented with what they
>>>>>> had selected but there will be internal implementation differences. In
>>>>>> that case 'type_id' will be unique.
>>>>>>    
>>>>>
>>>>> Hi, Kirti, my understanding is that Neo agreed to use an unique type
>>>>> string (if you still called it <type id>), and then no need of additional
>>>>> 'name' field which can be put inside 'description' field. See below quote:
>>>>>     
>>>>
>>>> We had internal discussions about this within NVIDIA and found that
>>>> 'name' might not be unique where as 'type_id' would be unique. I'm
>>>> refering to Neo's mail after that, where Neo do pointed that out.
>>>>
>>>> https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html  
>>>
>>> Everyone not privy to those internal discussions, including me, seems to
>>> think we dropped type_id and that if a vendor does not have a stable
>>> name, they can compose some sort of stable type description based on the
>>> name+id, or even vendor+id, ex. NVIDIA-11.  So please share why we
>>> haven't managed to kill off type_id yet.  No matter what internal
>>> representation each vendor driver has of "type_id" it seems possible
>>> for it to come up with stable string to define a given configuration.  
>>
>>
>> The 'type_id' is unique and the 'name' are not, the name is just a
>> virtual device name/ human readable name. Because at this moment Intel
>> can't define a proper GPU class, we have to add a 'description' field
>> there as well to represent the features of this virtual device, once we
>> have all agreed with the GPU class and its mandatory attributes, the
>> 'description' field can be removed. Here is an example,
>> type_id/type_name = NVIDIA_11,
>> name=M60-M0Q,
>> description=2560x1600, 2 displays, 512MB"
>>
>> Neo's previous comment only applies to the situation where we will have
>> the GPU class or optional attributes defined and recognized by libvirt,
>> since that is not going to happen any time soon, we will have to have
>> the new 'description' field, and we don't want to have it mixed up with
>> 'name' field.
>>
>> We can definitely have something like name+id as Alex recommended to
>> remove the 'name' field, but it will just require libvirt to have more
>> logic to parse that string.
> 
> Let's use the mtty example driver provided in patch 5 so we can all
> more clearly see how the interfaces work.  I'll start from the
> beginning of my experience and work my way to the type/name thing.
> 

Thanks for looking into it and getting a feel for it. I hope this helps
to show that 'name' and 'type_id' are different.


> (please add a modules_install target to the Makefile)
>

This is an example and I feel it should not be installed in the
/lib/modules/../build path. It is meant for understanding the
interface and the flow of the mdev device management life cycle. Users
can use insmod to load the driver:

# insmod mtty.ko

> # modprobe mtty
> 
> Now what?  It seems like I need to have prior knowledge that this
> drivers supports mdev devices and I need to go hunt for them.  We need
> to create a class (ex. /sys/class/mdev/) where a user can find all the
> devices that participate in this mediated device infrastructure.  That
> would point me to /sys/devices/mtty.
> 

You can find devices registered to the mdev framework by searching for
the 'mdev_supported_types' directory at the leaf nodes of devices in
the /sys/devices directory. Yes, we can have an 'mdev' class with links
in /sys/class/mdev/ to the devices registered to the mdev framework.


> # tree /sys/devices/mtty
> /sys/devices/mtty
> |-- mdev_supported_types
> |   `-- mtty1
> |       |-- available_instances (1)
> |       |-- create
> |       |-- devices
> |       `-- name ("Dual-port-serial")
> |-- mtty_dev
> |   `-- sample_mtty_dev ("This is phy device")
> |-- power
> |   |-- async
> |   |-- autosuspend_delay_ms
> |   |-- control
> |   |-- runtime_active_kids
> |   |-- runtime_active_time
> |   |-- runtime_enabled
> |   |-- runtime_status
> |   |-- runtime_suspended_time
> |   `-- runtime_usage
> `-- uevent
> 
> Ok, but that was boring, we really need to have at least 2 supported
> types to validate the interface, so without changing the actual device
> backing, I pretended to have a single port vs dual port:
> 
> /sys/devices/mtty
> |-- mdev_supported_types
> |   |-- mtty1
> |   |   |-- available_instances (24)
> |   |   |-- create
> |   |   |-- devices
> |   |   `-- name (Single-port-serial)
> |   `-- mtty2
> |       |-- available_instances (12)
> |       |-- create
> |       |-- devices
> |       `-- name (Dual-port-serial)
> [snip]
> 
> I arbitrarily decided I have 24 ports and each single port uses 1 port
> and each dual port uses 2 ports.
> 
> Before I start creating devices, what are we going to key the libvirt
> XML on?  Can we do anything to prevent vendors from colliding or do we
> have any way to suggest meaningful and unique type_ids? 

Libvirt would have the parent and type_id in its XML. No two vendors can
own the same parent device, so I don't think vendors would collide even
with the same type_id, since the <parent, type_id> pair would always be
unique.
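As a concrete sketch of that pairing, a hypothetical helper could key device creation on the parent's sysfs path plus the type_id; the 'create' attribute follows the sysfs layout shown elsewhere in this thread:

```shell
# Hypothetical helper: create an mdev instance keyed by the unique
# <parent, type_id> pair. "parent" is the parent device's sysfs path.
create_mdev() {
    local parent="$1" type_id="$2" uuid
    uuid=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)
    echo "$uuid" > "$parent/mdev_supported_types/$type_id/create" || return 1
    echo "$uuid"
}

# Example (assuming the mtty sample driver is loaded):
# create_mdev /sys/devices/mtty mtty2
```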


> Presumably if
> we had a PCI device hosting this, we would be rooted at that parent
> device, ex. /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0.  Maybe
> the type_id should automatically be prefixed by the vendor module name,
> ex. mtty-1, i915-foo, nvidia-bar.  There's something missing for
> deterministically creating a "XYZ" device and knowing exactly what that
> means and finding a parent device that supports it.
> 

We can prefix type_id with the module name, i.e. using dev->driver->name,
but the <parent, type_id> pair is already unique, so I don't see much
benefit in doing that.


> Let's get to mdev creating...
> 
> # uuidgen > mdev_supported_types/mtty2/create
> # tree /sys/devices/mtty
> /sys/devices/mtty
> |-- e68189be-700e-41f7-93a3-b5351e79c470
> |   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
> |   |-- iommu_group -> ../../../kernel/iommu_groups/63
> |   |-- mtty2 -> ../mdev_supported_types/mtty2
> |   |-- power
> |   |   |-- async
> |   |   |-- autosuspend_delay_ms
> |   |   |-- control
> |   |   |-- runtime_active_kids
> |   |   |-- runtime_active_time
> |   |   |-- runtime_enabled
> |   |   |-- runtime_status
> |   |   |-- runtime_suspended_time
> |   |   `-- runtime_usage
> |   |-- remove
> |   |-- subsystem -> ../../../bus/mdev
> |   |-- uevent
> |   `-- vendor
> |       `-- sample_mdev_dev ("This is MDEV e68189be-700e-41f7-93a3-b5351e79c470")
> |-- mdev_supported_types
> |   |-- mtty1
> |   |   |-- available_instances (22)
> |   |   |-- create
> |   |   |-- devices
> |   |   `-- name
> |   `-- mtty2
> |       |-- available_instances (11)
> |       |-- create
> |       |-- devices
> |       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> ../../../e68189be-700e-41f7-93a3-b5351e79c470
> |       `-- name
> 
> The mdev device was created directly under the parent, which seems like
> it's going to get messy to me (ie. imagine dropping a bunch of uuids
> into a PCI parent device's sysfs directory, how does a user know what
> they are?).
> 

That is how devices are placed in sysfs. For example, take the devices below:

80:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
Root Port 1a (rev 07)
80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
Root Port 2a (rev 07)
80:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
Root Port 3a in PCI Express Mode (rev 07)
80:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
0 (rev 07)
80:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
1 (rev 07)
80:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
2 (rev 07)
80:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
3 (rev 07)
80:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
4 (rev 07)
80:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
5 (rev 07)
80:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
6 (rev 07)
80:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
7 (rev 07)
80:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address
Map, VTd_Misc, System Management (rev 07)
80:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control
Status and Global Errors (rev 07)
80:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)

In sysfs, those all sit in the same parent directory as their root port:

# ls /sys/devices/pci0000\:80/ -l
total 0
drwxr-xr-x 8 root root    0 Oct 13 12:08 0000:80:01.0
drwxr-xr-x 7 root root    0 Oct 13 12:08 0000:80:02.0
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:03.0
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.0
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.1
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.2
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.3
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.4
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.5
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.6
drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.7
drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.0
drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.2
drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.4
lrwxrwxrwx 1 root root    0 Oct 13 13:25 firmware_node ->
../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:01
drwxr-xr-x 3 root root    0 Oct 13 12:08 pci_bus
drwxr-xr-x 2 root root    0 Oct 13 12:08 power
-rw-r--r-- 1 root root 4096 Oct 13 13:25 uevent


> Under the device we have "mtty2", shouldn't that be
> "mdev_supported_type", which then links to mtty2?  Otherwise a user
> needs to decode from the link what this attribute is.
> 

I thought it should show the type, so that the user can find the type_id
just by looking at the 'ls' output.

> Also here's an example of those vendor sysfs entries per device.  So
> long as the vendor never expects a tool like libvirt to manipulate
> attributes there, I can see how that could be pretty powerful.
> 

Yes, it is good to have vendor-specific entries even if libvirt might not
report or use them. They would be most useful for a system admin to
manually get extra information that libvirt doesn't report.
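For instance, an admin could dump whatever vendor-specific attributes a device exposes with a few lines of shell. This is a sketch; the 'vendor' directory name follows the mtty sample shown in this thread and is not a fixed part of the mdev interface:

```shell
# Sketch: print every vendor-specific attribute of an mdev device,
# given its sysfs directory.
dump_vendor_attrs() {
    local dev="$1" f
    for f in "$dev"/vendor/*; do
        [ -f "$f" ] && printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
    done
}
```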


> Moving down to the mdev_supported_types, I've updated mtty so that it
> actually adjusts available instance, and we can now see a link under
> the devices for mtty2.
> 
> Also worth noting is that a link for the device appears
> in /sys/bus/mdev/devices.
> 
> BTW, specifying this device for QEMU vfio-pci is where the sysfsdev
> option comes into play:
> 
> -device
> vfio-pci,sysfsdev=/sys/devices/mtty/e68189be-700e-41f7-93a3-b5351e79c470
> 
> Which raises another question, we can tell through the vfio interfaces
> that this is exposed as a PCI device, by creating a container
> (open(/dev/vfio/vfio)), setting an iommu (ioctl(VFIO_SET_IOMMU)),
> adding the group to the container (ioctl(VFIO_GROUP_SET_CONTAINER)),
> getting the device (ioctl(VFIO_GROUP_GET_DEVICE_FD)), and finally
> getting the device info (ioctl(VFIO_DEVICE_GET_INFO)) and checking the
> flag bit that says the API is PCI.  That's a long path to go and has
> stumbling blocks like the type of iommu that's available for the given
> platform.  How do we make that manageable? 

Do you want the device type to be expressed in sysfs? Then that should be
done by the vendor driver. The vfio_mdev module is now a shim layer, so
neither the mdev core nor the vfio_mdev module knows what device type
flag the vendor driver has set.

Thanks,
Kirti

> I don't think we want to
> create some artificial relationship that the type of the parent
> necessarily match the type of the child mdev, we've already broken that
> with a simple mdev tty driver.
> 
> One more:
> 
> # uuidgen > mdev_supported_types/mtty1/create
> # tree /sys/devices/mtty
> /sys/devices/mtty
> |-- a7ae17d1-2de4-44c2-ae58-20ae0a0befe8
> |   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
> |   |-- iommu_group -> ../../../kernel/iommu_groups/64
> |   |-- mtty1 -> ../mdev_supported_types/mtty1
> |   |-- power
> [snip]
> |   |-- remove
> |   |-- subsystem -> ../../../bus/mdev
> |   |-- uevent
> |   `-- vendor
> |       `-- sample_mdev_dev ("This is MDEV a7ae17d1-2de4-44c2-ae58-20ae0a0befe8")
> |-- e68189be-700e-41f7-93a3-b5351e79c470
> |   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
> |   |-- iommu_group -> ../../../kernel/iommu_groups/63
> |   |-- mtty2 -> ../mdev_supported_types/mtty2
> |   |-- power
> [snip]
> |   |-- remove
> |   |-- subsystem -> ../../../bus/mdev
> |   |-- uevent
> |   `-- vendor
> |       `-- sample_mdev_dev ("This is MDEV e68189be-700e-41f7-93a3-b5351e79c470")
> |-- mdev_supported_types
> |   |-- mtty1
> |   |   |-- available_instances (21)
> |   |   |-- create
> |   |   |-- devices
> |   |   |   `-- a7ae17d1-2de4-44c2-ae58-20ae0a0befe8 -> ../../../a7ae17d1-2de4-44c2-ae58-20ae0a0befe8
> |   |   `-- name
> |   `-- mtty2
> |       |-- available_instances (10)
> |       |-- create
> |       |-- devices
> |       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> ../../../e68189be-700e-41f7-93a3-b5351e79c470
> |       `-- name
> 
> Hopefully as expected with the caveats for the first example.
> 
> # echo 1 > a7ae17d1-2de4-44c2-ae58-20ae0a0befe8/remove
> # echo 1 > e68189be-700e-41f7-93a3-b5351e79c470/remove
> 
> These do what they're supposed to, the devices are gone.
> 
> Ok, I've identified some issues, let's figure out how to resolve them.
> Thanks,
> 
> Alex
> 
> (hack multi-port mtty patch attached)
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 3/6] vfio iommu: Add support for mediated devices
  2016-10-11 22:06     ` [Qemu-devel] " Alex Williamson
@ 2016-10-13 14:34       ` Kirti Wankhede
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-13 14:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel, pbonzini, bjsdjshi



On 10/12/2016 3:36 AM, Alex Williamson wrote:
> On Tue, 11 Oct 2016 01:58:34 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>
...


>> +static struct vfio_group *vfio_group_from_dev(struct device *dev)
>> +{
>> +	struct vfio_device *device;
>> +	struct vfio_group *group;
>> +	int ret;
>> +
>> +	device = vfio_device_get_from_dev(dev);
>
> Note how this does dev->iommu_group->vfio_group->vfio_device and then
> we back out one level to get the vfio_group, it's not a terribly
> lightweight path.  Perhaps we should have:
>
> struct vfio_device *vfio_group_get_from_dev(struct device *dev)
> {
>         struct iommu_group *iommu_group;
>         struct vfio_group *group;
>
>         iommu_group = iommu_group_get(dev);
>         if (!iommu_group)
>                 return NULL;
>
>         group = vfio_group_get_from_iommu(iommu_group);
> 	iommu_group_put(iommu_group);
>
> 	return group;
> }
>
> vfio_device_get_from_dev() would make use of this.
>
> Then create a separate:
>
> static int vfio_group_add_container_user(struct vfio_group *group)
> {
>
>> +	if (!atomic_inc_not_zero(&group->container_users)) {
> 		return -EINVAL;
>> +	}
>> +
>> +	if (group->noiommu) {
>> +		atomic_dec(&group->container_users);
> 		return -EPERM;
>> +	}
>> +
>> +	if (!group->container->iommu_driver ||
>> +	    !vfio_group_viable(group)) {
>> +		atomic_dec(&group->container_users);
> 		return -EINVAL;
>> +	}
>> +
> 	return 0;
> }
>
> vfio_group_get_external_user() would be updated to use this.  In fact,
> creating these two functions and updating the existing code to use
> these should be a separate patch.
>

Ok. I'll update.


> Note that your version did not hold a group reference while doing the
> pin/unpin operations below, which seems like a bug.
>

container->group_lock is held for pin/unpin. I think we then don't have
to hold a reference to the group, because groups are attached and
detached while holding this lock, right?


>> +
>> +err_ret:
>> +	vfio_device_put(device);
>> +	return ERR_PTR(ret);
>> +}
>> +
>> +/*
>> + * Pin a set of guest PFNs and return their associated host PFNs for local
>> + * domain only.
>> + * @dev [in] : device
>> + * @user_pfn [in]: array of user/guest PFNs
>> + * @npage [in]: count of array elements
>> + * @prot [in] : protection flags
>> + * @phys_pfn[out] : array of host PFNs
>> + */
>> +long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
>> +		    long npage, int prot, unsigned long *phys_pfn)
>> +{
>> +	struct vfio_container *container;
>> +	struct vfio_group *group;
>> +	struct vfio_iommu_driver *driver;
>> +	ssize_t ret = -EINVAL;
>> +
>> +	if (!dev || !user_pfn || !phys_pfn)
>> +		return -EINVAL;
>> +
>> +	group = vfio_group_from_dev(dev);
>> +	if (IS_ERR(group))
>> +		return PTR_ERR(group);
>
> As suggested above:
>
> 	group = vfio_group_get_from_dev(dev);
> 	if (!group)
> 		return -ENODEV;
>
> 	ret = vfio_group_add_container_user(group);
> 	if (ret) {
> 		vfio_group_put(group);
> 		return ret;
> 	}
>

Ok.


>> +
>> +	container = group->container;
>> +	if (IS_ERR(container))
>> +		return PTR_ERR(container);
>> +
>> +	down_read(&container->group_lock);
>> +
>> +	driver = container->iommu_driver;
>> +	if (likely(driver && driver->ops->pin_pages))
>> +		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
>> +					     npage, prot, phys_pfn);
>> +
>> +	up_read(&container->group_lock);
>> +	vfio_group_try_dissolve_container(group);
>
> Even if you're considering that the container_user reference holds the
> driver, I think we need a group reference throughout this and this
> should end with a vfio_group_put(group);
>

Same as I mentioned above, container->group_lock is held here.

...

>> +
>> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
>> +					 long npage)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +	struct vfio_domain *domain = NULL;
>> +	long unlocked = 0;
>> +	int i;
>> +
>> +	if (!iommu || !pfn)
>> +		return -EINVAL;
>> +
>
> We need iommu->lock here, right?
>

Oh, yes.

>> +	domain = iommu->local_domain;
>> +
>> +	for (i = 0; i < npage; i++) {
>> +		struct vfio_pfn *p;
>> +
>> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +		/* verify if pfn exist in pfn_list */
>> +		p = vfio_find_pfn(domain, pfn[i]);
>> +		if (p)
>> +			unlocked += vfio_unpin_pfn(domain, p, true);
>> +
>> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>
> We hold this mutex outside the loop in the pin unwind case, why is it
> different here?
>

pin_unwind is the error path, so it should be done in one go.
Here this is not an error case; holding the lock for a long time could
block other threads if there are multiple threads.



>> +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
>> +			    size_t map_size)
>> +{
>> +	dma_addr_t iova = dma->iova;
>> +	unsigned long vaddr = dma->vaddr;
>> +	size_t size = map_size, dma_size = 0;
>> +	long npage;
>> +	unsigned long pfn;
>> +	int ret = 0;
>> +
>> +	while (size) {
>> +		/* Pin a contiguous chunk of memory */
>> +		npage = __vfio_pin_pages_remote(vaddr + dma_size,
>> +						size >> PAGE_SHIFT, dma->prot,
>> +						&pfn);
>> +		if (npage <= 0) {
>> +			WARN_ON(!npage);
>> +			ret = (int)npage;
>> +			break;
>> +		}
>> +
>> +		/* Map it! */
>> +		ret = vfio_iommu_map(iommu, iova + dma_size, pfn, npage,
>> +				     dma->prot);
>> +		if (ret) {
>> +			__vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
>> +			break;
>> +		}
>> +
>> +		size -= npage << PAGE_SHIFT;
>> +		dma_size += npage << PAGE_SHIFT;
>> +	}
>> +
>> +	if (ret)
>> +		vfio_remove_dma(iommu, dma);
>
>
> There's a bug introduced here, vfio_remove_dma() needs dma->size to be
> accurate to the point of failure, it's not updated until the success
> branch below, so it's never going to unmap/unpin anything.
>

Oops, yes. I'll fix this.

>> +	else {
>> +		dma->size = dma_size;
>> +		dma->iommu_mapped = true;
>> +		vfio_update_accounting(iommu, dma);
>
> I'm confused how this works, when called from vfio_dma_do_map() we're
> populating a vfio_dma, that is we're populating part of the iova space
> of the device.  How could we have pinned pfns in the local address
> space that overlap that?  It would be invalid to have such pinned pfns
> since that part of the iova space was not previously mapped.
>
> Another issue is that if there were existing overlaps, userspace would
> need to have locked memory limits sufficient for this temporary double
> accounting.  I'm not sure how they'd come up with heuristics to handle
> that since we're potentially looking at the bulk of VM memory in a
> single vfio_dma entry.
>

I see that when QEMU boots a VM where a vGPU device is attached first
and a pass-through device is attached later, pinning and iommu_map() are
skipped on the first call to vfio_dma_do_map(). Then, when the
pass-through device is attached, all mappings are unmapped and
vfio_dma_do_map() is called again. At that moment an IOMMU-capable
domain is present, so pinning and iommu_map() are done on all system
memory. If the vendor driver pins any pages between these two device
attaches, the accounting should be updated.


>> +	}
>> +
>> +	return ret;
>> +}
>> +
>>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  			   struct vfio_iommu_type1_dma_map *map)
>>  {
>>  	dma_addr_t iova = map->iova;
>>  	unsigned long vaddr = map->vaddr;
>>  	size_t size = map->size;
>> -	long npage;
>>  	int ret = 0, prot = 0;
>>  	uint64_t mask;
>>  	struct vfio_dma *dma;
>> -	unsigned long pfn;
>>
>>  	/* Verify that none of our __u64 fields overflow */
>>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>> @@ -611,29 +981,11 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	/* Insert zero-sized and grow as we map chunks of it */
>>  	vfio_link_dma(iommu, dma);
>>
>> -	while (size) {
>> -		/* Pin a contiguous chunk of memory */
>> -		npage = vfio_pin_pages(vaddr + dma->size,
>> -				       size >> PAGE_SHIFT, prot, &pfn);
>> -		if (npage <= 0) {
>> -			WARN_ON(!npage);
>> -			ret = (int)npage;
>> -			break;
>> -		}
>> -
>> -		/* Map it! */
>> -		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
>> -		if (ret) {
>> -			vfio_unpin_pages(pfn, npage, prot, true);
>> -			break;
>> -		}
>> -
>> -		size -= npage << PAGE_SHIFT;
>> -		dma->size += npage << PAGE_SHIFT;
>> -	}
>> -
>> -	if (ret)
>> -		vfio_remove_dma(iommu, dma);
>> +	/* Don't pin and map if container doesn't contain IOMMU capable domain */
>> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
>> +		dma->size = size;
>> +	else
>> +		ret = vfio_pin_map_dma(iommu, dma, size);
>>
>>  	mutex_unlock(&iommu->lock);
>>  	return ret;
>> @@ -662,10 +1014,6 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>  	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
>>  	n = rb_first(&iommu->dma_list);
>>
>> -	/* If there's not a domain, there better not be any mappings */
>> -	if (WARN_ON(n && !d))
>> -		return -EINVAL;
>> -
>>  	for (; n; n = rb_next(n)) {
>>  		struct vfio_dma *dma;
>>  		dma_addr_t iova;
>> @@ -674,20 +1022,43 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>  		iova = dma->iova;
>>
>>  		while (iova < dma->iova + dma->size) {
>> -			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
>> +			phys_addr_t phys;
>>  			size_t size;
>>
>> -			if (WARN_ON(!phys)) {
>> -				iova += PAGE_SIZE;
>> -				continue;
>> -			}
>> +			if (dma->iommu_mapped) {
>> +				phys = iommu_iova_to_phys(d->domain, iova);
>> +
>> +				if (WARN_ON(!phys)) {
>> +					iova += PAGE_SIZE;
>> +					continue;
>> +				}
>>
>> -			size = PAGE_SIZE;
>> +				size = PAGE_SIZE;
>>
>> -			while (iova + size < dma->iova + dma->size &&
>> -			       phys + size == iommu_iova_to_phys(d->domain,
>> +				while (iova + size < dma->iova + dma->size &&
>> +				    phys + size == iommu_iova_to_phys(d->domain,
>>  								 iova + size))
>> -				size += PAGE_SIZE;
>> +					size += PAGE_SIZE;
>> +			} else {
>> +				unsigned long pfn;
>> +				unsigned long vaddr = dma->vaddr +
>> +						     (iova - dma->iova);
>> +				size_t n = dma->iova + dma->size - iova;
>> +				long npage;
>> +
>> +				npage = __vfio_pin_pages_remote(vaddr,
>> +								n >> PAGE_SHIFT,
>> +								dma->prot,
>> +								&pfn);
>> +				if (npage <= 0) {
>> +					WARN_ON(!npage);
>> +					ret = (int)npage;
>> +					return ret;
>> +				}
>> +
>> +				phys = pfn << PAGE_SHIFT;
>> +				size = npage << PAGE_SHIFT;
>> +			}
>>
>>  			ret = iommu_map(domain->domain, iova, phys,
>>  					size, dma->prot | domain->prot);
>> @@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>
>>  			iova += size;
>>  		}
>> +
>> +		if (!dma->iommu_mapped) {
>> +			dma->iommu_mapped = true;
>> +			vfio_update_accounting(iommu, dma);
>> +		}
>
> This is the case where we potentially have pinned pfns and we've added
> an iommu mapped device and need to adjust accounting.  But we've fully
> pinned and accounted the entire iommu mapped space while still holding
> the accounting for any pfn mapped space.  So for a time, assuming some
> pfn pinned pages, we have duplicate accounting.  How does userspace
> deal with that?  For instance, if I'm using an mdev device where the
> vendor driver has pinned 512MB of guest memory, then I hot-add an
> assigned NIC and the entire VM address space gets pinned, that pinning
> will fail unless my locked memory limits are at least 512MB in excess
> of my VM size.  Additionally, the user doesn't know how much memory the
> vendor driver is going to pin, it might be the whole VM address space,
> so the user would need 2x the locked memory limits.
>

Is RLIMIT_MEMLOCK really set that low? But I get your point. I'll update
__vfio_pin_pages_remote() to check, within __vfio_pin_pages_remote()
itself, whether a page being pinned is already accounted.
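The arithmetic behind the double-accounting concern above can be sketched quickly; the figures are illustrative assumptions, not measured values:

```shell
# Illustrative arithmetic for the transient double accounting described
# above: both the vendor-driver pins and the full IOMMU-mapped VM space
# are accounted at the same time until the overlap is reconciled.
MiB=$((1024 * 1024))
vm_size=$((4096 * MiB))        # assumed guest RAM: 4 GiB
vendor_pinned=$((512 * MiB))   # assumed vendor-driver pins: 512 MiB

peak_locked=$((vm_size + vendor_pinned))
echo "peak locked MiB: $((peak_locked / MiB))"
```

So RLIMIT_MEMLOCK sized only for the VM itself would not cover the peak while both accountings are in effect.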


>>  	}
>>
>>  	return 0;
>> @@ -734,11 +1110,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>>  	__free_pages(pages, order);
>>  }
>>
>> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
>> +				   struct iommu_group *iommu_group)
>> +{
>> +	struct vfio_group *g;
>> +
>> +	list_for_each_entry(g, &domain->group_list, next) {
>> +		if (g->iommu_group == iommu_group)
>> +			return g;
>> +	}
>> +
>> +	return NULL;
>> +}
>
> It would make review easier if changes like splitting this into a
> separate function with no functional change on the calling path could
> be a separate patch.
>

OK.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v8 3/6] vfio iommu: Add support for mediated devices
@ 2016-10-13 14:34       ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-13 14:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi



On 10/12/2016 3:36 AM, Alex Williamson wrote:
> On Tue, 11 Oct 2016 01:58:34 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>
...


>> +static struct vfio_group *vfio_group_from_dev(struct device *dev)
>> +{
>> +	struct vfio_device *device;
>> +	struct vfio_group *group;
>> +	int ret;
>> +
>> +	device = vfio_device_get_from_dev(dev);
>
> Note how this does dev->iommu_group->vfio_group->vfio_device and then
> we back out one level to get the vfio_group, it's not a terribly
> lightweight path.  Perhaps we should have:
>
> struct vfio_device *vfio_group_get_from_dev(struct device *dev)
> {
>         struct iommu_group *iommu_group;
>         struct vfio_group *group;
>
>         iommu_group = iommu_group_get(dev);
>         if (!iommu_group)
>                 return NULL;
>
>         group = vfio_group_get_from_iommu(iommu_group);
> 	iommu_group_put(iommu_group);
>
> 	return group;
> }
>
> vfio_device_get_from_dev() would make use of this.
>
> Then create a separate:
>
> static int vfio_group_add_container_user(struct vfio_group *group)
> {
>
>> +	if (!atomic_inc_not_zero(&group->container_users)) {
> 		return -EINVAL;
>> +	}
>> +
>> +	if (group->noiommu) {
>> +		atomic_dec(&group->container_users);
> 		return -EPERM;
>> +	}
>> +
>> +	if (!group->container->iommu_driver ||
>> +	    !vfio_group_viable(group)) {
>> +		atomic_dec(&group->container_users);
> 		return -EINVAL;
>> +	}
>> +
> 	return 0;
> }
>
> vfio_group_get_external_user() would be updated to use this.  In fact,
> creating these two functions and updating the existing code to use
> these should be a separate patch.
>

Ok. I'll update.


> Note that your version did not hold a group reference while doing the
> pin/unpin operations below, which seems like a bug.
>

container->group_lock is held for pin/unpin. I think then we don't have
to hold the reference to group, because groups are attached and detached
holding this lock, right?


>> +
>> +err_ret:
>> +	vfio_device_put(device);
>> +	return ERR_PTR(ret);
>> +}
>> +
>> +/*
>> + * Pin a set of guest PFNs and return their associated host PFNs for
local
>> + * domain only.
>> + * @dev [in] : device
>> + * @user_pfn [in]: array of user/guest PFNs
>> + * @npage [in]: count of array elements
>> + * @prot [in] : protection flags
>> + * @phys_pfn[out] : array of host PFNs
>> + */
>> +long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
>> +		    long npage, int prot, unsigned long *phys_pfn)
>> +{
>> +	struct vfio_container *container;
>> +	struct vfio_group *group;
>> +	struct vfio_iommu_driver *driver;
>> +	ssize_t ret = -EINVAL;
>> +
>> +	if (!dev || !user_pfn || !phys_pfn)
>> +		return -EINVAL;
>> +
>> +	group = vfio_group_from_dev(dev);
>> +	if (IS_ERR(group))
>> +		return PTR_ERR(group);
>
> As suggested above:
>
> 	group = vfio_group_get_from_dev(dev);
> 	if (!group)
> 		return -ENODEV;
>
> 	ret = vfio_group_add_container_user(group)
> 	if (ret)
> 		vfio_group_put(group);
> 		return ret;
> 	}
>

Ok.


>> +
>> +	container = group->container;
>> +	if (IS_ERR(container))
>> +		return PTR_ERR(container);
>> +
>> +	down_read(&container->group_lock);
>> +
>> +	driver = container->iommu_driver;
>> +	if (likely(driver && driver->ops->pin_pages))
>> +		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
>> +					     npage, prot, phys_pfn);
>> +
>> +	up_read(&container->group_lock);
>> +	vfio_group_try_dissolve_container(group);
>
> Even if you're considering that the container_user reference holds the
> driver, I think we need a group reference throughout this and this
> should end with a vfio_group_put(group);
>

Same as I mentioned above, container->group_lock is held here.

...

>> +
>> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned
long *pfn,
>> +					 long npage)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +	struct vfio_domain *domain = NULL;
>> +	long unlocked = 0;
>> +	int i;
>> +
>> +	if (!iommu || !pfn)
>> +		return -EINVAL;
>> +
>
> We need iommu->lock here, right?
>

Oh, yes.

>> +	domain = iommu->local_domain;
>> +
>> +	for (i = 0; i < npage; i++) {
>> +		struct vfio_pfn *p;
>> +
>> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +		/* verify if pfn exist in pfn_list */
>> +		p = vfio_find_pfn(domain, pfn[i]);
>> +		if (p)
>> +			unlocked += vfio_unpin_pfn(domain, p, true);
>> +
>> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>
> We hold this mutex outside the loop in the pin unwind case, why is it
> different here?
>

pin_unwind is error condition, so should be done in one go.
Here this is not error case. Holding lock for long could block other
threads if there are multiple threads.



>> +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct
vfio_dma *dma,
>> +			    size_t map_size)
>> +{
>> +	dma_addr_t iova = dma->iova;
>> +	unsigned long vaddr = dma->vaddr;
>> +	size_t size = map_size, dma_size = 0;
>> +	long npage;
>> +	unsigned long pfn;
>> +	int ret = 0;
>> +
>> +	while (size) {
>> +		/* Pin a contiguous chunk of memory */
>> +		npage = __vfio_pin_pages_remote(vaddr + dma_size,
>> +						size >> PAGE_SHIFT, dma->prot,
>> +						&pfn);
>> +		if (npage <= 0) {
>> +			WARN_ON(!npage);
>> +			ret = (int)npage;
>> +			break;
>> +		}
>> +
>> +		/* Map it! */
>> +		ret = vfio_iommu_map(iommu, iova + dma_size, pfn, npage,
>> +				     dma->prot);
>> +		if (ret) {
>> +			__vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
>> +			break;
>> +		}
>> +
>> +		size -= npage << PAGE_SHIFT;
>> +		dma_size += npage << PAGE_SHIFT;
>> +	}
>> +
>> +	if (ret)
>> +		vfio_remove_dma(iommu, dma);
>
>
> There's a bug introduced here, vfio_remove_dma() needs dma->size to be
> accurate to the point of failure, it's not updated until the success
> branch below, so it's never going to unmap/unpin anything.
>

Ops, yes. I'll fix this.

>> +	else {
>> +		dma->size = dma_size;
>> +		dma->iommu_mapped = true;
>> +		vfio_update_accounting(iommu, dma);
>
> I'm confused how this works, when called from vfio_dma_do_map() we're
> populating a vfio_dma, that is we're populating part of the iova space
> of the device.  How could we have pinned pfns in the local address
> space that overlap that?  It would be invalid to have such pinned pfns
> since that part of the iova space was not previously mapped.
>
> Another issue is that if there were existing overlaps, userspace would
> need to have locked memory limits sufficient for this temporary double
> accounting.  I'm not sure how they'd come up with heuristics to handle
> that since we're potentially looking at the bulk of VM memory in a
> single vfio_dma entry.
>

I see that when QEMU boots a VM, in the case when first vGPU device is
attached and then pass through device is attached, then on first call to
vfio_dma_do_map(), pin and iommu_mmap is skipped. Then when a pass
through device is attached, all mappings are unmapped and then again
vfio_dma_do_map() is called. At this moment IOMMU capable domain is
present and so pin and iommu_mmap() on all sys mem is done. Now in
between these two device attach, if any pages are pinned by vendor
driver, then accounting should be updated.


>> +	}
>> +
>> +	return ret;
>> +}
>> +
>>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  			   struct vfio_iommu_type1_dma_map *map)
>>  {
>>  	dma_addr_t iova = map->iova;
>>  	unsigned long vaddr = map->vaddr;
>>  	size_t size = map->size;
>> -	long npage;
>>  	int ret = 0, prot = 0;
>>  	uint64_t mask;
>>  	struct vfio_dma *dma;
>> -	unsigned long pfn;
>>
>>  	/* Verify that none of our __u64 fields overflow */
>>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>> @@ -611,29 +981,11 @@ static int vfio_dma_do_map(struct vfio_iommu
*iommu,
>>  	/* Insert zero-sized and grow as we map chunks of it */
>>  	vfio_link_dma(iommu, dma);
>>
>> -	while (size) {
>> -		/* Pin a contiguous chunk of memory */
>> -		npage = vfio_pin_pages(vaddr + dma->size,
>> -				       size >> PAGE_SHIFT, prot, &pfn);
>> -		if (npage <= 0) {
>> -			WARN_ON(!npage);
>> -			ret = (int)npage;
>> -			break;
>> -		}
>> -
>> -		/* Map it! */
>> -		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
>> -		if (ret) {
>> -			vfio_unpin_pages(pfn, npage, prot, true);
>> -			break;
>> -		}
>> -
>> -		size -= npage << PAGE_SHIFT;
>> -		dma->size += npage << PAGE_SHIFT;
>> -	}
>> -
>> -	if (ret)
>> -		vfio_remove_dma(iommu, dma);
>> +	/* Don't pin and map if container doesn't contain IOMMU capable
domain*/
>> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
>> +		dma->size = size;
>> +	else
>> +		ret = vfio_pin_map_dma(iommu, dma, size);
>>
>>  	mutex_unlock(&iommu->lock);
>>  	return ret;
>> @@ -662,10 +1014,6 @@ static int vfio_iommu_replay(struct vfio_iommu
*iommu,
>>  	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
>>  	n = rb_first(&iommu->dma_list);
>>
>> -	/* If there's not a domain, there better not be any mappings */
>> -	if (WARN_ON(n && !d))
>> -		return -EINVAL;
>> -
>>  	for (; n; n = rb_next(n)) {
>>  		struct vfio_dma *dma;
>>  		dma_addr_t iova;
>> @@ -674,20 +1022,43 @@ static int vfio_iommu_replay(struct vfio_iommu
*iommu,
>>  		iova = dma->iova;
>>
>>  		while (iova < dma->iova + dma->size) {
>> -			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
>> +			phys_addr_t phys;
>>  			size_t size;
>>
>> -			if (WARN_ON(!phys)) {
>> -				iova += PAGE_SIZE;
>> -				continue;
>> -			}
>> +			if (dma->iommu_mapped) {
>> +				phys = iommu_iova_to_phys(d->domain, iova);
>> +
>> +				if (WARN_ON(!phys)) {
>> +					iova += PAGE_SIZE;
>> +					continue;
>> +				}
>>
>> -			size = PAGE_SIZE;
>> +				size = PAGE_SIZE;
>>
>> -			while (iova + size < dma->iova + dma->size &&
>> -			       phys + size == iommu_iova_to_phys(d->domain,
>> +				while (iova + size < dma->iova + dma->size &&
>> +				    phys + size == iommu_iova_to_phys(d->domain,
>>  								 iova + size))
>> -				size += PAGE_SIZE;
>> +					size += PAGE_SIZE;
>> +			} else {
>> +				unsigned long pfn;
>> +				unsigned long vaddr = dma->vaddr +
>> +						     (iova - dma->iova);
>> +				size_t n = dma->iova + dma->size - iova;
>> +				long npage;
>> +
>> +				npage = __vfio_pin_pages_remote(vaddr,
>> +								n >> PAGE_SHIFT,
>> +								dma->prot,
>> +								&pfn);
>> +				if (npage <= 0) {
>> +					WARN_ON(!npage);
>> +					ret = (int)npage;
>> +					return ret;
>> +				}
>> +
>> +				phys = pfn << PAGE_SHIFT;
>> +				size = npage << PAGE_SHIFT;
>> +			}
>>
>>  			ret = iommu_map(domain->domain, iova, phys,
>>  					size, dma->prot | domain->prot);
>> @@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>
>>  			iova += size;
>>  		}
>> +
>> +		if (!dma->iommu_mapped) {
>> +			dma->iommu_mapped = true;
>> +			vfio_update_accounting(iommu, dma);
>> +		}
>
> This is the case where we potentially have pinned pfns and we've added
> an iommu mapped device and need to adjust accounting.  But we've fully
> pinned and accounted the entire iommu mapped space while still holding
> the accounting for any pfn mapped space.  So for a time, assuming some
> pfn pinned pages, we have duplicate accounting.  How does userspace
> deal with that?  For instance, if I'm using an mdev device where the
> vendor driver has pinned 512MB of guest memory, then I hot-add an
> assigned NIC and the entire VM address space gets pinned, that pinning
> will fail unless my locked memory limits are at least 512MB in excess
> of my VM size.  Additionally, the user doesn't know how much memory the
> vendor driver is going to pin, it might be the whole VM address space,
> so the user would need 2x the locked memory limits.
>

Is RLIMIT_MEMLOCK really set that low? But I see your point. I'll update
__vfio_pin_pages_remote() to check whether the page being pinned has
already been accounted, so it is only charged once.


>>  	}
>>
>>  	return 0;
>> @@ -734,11 +1110,24 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>>  	__free_pages(pages, order);
>>  }
>>
>> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
>> +				   struct iommu_group *iommu_group)
>> +{
>> +	struct vfio_group *g;
>> +
>> +	list_for_each_entry(g, &domain->group_list, next) {
>> +		if (g->iommu_group == iommu_group)
>> +			return g;
>> +	}
>> +
>> +	return NULL;
>> +}
>
> It would make review easier if changes like splitting this into a
> separate function with no functional change on the calling path could
> be a separate patch.
>

OK.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-13  9:22                 ` Kirti Wankhede
@ 2016-10-13 14:36                     ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-13 14:36 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Song, Jike, kvm, Tian, Kevin, qemu-devel, cjia, kraxel,
	Laine Stump, pbonzini, bjsdjshi

On Thu, 13 Oct 2016 14:52:09 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/13/2016 3:14 AM, Alex Williamson wrote:
> > On Thu, 13 Oct 2016 00:32:48 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 10/12/2016 9:29 PM, Alex Williamson wrote:  
> >>> On Wed, 12 Oct 2016 20:43:48 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 10/12/2016 7:22 AM, Tian, Kevin wrote:    
> >>>>>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> >>>>>> Sent: Wednesday, October 12, 2016 4:45 AM      
> >>>>>>>> +* mdev_supported_types:
> >>>>>>>> +    List of current supported mediated device types and its details are added
> >>>>>>>> +in this directory in following format:
> >>>>>>>> +
> >>>>>>>> +|- <parent phy device>
> >>>>>>>> +|--- Vendor-specific-attributes [optional]
> >>>>>>>> +|--- mdev_supported_types
> >>>>>>>> +|     |--- <type id>
> >>>>>>>> +|     |   |--- create
> >>>>>>>> +|     |   |--- name
> >>>>>>>> +|     |   |--- available_instances
> >>>>>>>> +|     |   |--- description /class
> >>>>>>>> +|     |   |--- [devices]
> >>>>>>>> +|     |--- <type id>
> >>>>>>>> +|     |   |--- create
> >>>>>>>> +|     |   |--- name
> >>>>>>>> +|     |   |--- available_instances
> >>>>>>>> +|     |   |--- description /class
> >>>>>>>> +|     |   |--- [devices]
> >>>>>>>> +|     |--- <type id>
> >>>>>>>> +|          |--- create
> >>>>>>>> +|          |--- name
> >>>>>>>> +|          |--- available_instances
> >>>>>>>> +|          |--- description /class
> >>>>>>>> +|          |--- [devices]
> >>>>>>>> +
> >>>>>>>> +[TBD : description or class is yet to be decided. This will change.]      
> >>>>>>>
> >>>>>>> I thought that in previous discussions we had agreed to drop
> >>>>>>> the <type id> concept and use the name as the unique identifier.
> >>>>>>> When reporting these types in libvirt we won't want to report
> >>>>>>> the type id values - we'll want the name strings to be unique.
> >>>>>>>      
> >>>>>>
> >>>>>> The 'name' might not be unique but type_id will be. For example that Neo
> >>>>>> pointed out in earlier discussion, virtual devices can come from two
> >>>>>> different physical devices, end user would be presented with what they
> >>>>>> had selected but there will be internal implementation differences. In
> >>>>>> that case 'type_id' will be unique.
> >>>>>>      
> >>>>>
> >>>>> Hi, Kirti, my understanding is that Neo agreed to use an unique type
> >>>>> string (if you still called it <type id>), and then no need of additional
> >>>>> 'name' field which can be put inside 'description' field. See below quote:
> >>>>>       
> >>>>
> >>>> We had internal discussions about this within NVIDIA and found that
> >>>> 'name' might not be unique where as 'type_id' would be unique. I'm
> >>>> refering to Neo's mail after that, where Neo do pointed that out.
> >>>>
> >>>> https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html    
> >>>
> >>> Everyone not privy to those internal discussions, including me, seems to
> >>> think we dropped type_id and that if a vendor does not have a stable
> >>> name, they can compose some sort of stable type description based on the
> >>> name+id, or even vendor+id, ex. NVIDIA-11.  So please share why we
> >>> haven't managed to kill off type_id yet.  No matter what internal
> >>> representation each vendor driver has of "type_id" it seems possible
> >>> for it to come up with stable string to define a given configuration.    
> >>
> >>
> >> The 'type_id' is unique and the 'name' is not; the name is just a
> >> virtual device name/ human readable name. Because at this moment Intel
> >> can't define a proper GPU class, we have to add a 'description' field
> >> there as well to represent the features of this virtual device; once we
> >> have all agreed on the GPU class and its mandatory attributes, the
> >> 'description' field can be removed. Here is an example:
> >> type_id/type_name = NVIDIA_11,
> >> name=M60-M0Q,
> >> description="2560x1600, 2 displays, 512MB"
> >>
> >> Neo's previous comment only applies to the situation where we will have
> >> the GPU class or optional attributes defined and recognized by libvirt,
> >> since that is not going to happen any time soon, we will have to have
> >> the new 'description' field, and we don't want to have it mixed up with
> >> 'name' field.
> >>
> >> We can definitely have something like name+id as Alex recommended to
> >> remove the 'name' field, but it will just require libvirt to have more
> >> logic to parse that string.  
> > 
> > Let's use the mtty example driver provided in patch 5 so we can all
> > more clearly see how the interfaces work.  I'll start from the
> > beginning of my experience and work my way to the type/name thing.
> >   
> 
> Thanks for looking into it and getting a feel for it. I hope this
> helps to show that 'name' and 'type_id' are different.
> 
> 
> > (please add a modules_install target to the Makefile)
> >  
> 
> This is an example and I feel it should not be installed in the
> /lib/modules/../build path. It should be used to understand the
> interface and the flow of the mdev device management life cycle. Users
> can use insmod to load the driver:
> 
> # insmod mtty.ko

It's not built by default; that's sufficient.  Providing a
modules_install target makes it more accessible for testing and allows
easier testing of module dependencies with modprobe.  insmod does not
exercise the automatic module dependency loading.
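A modules_install target of the kind requested might look like this
(purely a sketch; it assumes the usual out-of-tree kbuild pattern, and
the KDIR default is an assumption, not what the sample Makefile does):

```make
# Hypothetical addition to the mtty sample Makefile; KDIR points at the
# kernel build directory for the running kernel unless overridden.
KDIR ?= /lib/modules/$(shell uname -r)/build

obj-m := mtty.o

default:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

modules_install:
	$(MAKE) -C $(KDIR) M=$(PWD) modules_install
```

After `make modules_install` and `depmod`, `modprobe mtty` would pull in
the mdev core and vfio_mdev dependencies automatically, which insmod
cannot test.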

> > # modprobe mtty
> > 
> > Now what?  It seems like I need to have prior knowledge that this
> > drivers supports mdev devices and I need to go hunt for them.  We need
> > to create a class (ex. /sys/class/mdev/) where a user can find all the
> > devices that participate in this mediated device infrastructure.  That
> > would point me to /sys/devices/mtty.
> >   
> 
> You can find devices registered to the mdev framework by searching for
> an 'mdev_supported_types' directory at the leaf nodes of devices in the
> /sys/devices directory. Yes, we can have an 'mdev' class with links to
> devices which are registered to the mdev framework in /sys/class/mdev/.
> 
> 
> > # tree /sys/devices/mtty
> > /sys/devices/mtty
> > |-- mdev_supported_types
> > |   `-- mtty1
> > |       |-- available_instances (1)
> > |       |-- create
> > |       |-- devices
> > |       `-- name ("Dual-port-serial")
> > |-- mtty_dev
> > |   `-- sample_mtty_dev ("This is phy device")
> > |-- power
> > |   |-- async
> > |   |-- autosuspend_delay_ms
> > |   |-- control
> > |   |-- runtime_active_kids
> > |   |-- runtime_active_time
> > |   |-- runtime_enabled
> > |   |-- runtime_status
> > |   |-- runtime_suspended_time
> > |   `-- runtime_usage
> > `-- uevent
> > 
> > Ok, but that was boring, we really need to have at least 2 supported
> > types to validate the interface, so without changing the actual device
> > backing, I pretended to have a single port vs dual port:
> > 
> > /sys/devices/mtty
> > |-- mdev_supported_types
> > |   |-- mtty1
> > |   |   |-- available_instances (24)
> > |   |   |-- create
> > |   |   |-- devices
> > |   |   `-- name (Single-port-serial)
> > |   `-- mtty2
> > |       |-- available_instances (12)
> > |       |-- create
> > |       |-- devices
> > |       `-- name (Dual-port-serial)
> > [snip]
> > 
> > I arbitrarily decided I have 24 ports; each single-port device uses 1
> > port and each dual-port device uses 2 ports.
> > 
> > Before I start creating devices, what are we going to key the libvirt
> > XML on?  Can we do anything to prevent vendors from colliding or do we
> > have any way to suggest meaningful and unique type_ids?   
> 
> Libvirt would have the parent and type_id in the XML. No two vendors
> can own the same parent device, so I don't think vendors would collide
> even with the same type_id, since the <parent, type_id> pair would
> always be unique.


We have a goal of supporting migration with mdev devices; Intel has
already shown this is possible.  Tying the XML representation of an
mdev device to a parent device is directly contradictory to that goal.
libvirt needs a token which is unique across vendors to be able to
instantiate an mdev device.  <parent, type_id> is unacceptable.
 
>  Presumably if
> > we had a PCI device hosting this, we would be rooted at that parent
> > device, ex. /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0.  Maybe
> > the type_id should automatically be prefixed by the vendor module name,
> > ex. mtty-1, i915-foo, nvidia-bar.  There's something missing for
> > deterministically creating a "XYZ" device and knowing exactly what that
> > means and finding a parent device that supports it.
> >   
> 
> We can prefix type_id with the module name, i.e. using
> dev->driver->name, but the <parent, type_id> pair is unique so I don't
> see much benefit in doing that.
> 
> 
> > Let's get to mdev creating...
> > 
> > # uuidgen > mdev_supported_types/mtty2/create
> > # tree /sys/devices/mtty
> > /sys/devices/mtty
> > |-- e68189be-700e-41f7-93a3-b5351e79c470
> > |   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
> > |   |-- iommu_group -> ../../../kernel/iommu_groups/63
> > |   |-- mtty2 -> ../mdev_supported_types/mtty2
> > |   |-- power
> > |   |   |-- async
> > |   |   |-- autosuspend_delay_ms
> > |   |   |-- control
> > |   |   |-- runtime_active_kids
> > |   |   |-- runtime_active_time
> > |   |   |-- runtime_enabled
> > |   |   |-- runtime_status
> > |   |   |-- runtime_suspended_time
> > |   |   `-- runtime_usage
> > |   |-- remove
> > |   |-- subsystem -> ../../../bus/mdev
> > |   |-- uevent
> > |   `-- vendor
> > |       `-- sample_mdev_dev ("This is MDEV e68189be-700e-41f7-93a3-b5351e79c470")
> > |-- mdev_supported_types
> > |   |-- mtty1
> > |   |   |-- available_instances (22)
> > |   |   |-- create
> > |   |   |-- devices
> > |   |   `-- name
> > |   `-- mtty2
> > |       |-- available_instances (11)
> > |       |-- create
> > |       |-- devices
> > |       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> ../../../e68189be-700e-41f7-93a3-b5351e79c470
> > |       `-- name
> > 
> > The mdev device was created directly under the parent, which seems like
> > it's going to get messy to me (ie. imagine dropping a bunch of uuids
> > into a PCI parent device's sysfs directory, how does a user know what
> > they are?).
> >   
> 
> That is the way devices are placed in sysfs. For example, the devices below:
> 
> 80:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
> Root Port 1a (rev 07)
> 80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
> Root Port 2a (rev 07)
> 80:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
> Root Port 3a in PCI Express Mode (rev 07)
> 80:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 0 (rev 07)
> 80:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 1 (rev 07)
> 80:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 2 (rev 07)
> 80:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 3 (rev 07)
> 80:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 4 (rev 07)
> 80:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 5 (rev 07)
> 80:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 6 (rev 07)
> 80:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 7 (rev 07)
> 80:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address
> Map, VTd_Misc, System Management (rev 07)
> 80:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control
> Status and Global Errors (rev 07)
> 80:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)
> 
> In sysfs, those are all in the same parent folder as their root port:
> 
> # ls /sys/devices/pci0000\:80/ -l
> total 0
> drwxr-xr-x 8 root root    0 Oct 13 12:08 0000:80:01.0
> drwxr-xr-x 7 root root    0 Oct 13 12:08 0000:80:02.0
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:03.0
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.0
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.1
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.2
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.3
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.4
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.5
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.6
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.7
> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.0
> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.2
> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.4
> lrwxrwxrwx 1 root root    0 Oct 13 13:25 firmware_node ->
> ../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:01
> drwxr-xr-x 3 root root    0 Oct 13 12:08 pci_bus
> drwxr-xr-x 2 root root    0 Oct 13 12:08 power
> -rw-r--r-- 1 root root 4096 Oct 13 13:25 uevent

I agree, that's also messy, but a PCI device has a standard PCI address
format, so just by looking at those you know it's a child PCI device.
We can write code that understands how to parse that and understands
what it is by the address.  On the other hand we're dropping uuids into
a directory.  We can write code that understands how to parse it, but
how do we know that it's a child device vs some other attribute for the
parent?

> > Under the device we have "mtty2", shouldn't that be
> > "mdev_supported_type", which then links to mtty2?  Otherwise a user
> > needs to decode from the link what this attribute is.
> >   
> 
> I thought it should show the type, so that by looking at the 'ls'
> output a user should be able to find the type_id.

The type_id should be shown by actually reading the link, not by the
link name itself, the same way that the iommu_group link for a device
isn't named after the group number; it links to the group number but
uses a standard link name.

> > Also here's an example of those vendor sysfs entries per device.  So
> > long as the vendor never expects a tool like libvirt to manipulate
> > attributes there, I can see how that could be pretty powerful.
> >   
> 
> Yes, it is good to have vendor-specific entries; libvirt might not
> report/use them. They would be more useful for a system admin to
> manually get extra information that libvirt doesn't report.
> 
> 
> > Moving down to the mdev_supported_types, I've updated mtty so that it
> > actually adjusts available instance, and we can now see a link under
> > the devices for mtty2.
> > 
> > Also worth noting is that a link for the device appears
> > in /sys/bus/mdev/devices.
> > 
> > BTW, specifying this device for QEMU vfio-pci is where the sysfsdev
> > option comes into play:
> > 
> > -device
> > vfio-pci,sysfsdev=/sys/devices/mtty/e68189be-700e-41f7-93a3-b5351e79c470
> > 
> > Which raises another question, we can tell through the vfio interfaces
> > that this is exposes as a PCI device, by creating a container
> > (open(/dev/vfio/vfio)), setting an iommu (ioctl(VFIO_SET_IOMMU)),
> > adding the group to the container (ioctl(VFIO_GROUP_SET_CONTAINER)),
> > getting the device (ioctl(VFIO_GROUP_GET_DEVICE_FD)), and finally
> > getting the device info (ioctl(VFIO_DEVICE_GET_INFO)) and checking the
> > flag bit that says the API is PCI.  That's a long path to go and has
> > stumbling blocks like the type of iommu that's available for the given
> > platform.  How do we make that manageable?   
> 
> Do you want the device type to be expressed in sysfs? Then that should
> be done from the vendor driver. The vfio_mdev module is now a shim
> layer, so neither the mdev core nor the vfio_mdev module knows what
> device type flag the vendor driver had set.

Right, the vendor driver would need to expose this, the mdev layers are
device agnostic, they don't know or care which device API is being
exposed.  The other question is whether it needs to be part of the
initial implementation or can we assume pci for now and add something
later.  I guess we already have our proof to the contrary with the
IBM ccw device that libvirt can't simply assume pci.  I see that many
devices in sysfs have a subsystem link, which seems rather appropriate,
but we're not creating a real pci device, so linking to /sys/bus/pci
or /sys/class/pci_bus both seem invalid.  Is that a dead end?  We could
always expose vfio_device_info.flags, but that seems pretty ugly as
well, plus the sysfs mdev interface is not vfio specific.  What if we
had a "device_api" attribute which the vendor driver would show as
"vfio-pci"?  Therefore the mdev interface is not tied to vfio, but we
clearly show that a given type_id clearly exports a vfio-pci
interface.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v8 4/6] docs: Add Documentation for Mediated devices
@ 2016-10-13 14:36                     ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-13 14:36 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Tian, Kevin, Daniel P. Berrange, pbonzini, kraxel, cjia, Song,
	Jike, kvm, qemu-devel, bjsdjshi, Laine Stump

On Thu, 13 Oct 2016 14:52:09 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/13/2016 3:14 AM, Alex Williamson wrote:
> > On Thu, 13 Oct 2016 00:32:48 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 10/12/2016 9:29 PM, Alex Williamson wrote:  
> >>> On Wed, 12 Oct 2016 20:43:48 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>     
> >>>> On 10/12/2016 7:22 AM, Tian, Kevin wrote:    
> >>>>>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> >>>>>> Sent: Wednesday, October 12, 2016 4:45 AM      
> >>>>>>>> +* mdev_supported_types:
> >>>>>>>> +    List of current supported mediated device types and its details are added
> >>>>>>>> +in this directory in following format:
> >>>>>>>> +
> >>>>>>>> +|- <parent phy device>
> >>>>>>>> +|--- Vendor-specific-attributes [optional]
> >>>>>>>> +|--- mdev_supported_types
> >>>>>>>> +|     |--- <type id>
> >>>>>>>> +|     |   |--- create
> >>>>>>>> +|     |   |--- name
> >>>>>>>> +|     |   |--- available_instances
> >>>>>>>> +|     |   |--- description /class
> >>>>>>>> +|     |   |--- [devices]
> >>>>>>>> +|     |--- <type id>
> >>>>>>>> +|     |   |--- create
> >>>>>>>> +|     |   |--- name
> >>>>>>>> +|     |   |--- available_instances
> >>>>>>>> +|     |   |--- description /class
> >>>>>>>> +|     |   |--- [devices]
> >>>>>>>> +|     |--- <type id>
> >>>>>>>> +|          |--- create
> >>>>>>>> +|          |--- name
> >>>>>>>> +|          |--- available_instances
> >>>>>>>> +|          |--- description /class
> >>>>>>>> +|          |--- [devices]
> >>>>>>>> +
> >>>>>>>> +[TBD : description or class is yet to be decided. This will change.]      
> >>>>>>>
> >>>>>>> I thought that in previous discussions we had agreed to drop
> >>>>>>> the <type id> concept and use the name as the unique identifier.
> >>>>>>> When reporting these types in libvirt we won't want to report
> >>>>>>> the type id values - we'll want the name strings to be unique.
> >>>>>>>      
> >>>>>>
> >>>>>> The 'name' might not be unique but type_id will be. For example that Neo
> >>>>>> pointed out in earlier discussion, virtual devices can come from two
> >>>>>> different physical devices, end user would be presented with what they
> >>>>>> had selected but there will be internal implementation differences. In
> >>>>>> that case 'type_id' will be unique.
> >>>>>>      
> >>>>>
> >>>>> Hi, Kirti, my understanding is that Neo agreed to use an unique type
> >>>>> string (if you still called it <type id>), and then no need of additional
> >>>>> 'name' field which can be put inside 'description' field. See below quote:
> >>>>>       
> >>>>
> >>>> We had internal discussions about this within NVIDIA and found that
> >>>> 'name' might not be unique where as 'type_id' would be unique. I'm
> >>>> refering to Neo's mail after that, where Neo do pointed that out.
> >>>>
> >>>> https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html    
> >>>
> >>> Everyone not privy to those internal discussions, including me, seems to
> >>> think we dropped type_id and that if a vendor does not have a stable
> >>> name, they can compose some sort of stable type description based on the
> >>> name+id, or even vendor+id, ex. NVIDIA-11.  So please share why we
> >>> haven't managed to kill off type_id yet.  No matter what internal
> >>> representation each vendor driver has of "type_id" it seems possible
> >>> for it to come up with stable string to define a given configuration.    
> >>
> >>
> >> The 'type_id' is unique and the 'name' are not, the name is just a
> >> virtual device name/ human readable name. Because at this moment Intel
> >> can't define a proper GPU class, we have to add a 'description' field
> >> there as well to represent the features of this virtual device, once we
> >> have all agreed with the GPU class and its mandatory attributes, the
> >> 'description' field can be removed. Here is an example,
> >> type_id/type_name = NVIDIA_11,
> >> name=M60-M0Q,
> >> description=2560x1600, 2 displays, 512MB"
> >>
> >> Neo's previous comment only applies to the situation where we will have
> >> the GPU class or optional attributes defined and recognized by libvirt,
> >> since that is not going to happen any time soon, we will have to have
> >> the new 'description' field, and we don't want to have it mixed up with
> >> 'name' field.
> >>
> >> We can definitely have something like name+id as Alex recommended to
> >> remove the 'name' field, but it will just require libvirt to have more
> >> logic to parse that string.  
> > 
> > Let's use the mtty example driver provided in patch 5 so we can all
> > more clearly see how the interfaces work.  I'll start from the
> > beginning of my experience and work my way to the type/name thing.
> >   
> 
> Thanks for looking into it and getting feel of it. And I hope this helps
> to understand that 'name' and 'type_id' are different.
> 
> 
> > (please add a modules_install target to the Makefile)
> >  
> 
> This is an example and I feel it should not be installed in
> /lib/modules/../build path. This should be used to understand the
> interface and the flow of mdev device management life cycle. User can
> use insmod to load driver:
> 
> # insmod mtty.ko

It's not built by default, that's sufficient.  Providing a
modules_install target makes it more accessible for testing and allows
easier testing of module dependencies with modprobe.  insmod does not
exercise the automatic module dependency loading.

> > # modprobe mtty
> > 
> > Now what?  It seems like I need to have prior knowledge that this
> > drivers supports mdev devices and I need to go hunt for them.  We need
> > to create a class (ex. /sys/class/mdev/) where a user can find all the
> > devices that participate in this mediated device infrastructure.  That
> > would point me to /sys/devices/mtty.
> >   
> 
> You can find devices registered to mdev framework by searching for
> 'mdev_supported_types' directory at the leaf nodes of devices in
> /sys/devices directory. Yes, we can have 'mdev' class and links to
> devices which are registered to mdev framework in /sys/class/mdev/.
> 
> 
> > # tree /sys/devices/mtty
> > /sys/devices/mtty
> > |-- mdev_supported_types
> > |   `-- mtty1
> > |       |-- available_instances (1)
> > |       |-- create
> > |       |-- devices
> > |       `-- name ("Dual-port-serial")
> > |-- mtty_dev
> > |   `-- sample_mtty_dev ("This is phy device")
> > |-- power
> > |   |-- async
> > |   |-- autosuspend_delay_ms
> > |   |-- control
> > |   |-- runtime_active_kids
> > |   |-- runtime_active_time
> > |   |-- runtime_enabled
> > |   |-- runtime_status
> > |   |-- runtime_suspended_time
> > |   `-- runtime_usage
> > `-- uevent
> > 
> > Ok, but that was boring, we really need to have at least 2 supported
> > types to validate the interface, so without changing the actual device
> > backing, I pretended to have a single port vs dual port:
> > 
> > /sys/devices/mtty
> > |-- mdev_supported_types
> > |   |-- mtty1
> > |   |   |-- available_instances (24)
> > |   |   |-- create
> > |   |   |-- devices
> > |   |   `-- name (Single-port-serial)
> > |   `-- mtty2
> > |       |-- available_instances (12)
> > |       |-- create
> > |       |-- devices
> > |       `-- name (Dual-port-serial)
> > [snip]
> > 
> > I arbitrarily decides I have 24 ports and each single port uses 1 port
> > and each dual port uses 2 ports.
> > 
> > Before I start creating devices, what are we going to key the libvirt
> > XML on?  Can we do anything to prevent vendors from colliding or do we
> > have any way to suggest meaningful and unique type_ids?   
> 
> Libvirt would have parent and type_id in XML. No two vendors can own
> same parent device. So I don't think vendors would collide even having
> same type_id, since <parent, type_id> pair would always be unique.


We have a goal of supporting migration with mdev devices, Intel has
already shown this is possible.  Tying the XML representation of an
mdev device to a parent device is directly contradictory to that goal.
libvirt needs a token which is unique across vendor to be able to
instantiate an mdev device.  <parent, type_id> is unacceptable.
 
>  Presumably if
> > we had a PCI device hosting this, we would be rooted at that parent
> > device, ex. /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0.  Maybe
> > the type_id should automatically be prefixed by the vendor module name,
> > ex. mtty-1, i915-foo, nvidia-bar.  There's something missing for
> > deterministically creating a "XYZ" device and knowing exactly what that
> > means and finding a parent device that supports it.
> >   
> 
> We can prefix type_id with module name, i.e using dev->driver->name, but
> <parent, type_id> pair is unique so I don't see much benefit in doing
> that.
> 
> 
> > Let's get to mdev creating...
> > 
> > # uuidgen > mdev_supported_types/mtty2/create
> > # tree /sys/devices/mtty
> > /sys/devices/mtty
> > |-- e68189be-700e-41f7-93a3-b5351e79c470
> > |   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
> > |   |-- iommu_group -> ../../../kernel/iommu_groups/63
> > |   |-- mtty2 -> ../mdev_supported_types/mtty2
> > |   |-- power
> > |   |   |-- async
> > |   |   |-- autosuspend_delay_ms
> > |   |   |-- control
> > |   |   |-- runtime_active_kids
> > |   |   |-- runtime_active_time
> > |   |   |-- runtime_enabled
> > |   |   |-- runtime_status
> > |   |   |-- runtime_suspended_time
> > |   |   `-- runtime_usage
> > |   |-- remove
> > |   |-- subsystem -> ../../../bus/mdev
> > |   |-- uevent
> > |   `-- vendor
> > |       `-- sample_mdev_dev ("This is MDEV e68189be-700e-41f7-93a3-b5351e79c470")
> > |-- mdev_supported_types
> > |   |-- mtty1
> > |   |   |-- available_instances (22)
> > |   |   |-- create
> > |   |   |-- devices
> > |   |   `-- name
> > |   `-- mtty2
> > |       |-- available_instances (11)
> > |       |-- create
> > |       |-- devices
> > |       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> ../../../e68189be-700e-41f7-93a3-b5351e79c470
> > |       `-- name
> > 
> > The mdev device was created directly under the parent, which seems like
> > it's going to get messy to me (ie. imagine dropping a bunch of uuids
> > into a PCI parent device's sysfs directory, how does a user know what
> > they are?).
> >   
> 
> That is the way devices are placed in sysfs. For example below devices:
> 
> 80:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
> Root Port 1a (rev 07)
> 80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
> Root Port 2a (rev 07)
> 80:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
> Root Port 3a in PCI Express Mode (rev 07)
> 80:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 0 (rev 07)
> 80:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 1 (rev 07)
> 80:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 2 (rev 07)
> 80:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 3 (rev 07)
> 80:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 4 (rev 07)
> 80:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 5 (rev 07)
> 80:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 6 (rev 07)
> 80:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
> 7 (rev 07)
> 80:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address
> Map, VTd_Misc, System Management (rev 07)
> 80:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control
> Status and Global Errors (rev 07)
> 80:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)
> 
> In sysfs, those are in same parent folder of its parent root port:
> 
> # ls /sys/devices/pci0000\:80/ -l
> total 0
> drwxr-xr-x 8 root root    0 Oct 13 12:08 0000:80:01.0
> drwxr-xr-x 7 root root    0 Oct 13 12:08 0000:80:02.0
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:03.0
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.0
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.1
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.2
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.3
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.4
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.5
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.6
> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.7
> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.0
> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.2
> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.4
> lrwxrwxrwx 1 root root    0 Oct 13 13:25 firmware_node ->
> ../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:01
> drwxr-xr-x 3 root root    0 Oct 13 12:08 pci_bus
> drwxr-xr-x 2 root root    0 Oct 13 12:08 power
> -rw-r--r-- 1 root root 4096 Oct 13 13:25 uevent

I agree, that's also messy, but a PCI device has a standard PCI address
format, so just by looking at those you know it's a child PCI device.
We can write code that understands how to parse that and understands
what it is by the address.  On the other hand we're dropping uuids into
a directory.  We can write code that understands how to parse it, but
how do we know that it's a child device vs some other attribute for the
parent?

> > Under the device we have "mtty2", shouldn't that be
> > "mdev_supported_type", which then links to mtty2?  Otherwise a user
> > needs to decode from the link what this attribute is.
> >   
> 
> I thought it should show the type, so that just by looking at the 'ls'
> output a user would be able to find the type_id.

The type_id should be shown by actually reading the link, not by the
link name itself, the same way that the iommu_group link for a device
isn't the group number, it links to the group number but uses a
standard link name.
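
The iommu_group analogy above can be sketched with throwaway paths; the
fixed link name "mdev_type" used here is purely hypothetical, chosen to
mirror the mtty example, and the type is recovered by reading the link
target rather than from the link name:

```shell
# Sketch only: all paths and the "mdev_type" link name are hypothetical.
tmp=$(mktemp -d)
mkdir -p "$tmp/parent/mdev_supported_types/mtty2"
mkdir -p "$tmp/parent/e68189be-700e-41f7-93a3-b5351e79c470"
# Fixed link name under the device; the type_id comes from the target,
# the same way iommu_group links work:
ln -s ../mdev_supported_types/mtty2 \
      "$tmp/parent/e68189be-700e-41f7-93a3-b5351e79c470/mdev_type"
readlink "$tmp/parent/e68189be-700e-41f7-93a3-b5351e79c470/mdev_type" \
      | xargs basename        # prints: mtty2
rm -rf "$tmp"
```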

> > Also here's an example of those vendor sysfs entries per device.  So
> > long as the vendor never expects a tool like libvirt to manipulate
> > attributes there, I can see how that could be pretty powerful.
> >   
> 
> Yes, it is good to have vendor-specific entries; libvirt might not
> report or use them. They would be most useful for a system admin to
> get extra information manually that libvirt doesn't report.
> 
> 
> > Moving down to the mdev_supported_types, I've updated mtty so that it
> > actually adjusts available instance, and we can now see a link under
> > the devices for mtty2.
> > 
> > Also worth noting is that a link for the device appears
> > in /sys/bus/mdev/devices.
> > 
> > BTW, specifying this device for QEMU vfio-pci is where the sysfsdev
> > option comes into play:
> > 
> > -device
> > vfio-pci,sysfsdev=/sys/devices/mtty/e68189be-700e-41f7-93a3-b5351e79c470
> > 
> > Which raises another question, we can tell through the vfio interfaces
> > that this is exposed as a PCI device, by creating a container
> > (open(/dev/vfio/vfio)), setting an iommu (ioctl(VFIO_SET_IOMMU)),
> > adding the group to the container (ioctl(VFIO_GROUP_SET_CONTAINER)),
> > getting the device (ioctl(VFIO_GROUP_GET_DEVICE_FD)), and finally
> > getting the device info (ioctl(VFIO_DEVICE_GET_INFO)) and checking the
> > flag bit that says the API is PCI.  That's a long path to go and has
> > stumbling blocks like the type of iommu that's available for the given
> > platform.  How do we make that manageable?   
> 
> Do you want device type to be expressed in sysfs? Then that should be
> done from vendor driver. vfio_mdev module is now a shim layer, so mdev
> core or vfio_mdev module don't know what device type flag vendor driver
> had set.

Right, the vendor driver would need to expose this, the mdev layers are
device agnostic, they don't know or care which device API is being
exposed.  The other question is whether it needs to be part of the
initial implementation or can we assume pci for now and add something
later.  I guess we already have our proof to the contrary with the
IBM ccw device that libvirt can't simply assume pci.  I see that many
devices in sysfs have a subsystem link, which seems rather appropriate,
but we're not creating a real pci device, so linking to /sys/bus/pci
or /sys/class/pci_bus both seem invalid.  Is that a dead end?  We could
always expose vfio_device_info.flags, but that seems pretty ugly as
well, plus the sysfs mdev interface is not vfio specific.  What if we
had a "device_api" attribute which the vendor driver would show as
"vfio-pci"?  Therefore the mdev interface is not tied to vfio, but we
clearly show that a given type_id clearly exports a vfio-pci
interface.  Thanks,

Alex
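
The long ioctl path quoted above boils down to a single flag test once
VFIO_DEVICE_GET_INFO succeeds. A minimal userspace sketch, with the
struct layout and flag value copied from <linux/vfio.h> as of this
kernel (the ioctl chain itself is only shown in the comment, since it
needs a real group and container to run):

```c
#include <stdint.h>

/* Mirrors struct vfio_device_info from <linux/vfio.h>. */
struct vfio_device_info {
	uint32_t argsz;
	uint32_t flags;
	uint32_t num_regions;
	uint32_t num_irqs;
};

#define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device API */

/*
 * The path described above, condensed (error handling elided):
 *   container = open("/dev/vfio/vfio", O_RDWR);
 *   group     = open("/dev/vfio/<group#>", O_RDWR);
 *   ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
 *   ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);
 *   device    = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "<sysfs name>");
 *   ioctl(device, VFIO_DEVICE_GET_INFO, &info);
 * Only after all of that can userspace test the device API:
 */
static int device_api_is_pci(const struct vfio_device_info *info)
{
	return (info->flags & VFIO_DEVICE_FLAGS_PCI) != 0;
}
```

A sysfs "device_api" attribute would let a tool like libvirt read one
string instead of walking this whole chain.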

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-13 14:36                     ` [Qemu-devel] " Alex Williamson
@ 2016-10-13 16:00                       ` Paolo Bonzini
  -1 siblings, 0 replies; 73+ messages in thread
From: Paolo Bonzini @ 2016-10-13 16:00 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: Song, Jike, cjia, kvm, Tian, Kevin, qemu-devel, kraxel,
	Laine Stump, bjsdjshi



On 13/10/2016 16:36, Alex Williamson wrote:
>> > 
>> > Libvirt would have parent and type_id in XML. No two vendors can own
>> > same parent device. So I don't think vendors would collide even having
>> > same type_id, since <parent, type_id> pair would always be unique.
> 
> We have a goal of supporting migration with mdev devices, Intel has
> already shown this is possible.  Tying the XML representation of an
> mdev device to a parent device is directly contradictory to that goal.
> libvirt needs a token which is unique across vendors to be able to
> instantiate an mdev device.  <parent, type_id> is unacceptable.

Would the vGPU's PCI vendor and device ID be acceptable and unique?

Paolo

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-13 16:00                       ` [Qemu-devel] " Paolo Bonzini
  (?)
@ 2016-10-13 16:30                       ` Alex Williamson
  -1 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-13 16:30 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kirti Wankhede, Tian, Kevin, Daniel P. Berrange, kraxel, cjia,
	Song, Jike, kvm, qemu-devel, bjsdjshi, Laine Stump

On Thu, 13 Oct 2016 18:00:07 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 13/10/2016 16:36, Alex Williamson wrote:
> >> > 
> >> > Libvirt would have parent and type_id in XML. No two vendors can own
> >> > same parent device. So I don't think vendors would collide even having
> >> > same type_id, since <parent, type_id> pair would always be unique.  
> > 
> > We have a goal of supporting migration with mdev devices, Intel has
> > already shown this is possible.  Tying the XML representation of an
> > mdev device to a parent device is directly contradictory to that goal.
> > libvirt needs a token which is unique across vendor to be able to
> > instantiate an mdev device.  <parent, type_id> is unacceptable.  
> 
> Would the vGPU's PCI vendor and device ID be acceptable and unique?

No, the PCI vendor:device ID doesn't fully describe the device, just
look at a given GeForce configuration on the market today, a single
NVIDIA PCI device ID can have different clocks, different memory sizes,
different cooling profiles, different output ports, etc. configurable
by the card vendor.  An mdev vGPU can be the same.  The type_id needs
to encompass the entire virtual hardware configuration of the device.

Personally I think we should create a type_id that prepends the vendor
driver name, giving each vendor a unique namespace, and then let the
vendor driver manage the rest.  For example, an "nvidia-xyz" type_id
should define a unique mdev configuration that may be implemented
across multiple physical hardware SKUs.  libvirt xml (with
managed='yes') would list an "nvidia-xyz" device as required for the VM
and search the host for an mdev parent device capable of creating such a
device.  Based on utilization, locality, or whatever parameters it
determines, libvirt would create (or re-use, if a pool) an instance of
that type_id.  Specifying a specific parent might be something we want
to allow to give the user full placement control, but I don't think it
should be required based on the sysfs API we provide to the user.
Thanks,

Alex
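
The vendor-prefix scheme described above can be sketched as a tiny
parser. The helper name and the split-at-first-dash rule are assumptions
for illustration only, not a settled format:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: split a namespaced type_id such as "nvidia-xyz"
 * at the first '-', copying the vendor prefix into buf.  Returns the
 * vendor-private remainder, or NULL if no separator is found.
 */
static const char *type_id_vendor(const char *type_id, char *buf, size_t len)
{
	const char *sep = strchr(type_id, '-');
	size_t n;

	if (!sep || len == 0)
		return NULL;
	n = (size_t)(sep - type_id);
	if (n >= len)
		n = len - 1;		/* truncate overlong prefixes */
	memcpy(buf, type_id, n);
	buf[n] = '\0';
	return sep + 1;			/* e.g. "xyz" for "nvidia-xyz" */
}
```

With this, libvirt could match the "nvidia" namespace without having to
understand the vendor-private part of the identifier.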

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 3/6] vfio iommu: Add support for mediated devices
  2016-10-13 14:34       ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-13 17:12         ` Alex Williamson
  -1 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-13 17:12 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel, pbonzini, bjsdjshi

On Thu, 13 Oct 2016 20:04:43 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/12/2016 3:36 AM, Alex Williamson wrote:
> > On Tue, 11 Oct 2016 01:58:34 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >  
> ...
> 
> 
> >> +static struct vfio_group *vfio_group_from_dev(struct device *dev)
> >> +{
> >> +	struct vfio_device *device;
> >> +	struct vfio_group *group;
> >> +	int ret;
> >> +
> >> +	device = vfio_device_get_from_dev(dev);  
> >
> > Note how this does dev->iommu_group->vfio_group->vfio_device and then
> > we back out one level to get the vfio_group, it's not a terribly
> > lightweight path.  Perhaps we should have:
> >
> > struct vfio_group *vfio_group_get_from_dev(struct device *dev)
> > {
> >         struct iommu_group *iommu_group;
> >         struct vfio_group *group;
> >
> >         iommu_group = iommu_group_get(dev);
> >         if (!iommu_group)
> >                 return NULL;
> >
> >         group = vfio_group_get_from_iommu(iommu_group);
> > 	iommu_group_put(iommu_group);
> >
> > 	return group;
> > }
> >
> > vfio_device_get_from_dev() would make use of this.
> >
> > Then create a separate:
> >
> > static int vfio_group_add_container_user(struct vfio_group *group)
> > {
> >  
> >> +	if (!atomic_inc_not_zero(&group->container_users)) {  
> > 		return -EINVAL;  
> >> +	}
> >> +
> >> +	if (group->noiommu) {
> >> +		atomic_dec(&group->container_users);  
> > 		return -EPERM;  
> >> +	}
> >> +
> >> +	if (!group->container->iommu_driver ||
> >> +	    !vfio_group_viable(group)) {
> >> +		atomic_dec(&group->container_users);  
> > 		return -EINVAL;  
> >> +	}
> >> +  
> > 	return 0;
> > }
> >
> > vfio_group_get_external_user() would be updated to use this.  In fact,
> > creating these two functions and updating the existing code to use
> > these should be a separate patch.
> >  
> 
> Ok. I'll update.
> 
> 
> > Note that your version did not hold a group reference while doing the
> > pin/unpin operations below, which seems like a bug.
> >  
> 
> container->group_lock is held for pin/unpin. I think we then don't have
> to hold a reference to the group, because groups are attached and
> detached while holding this lock, right?
> 
> 
> >> +
> >> +err_ret:
> >> +	vfio_device_put(device);
> >> +	return ERR_PTR(ret);
> >> +}
> >> +
> >> +/*
> >> + * Pin a set of guest PFNs and return their associated host PFNs for  
> local
> >> + * domain only.
> >> + * @dev [in] : device
> >> + * @user_pfn [in]: array of user/guest PFNs
> >> + * @npage [in]: count of array elements
> >> + * @prot [in] : protection flags
> >> + * @phys_pfn[out] : array of host PFNs
> >> + */
> >> +long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
> >> +		    long npage, int prot, unsigned long *phys_pfn)
> >> +{
> >> +	struct vfio_container *container;
> >> +	struct vfio_group *group;
> >> +	struct vfio_iommu_driver *driver;
> >> +	ssize_t ret = -EINVAL;
> >> +
> >> +	if (!dev || !user_pfn || !phys_pfn)
> >> +		return -EINVAL;
> >> +
> >> +	group = vfio_group_from_dev(dev);
> >> +	if (IS_ERR(group))
> >> +		return PTR_ERR(group);  
> >
> > As suggested above:
> >
> > 	group = vfio_group_get_from_dev(dev);
> > 	if (!group)
> > 		return -ENODEV;
> >
> > 	ret = vfio_group_add_container_user(group);
> > 	if (ret) {
> > 		vfio_group_put(group);
> > 		return ret;
> > 	}
> >  
> 
> Ok.
> 
> 
> >> +
> >> +	container = group->container;
> >> +	if (IS_ERR(container))
> >> +		return PTR_ERR(container);
> >> +
> >> +	down_read(&container->group_lock);
> >> +
> >> +	driver = container->iommu_driver;
> >> +	if (likely(driver && driver->ops->pin_pages))
> >> +		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
> >> +					     npage, prot, phys_pfn);
> >> +
> >> +	up_read(&container->group_lock);
> >> +	vfio_group_try_dissolve_container(group);  
> >
> > Even if you're considering that the container_user reference holds the
> > driver, I think we need a group reference throughout this and this
> > should end with a vfio_group_put(group);
> >  
> 
> Same as I mentioned above, container->group_lock is held here.

What allows you to assume that your @group pointer is valid when you
finish with vfio_group_try_dissolve_container()?  You have no reference
to the group, the only device reference you have is the struct device,
not the vfio_device, so that might have been unbound from vfio.  I'm
still inclined to believe you need to hold the reference to the group.

> >> +
> >> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned  
> long *pfn,
> >> +					 long npage)
> >> +{
> >> +	struct vfio_iommu *iommu = iommu_data;
> >> +	struct vfio_domain *domain = NULL;
> >> +	long unlocked = 0;
> >> +	int i;
> >> +
> >> +	if (!iommu || !pfn)
> >> +		return -EINVAL;
> >> +  
> >
> > We need iommu->lock here, right?
> >  
> 
> Oh, yes.
> 
> >> +	domain = iommu->local_domain;
> >> +
> >> +	for (i = 0; i < npage; i++) {
> >> +		struct vfio_pfn *p;
> >> +
> >> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
> >> +
> >> +		/* verify if pfn exist in pfn_list */
> >> +		p = vfio_find_pfn(domain, pfn[i]);
> >> +		if (p)
> >> +			unlocked += vfio_unpin_pfn(domain, p, true);
> >> +
> >> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);  
> >
> > We hold this mutex outside the loop in the pin unwind case, why is it
> > different here?
> >  
> 
> pin_unwind is an error path, so it should be done in one go. This is
> not an error case; holding the lock for a long time could block other
> threads when multiple threads are active.


Ok, iommu->lock will need to be inside the loop then too or else
there's likely no gain anyway.
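
The resulting locking shape can be sketched with a toy model: both locks
are taken and released per iteration, so concurrent pin/unpin callers
are not starved during a long unpin. The structure and function names
below are illustrative stand-ins, not the real vfio types:

```c
#include <pthread.h>

struct toy_iommu {
	pthread_mutex_t lock;		/* stands in for iommu->lock */
	pthread_mutex_t pfn_list_lock;	/* stands in for pfn_list_lock */
	long unpinned;
};

static long toy_unpin_pages(struct toy_iommu *iommu,
			    const unsigned long *pfn, long npage)
{
	long i;

	(void)pfn;	/* the real code would look up pfn[i] in a list */

	for (i = 0; i < npage; i++) {
		/* both locks scoped to one element, as discussed above */
		pthread_mutex_lock(&iommu->lock);
		pthread_mutex_lock(&iommu->pfn_list_lock);
		iommu->unpinned++;	/* lookup + unpin happens here */
		pthread_mutex_unlock(&iommu->pfn_list_lock);
		pthread_mutex_unlock(&iommu->lock);
	}
	return iommu->unpinned;
}
```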


> >> +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct  
> vfio_dma *dma,
> >> +			    size_t map_size)
> >> +{
> >> +	dma_addr_t iova = dma->iova;
> >> +	unsigned long vaddr = dma->vaddr;
> >> +	size_t size = map_size, dma_size = 0;
> >> +	long npage;
> >> +	unsigned long pfn;
> >> +	int ret = 0;
> >> +
> >> +	while (size) {
> >> +		/* Pin a contiguous chunk of memory */
> >> +		npage = __vfio_pin_pages_remote(vaddr + dma_size,
> >> +						size >> PAGE_SHIFT, dma->prot,
> >> +						&pfn);
> >> +		if (npage <= 0) {
> >> +			WARN_ON(!npage);
> >> +			ret = (int)npage;
> >> +			break;
> >> +		}
> >> +
> >> +		/* Map it! */
> >> +		ret = vfio_iommu_map(iommu, iova + dma_size, pfn, npage,
> >> +				     dma->prot);
> >> +		if (ret) {
> >> +			__vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
> >> +			break;
> >> +		}
> >> +
> >> +		size -= npage << PAGE_SHIFT;
> >> +		dma_size += npage << PAGE_SHIFT;
> >> +	}
> >> +
> >> +	if (ret)
> >> +		vfio_remove_dma(iommu, dma);  
> >
> >
> > There's a bug introduced here, vfio_remove_dma() needs dma->size to be
> > accurate to the point of failure, it's not updated until the success
> > branch below, so it's never going to unmap/unpin anything.
> >  
> 
> Ops, yes. I'll fix this.
> 
> >> +	else {
> >> +		dma->size = dma_size;
> >> +		dma->iommu_mapped = true;
> >> +		vfio_update_accounting(iommu, dma);  
> >
> > I'm confused how this works, when called from vfio_dma_do_map() we're
> > populating a vfio_dma, that is we're populating part of the iova space
> > of the device.  How could we have pinned pfns in the local address
> > space that overlap that?  It would be invalid to have such pinned pfns
> > since that part of the iova space was not previously mapped.
> >
> > Another issue is that if there were existing overlaps, userspace would
> > need to have locked memory limits sufficient for this temporary double
> > accounting.  I'm not sure how they'd come up with heuristics to handle
> > that since we're potentially looking at the bulk of VM memory in a
> > single vfio_dma entry.
> >  
> 
> I see that when QEMU boots a VM where a vGPU device is attached first
> and a pass-through device is attached afterwards, the first call to
> vfio_dma_do_map() skips pinning and iommu mapping. When the
> pass-through device is then attached, all mappings are unmapped and
> vfio_dma_do_map() is called again. At that moment an IOMMU capable
> domain is present, so pinning and iommu mapping of all system memory
> is done. If, in between these two device attaches, any pages were
> pinned by the vendor driver, then the accounting should be updated.

So that actually points out something that was on my todo list to check
in this patch, when an unmap occurs, we need to invalidate the vendor
driver mappings.  For that period you describe above, the mappings the
vendor driver holds are invalid, we cannot assume that they will
return and certainly cannot assume they will have the same GPA to HVA
mapping.  So the sequence should be that the unmap causes invalidation
of any potential vendor mappings and then there's no reason that pfn
path would need to update accounting on a vfio_dma_do_map(), it should
not be possible that anything is currently pinned within that IOVA
range.
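
The invalidate-on-unmap rule described above can be modeled in a few
lines: any vendor pinning that overlaps an unmapped IOVA range must be
dropped, because its GPA-to-HVA mapping may differ if the range is
mapped again. The structures here are illustrative only, not the real
vfio bookkeeping:

```c
#include <stdbool.h>

struct toy_pin {
	unsigned long iova;
	bool valid;
};

/* Invalidate every pin inside [iova, iova + size); returns the count. */
static int toy_unmap(struct toy_pin *pins, int npins,
		     unsigned long iova, unsigned long size)
{
	int i, invalidated = 0;

	for (i = 0; i < npins; i++) {
		if (pins[i].valid &&
		    pins[i].iova >= iova && pins[i].iova < iova + size) {
			pins[i].valid = false;	/* vendor mapping torn down */
			invalidated++;
		}
	}
	return invalidated;
}
```

With this invariant, a later vfio_dma_do_map() can assume nothing is
pinned within the IOVA range it is populating.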

> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>  			   struct vfio_iommu_type1_dma_map *map)
> >>  {
> >>  	dma_addr_t iova = map->iova;
> >>  	unsigned long vaddr = map->vaddr;
> >>  	size_t size = map->size;
> >> -	long npage;
> >>  	int ret = 0, prot = 0;
> >>  	uint64_t mask;
> >>  	struct vfio_dma *dma;
> >> -	unsigned long pfn;
> >>
> >>  	/* Verify that none of our __u64 fields overflow */
> >>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> >> @@ -611,29 +981,11 @@ static int vfio_dma_do_map(struct vfio_iommu  
> *iommu,
> >>  	/* Insert zero-sized and grow as we map chunks of it */
> >>  	vfio_link_dma(iommu, dma);
> >>
> >> -	while (size) {
> >> -		/* Pin a contiguous chunk of memory */
> >> -		npage = vfio_pin_pages(vaddr + dma->size,
> >> -				       size >> PAGE_SHIFT, prot, &pfn);
> >> -		if (npage <= 0) {
> >> -			WARN_ON(!npage);
> >> -			ret = (int)npage;
> >> -			break;
> >> -		}
> >> -
> >> -		/* Map it! */
> >> -		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
> >> -		if (ret) {
> >> -			vfio_unpin_pages(pfn, npage, prot, true);
> >> -			break;
> >> -		}
> >> -
> >> -		size -= npage << PAGE_SHIFT;
> >> -		dma->size += npage << PAGE_SHIFT;
> >> -	}
> >> -
> >> -	if (ret)
> >> -		vfio_remove_dma(iommu, dma);
> >> +	/* Don't pin and map if container doesn't contain IOMMU capable  
> domain*/
> >> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> >> +		dma->size = size;
> >> +	else
> >> +		ret = vfio_pin_map_dma(iommu, dma, size);
> >>
> >>  	mutex_unlock(&iommu->lock);
> >>  	return ret;
> >> @@ -662,10 +1014,6 @@ static int vfio_iommu_replay(struct vfio_iommu  
> *iommu,
> >>  	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
> >>  	n = rb_first(&iommu->dma_list);
> >>
> >> -	/* If there's not a domain, there better not be any mappings */
> >> -	if (WARN_ON(n && !d))
> >> -		return -EINVAL;
> >> -
> >>  	for (; n; n = rb_next(n)) {
> >>  		struct vfio_dma *dma;
> >>  		dma_addr_t iova;
> >> @@ -674,20 +1022,43 @@ static int vfio_iommu_replay(struct vfio_iommu  
> *iommu,
> >>  		iova = dma->iova;
> >>
> >>  		while (iova < dma->iova + dma->size) {
> >> -			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
> >> +			phys_addr_t phys;
> >>  			size_t size;
> >>
> >> -			if (WARN_ON(!phys)) {
> >> -				iova += PAGE_SIZE;
> >> -				continue;
> >> -			}
> >> +			if (dma->iommu_mapped) {
> >> +				phys = iommu_iova_to_phys(d->domain, iova);
> >> +
> >> +				if (WARN_ON(!phys)) {
> >> +					iova += PAGE_SIZE;
> >> +					continue;
> >> +				}
> >>
> >> -			size = PAGE_SIZE;
> >> +				size = PAGE_SIZE;
> >>
> >> -			while (iova + size < dma->iova + dma->size &&
> >> -			       phys + size == iommu_iova_to_phys(d->domain,
> >> +				while (iova + size < dma->iova + dma->size &&
> >> +				    phys + size == iommu_iova_to_phys(d->domain,
> >>  								 iova + size))
> >> -				size += PAGE_SIZE;
> >> +					size += PAGE_SIZE;
> >> +			} else {
> >> +				unsigned long pfn;
> >> +				unsigned long vaddr = dma->vaddr +
> >> +						     (iova - dma->iova);
> >> +				size_t n = dma->iova + dma->size - iova;
> >> +				long npage;
> >> +
> >> +				npage = __vfio_pin_pages_remote(vaddr,
> >> +								n >> PAGE_SHIFT,
> >> +								dma->prot,
> >> +								&pfn);
> >> +				if (npage <= 0) {
> >> +					WARN_ON(!npage);
> >> +					ret = (int)npage;
> >> +					return ret;
> >> +				}
> >> +
> >> +				phys = pfn << PAGE_SHIFT;
> >> +				size = npage << PAGE_SHIFT;
> >> +			}
> >>
> >>  			ret = iommu_map(domain->domain, iova, phys,
> >>  					size, dma->prot | domain->prot);
> >> @@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu  
> *iommu,
> >>
> >>  			iova += size;
> >>  		}
> >> +
> >> +		if (!dma->iommu_mapped) {
> >> +			dma->iommu_mapped = true;
> >> +			vfio_update_accounting(iommu, dma);
> >> +		}  
> >
> > This is the case where we potentially have pinned pfns and we've added
> > an iommu mapped device and need to adjust accounting.  But we've fully
> > pinned and accounted the entire iommu mapped space while still holding
> > the accounting for any pfn mapped space.  So for a time, assuming some
> > pfn pinned pages, we have duplicate accounting.  How does userspace
> > deal with that?  For instance, if I'm using an mdev device where the
> > vendor driver has pinned 512MB of guest memory, then I hot-add an
> > assigned NIC and the entire VM address space gets pinned, that pinning
> > will fail unless my locked memory limits are at least 512MB in excess
> > of my VM size.  Additionally, the user doesn't know how much memory the
> > vendor driver is going to pin, it might be the whole VM address space,
> > so the user would need 2x the locked memory limits.
> >  
> 
> Is RLIMIT_MEMLOCK really set that low? I get your point. I'll update
> __vfio_pin_pages_remote() to check whether a page being pinned is
> already accounted within __vfio_pin_pages_remote() itself.

I believe we currently support running a VM with RLIMIT set to exactly
the VM memory size.  We should not regress from that.  libvirt provides
a small "fudge factor", but we shouldn't count on it.
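
For reference, the ceiling being discussed is the per-process
RLIMIT_MEMLOCK soft limit, which userspace can query with the standard
getrlimit() call; a small sketch of how a management tool might read it
(helper name is illustrative):

```c
#include <sys/resource.h>

/*
 * Fetch the RLIMIT_MEMLOCK soft limit that pinned pages are charged
 * against.  On success returns 0, setting *bytes (0 when unlimited)
 * and *unlimited.
 */
static int memlock_limit(unsigned long long *bytes, int *unlimited)
{
	struct rlimit rlim;

	if (getrlimit(RLIMIT_MEMLOCK, &rlim) != 0)
		return -1;
	*unlimited = (rlim.rlim_cur == RLIM_INFINITY);
	*bytes = *unlimited ? 0 : (unsigned long long)rlim.rlim_cur;
	return 0;
}
```

If the vendor driver may pin an unbounded share of guest memory on top
of an iommu-backed device pinning the whole VM, the user cannot size
this limit ahead of time, which is the double-accounting problem above.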
 
> >>  	}
> >>
> >>  	return 0;
> >> @@ -734,11 +1110,24 @@ static void vfio_test_domain_fgsp(struct  
> vfio_domain *domain)
> >>  	__free_pages(pages, order);
> >>  }
> >>
> >> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
> >> +				   struct iommu_group *iommu_group)
> >> +{
> >> +	struct vfio_group *g;
> >> +
> >> +	list_for_each_entry(g, &domain->group_list, next) {
> >> +		if (g->iommu_group == iommu_group)
> >> +			return g;
> >> +	}
> >> +
> >> +	return NULL;
> >> +}  
> >
> > It would make review easier if changes like splitting this into a
> > separate function with no functional change on the calling path could
> > be a separate patch.
> >  
> 
> OK.
> 
> Thanks,
> Kirti
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v8 3/6] vfio iommu: Add support for mediated devices
@ 2016-10-13 17:12         ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-13 17:12 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, jike.song, bjsdjshi

On Thu, 13 Oct 2016 20:04:43 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/12/2016 3:36 AM, Alex Williamson wrote:
> > On Tue, 11 Oct 2016 01:58:34 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >  
> ...
> 
> 
> >> +static struct vfio_group *vfio_group_from_dev(struct device *dev)
> >> +{
> >> +	struct vfio_device *device;
> >> +	struct vfio_group *group;
> >> +	int ret;
> >> +
> >> +	device = vfio_device_get_from_dev(dev);  
> >
> > Note how this does dev->iommu_group->vfio_group->vfio_device and then
> > we back out one level to get the vfio_group, it's not a terribly
> > lightweight path.  Perhaps we should have:
> >
> > struct vfio_device *vfio_group_get_from_dev(struct device *dev)
> > {
> >         struct iommu_group *iommu_group;
> >         struct vfio_group *group;
> >
> >         iommu_group = iommu_group_get(dev);
> >         if (!iommu_group)
> >                 return NULL;
> >
> >         group = vfio_group_get_from_iommu(iommu_group);
> > 	iommu_group_put(iommu_group);
> >
> > 	return group;
> > }
> >
> > vfio_device_get_from_dev() would make use of this.
> >
> > Then create a separate:
> >
> > static int vfio_group_add_container_user(struct vfio_group *group)
> > {
> >  
> >> +	if (!atomic_inc_not_zero(&group->container_users)) {  
> > 		return -EINVAL;  
> >> +	}
> >> +
> >> +	if (group->noiommu) {
> >> +		atomic_dec(&group->container_users);  
> > 		return -EPERM;  
> >> +	}
> >> +
> >> +	if (!group->container->iommu_driver ||
> >> +	    !vfio_group_viable(group)) {
> >> +		atomic_dec(&group->container_users);  
> > 		return -EINVAL;  
> >> +	}
> >> +  
> > 	return 0;
> > }
> >
> > vfio_group_get_external_user() would be updated to use this.  In fact,
> > creating these two functions and updating the existing code to use
> > these should be a separate patch.
> >  
> 
> Ok. I'll update.
> 
> 
> > Note that your version did not hold a group reference while doing the
> > pin/unpin operations below, which seems like a bug.
> >  
> 
> container->group_lock is held for pin/unpin. I think then we don't have
> to hold the reference to group, because groups are attached and detached
> holding this lock, right?
> 
> 
> >> +
> >> +err_ret:
> >> +	vfio_device_put(device);
> >> +	return ERR_PTR(ret);
> >> +}
> >> +
> >> +/*
> >> + * Pin a set of guest PFNs and return their associated host PFNs for  
> local
> >> + * domain only.
> >> + * @dev [in] : device
> >> + * @user_pfn [in]: array of user/guest PFNs
> >> + * @npage [in]: count of array elements
> >> + * @prot [in] : protection flags
> >> + * @phys_pfn[out] : array of host PFNs
> >> + */
> >> +long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
> >> +		    long npage, int prot, unsigned long *phys_pfn)
> >> +{
> >> +	struct vfio_container *container;
> >> +	struct vfio_group *group;
> >> +	struct vfio_iommu_driver *driver;
> >> +	ssize_t ret = -EINVAL;
> >> +
> >> +	if (!dev || !user_pfn || !phys_pfn)
> >> +		return -EINVAL;
> >> +
> >> +	group = vfio_group_from_dev(dev);
> >> +	if (IS_ERR(group))
> >> +		return PTR_ERR(group);  
> >
> > As suggested above:
> >
> > 	group = vfio_group_get_from_dev(dev);
> > 	if (!group)
> > 		return -ENODEV;
> >
> > 	ret = vfio_group_add_container_user(group)
> > 	if (ret)
> > 		vfio_group_put(group);
> > 		return ret;
> > 	}
> >  
> 
> Ok.
> 
> 
> >> +
> >> +	container = group->container;
> >> +	if (IS_ERR(container))
> >> +		return PTR_ERR(container);
> >> +
> >> +	down_read(&container->group_lock);
> >> +
> >> +	driver = container->iommu_driver;
> >> +	if (likely(driver && driver->ops->pin_pages))
> >> +		ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
> >> +					     npage, prot, phys_pfn);
> >> +
> >> +	up_read(&container->group_lock);
> >> +	vfio_group_try_dissolve_container(group);  
> >
> > Even if you're considering that the container_user reference holds the
> > driver, I think we need a group reference throughout this and this
> > should end with a vfio_group_put(group);
> >  
> 
> Same as I mentioned above, container->group_lock is held here.

What allows you to assume that your @group pointer is valid when you
finish with vfio_group_try_dissolve_container()?  You have no reference
to the group, the only device reference you have is the struct device,
not the vfio_device, so that might have been unbound from vfio.  I'm
still inclined to believe you need to hold the reference to the group.

> >> +
> >> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned  
> long *pfn,
> >> +					 long npage)
> >> +{
> >> +	struct vfio_iommu *iommu = iommu_data;
> >> +	struct vfio_domain *domain = NULL;
> >> +	long unlocked = 0;
> >> +	int i;
> >> +
> >> +	if (!iommu || !pfn)
> >> +		return -EINVAL;
> >> +  
> >
> > We need iommu->lock here, right?
> >  
> 
> Oh, yes.
> 
> >> +	domain = iommu->local_domain;
> >> +
> >> +	for (i = 0; i < npage; i++) {
> >> +		struct vfio_pfn *p;
> >> +
> >> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
> >> +
> >> +		/* verify if pfn exist in pfn_list */
> >> +		p = vfio_find_pfn(domain, pfn[i]);
> >> +		if (p)
> >> +			unlocked += vfio_unpin_pfn(domain, p, true);
> >> +
> >> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);  
> >
> > We hold this mutex outside the loop in the pin unwind case, why is it
> > different here?
> >  
> 
> pin_unwind is error condition, so should be done in one go.
> Here this is not error case. Holding lock for long could block other
> threads if there are multiple threads.


Ok, iommu->lock will need to be inside the loop then too or else
there's likely no gain anyway.


> >> +static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct  
> vfio_dma *dma,
> >> +			    size_t map_size)
> >> +{
> >> +	dma_addr_t iova = dma->iova;
> >> +	unsigned long vaddr = dma->vaddr;
> >> +	size_t size = map_size, dma_size = 0;
> >> +	long npage;
> >> +	unsigned long pfn;
> >> +	int ret = 0;
> >> +
> >> +	while (size) {
> >> +		/* Pin a contiguous chunk of memory */
> >> +		npage = __vfio_pin_pages_remote(vaddr + dma_size,
> >> +						size >> PAGE_SHIFT, dma->prot,
> >> +						&pfn);
> >> +		if (npage <= 0) {
> >> +			WARN_ON(!npage);
> >> +			ret = (int)npage;
> >> +			break;
> >> +		}
> >> +
> >> +		/* Map it! */
> >> +		ret = vfio_iommu_map(iommu, iova + dma_size, pfn, npage,
> >> +				     dma->prot);
> >> +		if (ret) {
> >> +			__vfio_unpin_pages_remote(pfn, npage, dma->prot, true);
> >> +			break;
> >> +		}
> >> +
> >> +		size -= npage << PAGE_SHIFT;
> >> +		dma_size += npage << PAGE_SHIFT;
> >> +	}
> >> +
> >> +	if (ret)
> >> +		vfio_remove_dma(iommu, dma);  
> >
> >
> > There's a bug introduced here, vfio_remove_dma() needs dma->size to be
> > accurate to the point of failure, it's not updated until the success
> > branch below, so it's never going to unmap/unpin anything.
> >  
> 
> Ops, yes. I'll fix this.
> 
> >> +	else {
> >> +		dma->size = dma_size;
> >> +		dma->iommu_mapped = true;
> >> +		vfio_update_accounting(iommu, dma);  
> >
> > I'm confused how this works, when called from vfio_dma_do_map() we're
> > populating a vfio_dma, that is we're populating part of the iova space
> > of the device.  How could we have pinned pfns in the local address
> > space that overlap that?  It would be invalid to have such pinned pfns
> > since that part of the iova space was not previously mapped.
> >
> > Another issue is that if there were existing overlaps, userspace would
> > need to have locked memory limits sufficient for this temporary double
> > accounting.  I'm not sure how they'd come up with heuristics to handle
> > that since we're potentially looking at the bulk of VM memory in a
> > single vfio_dma entry.
> >  
> 
> I see that when QEMU boots a VM with a vGPU device attached first and
> a pass-through device attached second, the first call to
> vfio_dma_do_map() skips pin and iommu_mmap. When the pass-through
> device is attached, all mappings are unmapped and vfio_dma_do_map() is
> called again. At that moment an IOMMU-capable domain is present, so
> pin and iommu_mmap() are done on all system memory. If any pages are
> pinned by the vendor driver between these two device attaches, the
> accounting should be updated.

So that actually points out something that was on my todo list to check
in this patch, when an unmap occurs, we need to invalidate the vendor
driver mappings.  For that period you describe above, the mappings the
vendor driver holds are invalid, we cannot assume that they will
return and certainly cannot assume they will have the same GPA to HVA
mapping.  So the sequence should be that the unmap causes invalidation
of any potential vendor mappings and then there's no reason that pfn
path would need to update accounting on a vfio_dma_do_map(), it should
not be possible that anything is currently pinned within that IOVA
range.
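The dma->size bug pointed out above can be illustrated with a userspace model: the fix is to grow dma->size per mapped chunk, so that a failure-path vfio_remove_dma() sees an accurate size to unwind. All functions below are stubs standing in for the kernel helpers, not the real implementation:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE 4096

/* Userspace model: dma->size tracks bytes successfully mapped so far. */
struct mock_dma {
	size_t size;
};

/* Mock pin: always pins one page worth of memory. */
static long mock_pin_chunk(void) { return 1; }

/* Mock map: fails once 3 pages are already mapped. */
static int mock_map(struct mock_dma *dma)
{
	return dma->size >= 3 * PAGE ? -1 : 0;
}

/* Mock vfio_remove_dma(): can only unwind what dma->size reports. */
static size_t mock_remove_dma(struct mock_dma *dma)
{
	size_t unmapped = dma->size;

	dma->size = 0;
	return unmapped;
}

int pin_map_dma(struct mock_dma *dma, size_t map_size, size_t *unwound)
{
	size_t size = map_size;
	int ret = 0;

	while (size) {
		long npage = mock_pin_chunk();

		ret = mock_map(dma);
		if (ret)
			break;
		/* Grow dma->size per chunk, not only on full success. */
		size -= npage * PAGE;
		dma->size += npage * PAGE;
	}
	if (ret)
		*unwound = mock_remove_dma(dma);
	return ret;
}
```

With the original code, dma->size would still be 0 at the point of failure and the unwind would unpin nothing; updating it per chunk makes the error path see exactly the three mapped pages.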

> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>  			   struct vfio_iommu_type1_dma_map *map)
> >>  {
> >>  	dma_addr_t iova = map->iova;
> >>  	unsigned long vaddr = map->vaddr;
> >>  	size_t size = map->size;
> >> -	long npage;
> >>  	int ret = 0, prot = 0;
> >>  	uint64_t mask;
> >>  	struct vfio_dma *dma;
> >> -	unsigned long pfn;
> >>
> >>  	/* Verify that none of our __u64 fields overflow */
> >>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> >> @@ -611,29 +981,11 @@ static int vfio_dma_do_map(struct vfio_iommu  
> *iommu,
> >>  	/* Insert zero-sized and grow as we map chunks of it */
> >>  	vfio_link_dma(iommu, dma);
> >>
> >> -	while (size) {
> >> -		/* Pin a contiguous chunk of memory */
> >> -		npage = vfio_pin_pages(vaddr + dma->size,
> >> -				       size >> PAGE_SHIFT, prot, &pfn);
> >> -		if (npage <= 0) {
> >> -			WARN_ON(!npage);
> >> -			ret = (int)npage;
> >> -			break;
> >> -		}
> >> -
> >> -		/* Map it! */
> >> -		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
> >> -		if (ret) {
> >> -			vfio_unpin_pages(pfn, npage, prot, true);
> >> -			break;
> >> -		}
> >> -
> >> -		size -= npage << PAGE_SHIFT;
> >> -		dma->size += npage << PAGE_SHIFT;
> >> -	}
> >> -
> >> -	if (ret)
> >> -		vfio_remove_dma(iommu, dma);
> >> +	/* Don't pin and map if container doesn't contain IOMMU capable  
> domain*/
> >> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
> >> +		dma->size = size;
> >> +	else
> >> +		ret = vfio_pin_map_dma(iommu, dma, size);
> >>
> >>  	mutex_unlock(&iommu->lock);
> >>  	return ret;
> >> @@ -662,10 +1014,6 @@ static int vfio_iommu_replay(struct vfio_iommu  
> *iommu,
> >>  	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
> >>  	n = rb_first(&iommu->dma_list);
> >>
> >> -	/* If there's not a domain, there better not be any mappings */
> >> -	if (WARN_ON(n && !d))
> >> -		return -EINVAL;
> >> -
> >>  	for (; n; n = rb_next(n)) {
> >>  		struct vfio_dma *dma;
> >>  		dma_addr_t iova;
> >> @@ -674,20 +1022,43 @@ static int vfio_iommu_replay(struct vfio_iommu  
> *iommu,
> >>  		iova = dma->iova;
> >>
> >>  		while (iova < dma->iova + dma->size) {
> >> -			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
> >> +			phys_addr_t phys;
> >>  			size_t size;
> >>
> >> -			if (WARN_ON(!phys)) {
> >> -				iova += PAGE_SIZE;
> >> -				continue;
> >> -			}
> >> +			if (dma->iommu_mapped) {
> >> +				phys = iommu_iova_to_phys(d->domain, iova);
> >> +
> >> +				if (WARN_ON(!phys)) {
> >> +					iova += PAGE_SIZE;
> >> +					continue;
> >> +				}
> >>
> >> -			size = PAGE_SIZE;
> >> +				size = PAGE_SIZE;
> >>
> >> -			while (iova + size < dma->iova + dma->size &&
> >> -			       phys + size == iommu_iova_to_phys(d->domain,
> >> +				while (iova + size < dma->iova + dma->size &&
> >> +				    phys + size == iommu_iova_to_phys(d->domain,
> >>  								 iova + size))
> >> -				size += PAGE_SIZE;
> >> +					size += PAGE_SIZE;
> >> +			} else {
> >> +				unsigned long pfn;
> >> +				unsigned long vaddr = dma->vaddr +
> >> +						     (iova - dma->iova);
> >> +				size_t n = dma->iova + dma->size - iova;
> >> +				long npage;
> >> +
> >> +				npage = __vfio_pin_pages_remote(vaddr,
> >> +								n >> PAGE_SHIFT,
> >> +								dma->prot,
> >> +								&pfn);
> >> +				if (npage <= 0) {
> >> +					WARN_ON(!npage);
> >> +					ret = (int)npage;
> >> +					return ret;
> >> +				}
> >> +
> >> +				phys = pfn << PAGE_SHIFT;
> >> +				size = npage << PAGE_SHIFT;
> >> +			}
> >>
> >>  			ret = iommu_map(domain->domain, iova, phys,
> >>  					size, dma->prot | domain->prot);
> >> @@ -696,6 +1067,11 @@ static int vfio_iommu_replay(struct vfio_iommu  
> *iommu,
> >>
> >>  			iova += size;
> >>  		}
> >> +
> >> +		if (!dma->iommu_mapped) {
> >> +			dma->iommu_mapped = true;
> >> +			vfio_update_accounting(iommu, dma);
> >> +		}  
> >
> > This is the case where we potentially have pinned pfns and we've added
> > an iommu mapped device and need to adjust accounting.  But we've fully
> > pinned and accounted the entire iommu mapped space while still holding
> > the accounting for any pfn mapped space.  So for a time, assuming some
> > pfn pinned pages, we have duplicate accounting.  How does userspace
> > deal with that?  For instance, if I'm using an mdev device where the
> > vendor driver has pinned 512MB of guest memory, then I hot-add an
> > assigned NIC and the entire VM address space gets pinned, that pinning
> > will fail unless my locked memory limits are at least 512MB in excess
> > of my VM size.  Additionally, the user doesn't know how much memory the
> > vendor driver is going to pin, it might be the whole VM address space,
> > so the user would need 2x the locked memory limits.
> >  
> 
> Is RLIMIT_MEMLOCK really set that low? I get your point. I'll update
> __vfio_pin_pages_remote() to check whether a page being pinned is
> already accounted, within __vfio_pin_pages_remote() itself.

I believe we currently support running a VM with RLIMIT set to exactly
the VM memory size.  We should not regress from that.  libvirt provides
a small "fudge factor", but we shouldn't count on it.
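The dedup rule proposed above can be sketched in a few lines: when the iommu-backed path pins a page, charge the locked-memory accounting only if that page is not already accounted via the pfn path. The names and data structures here are illustrative mocks, not the vfio code:

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8

/* Pages already pinned (and accounted) via the mock vendor/pfn path. */
static bool vendor_pinned[NPAGES];
static long locked_vm;	/* what counts against RLIMIT_MEMLOCK */

/* Mock vendor-driver pin: pins and accounts one page. */
static void pin_vendor(int page)
{
	if (!vendor_pinned[page]) {
		vendor_pinned[page] = true;
		locked_vm++;
	}
}

/*
 * Mock iommu-path pin: charge accounting only for pages not already
 * accounted, so the user never needs 2x locked-memory limits during
 * the transition.
 */
static void pin_remote(int page)
{
	if (!vendor_pinned[page])
		locked_vm++;
	/* else: already charged once; do not double count */
}
```

Under this rule, hot-adding an assigned device to a VM that already has vendor-pinned pages leaves locked_vm at the VM size rather than VM size plus the vendor-pinned amount.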
 
> >>  	}
> >>
> >>  	return 0;
> >> @@ -734,11 +1110,24 @@ static void vfio_test_domain_fgsp(struct  
> vfio_domain *domain)
> >>  	__free_pages(pages, order);
> >>  }
> >>
> >> +static struct vfio_group *find_iommu_group(struct vfio_domain *domain,
> >> +				   struct iommu_group *iommu_group)
> >> +{
> >> +	struct vfio_group *g;
> >> +
> >> +	list_for_each_entry(g, &domain->group_list, next) {
> >> +		if (g->iommu_group == iommu_group)
> >> +			return g;
> >> +	}
> >> +
> >> +	return NULL;
> >> +}  
> >
> > It would make review easier if changes like splitting this into a
> > separate function with no functional change on the calling path could
> > be a separate patch.
> >  
> 
> OK.
> 
> Thanks,
> Kirti
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-14  2:22     ` Jike Song
  -1 siblings, 0 replies; 73+ messages in thread
From: Jike Song @ 2016-10-14  2:22 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: kevin.tian, cjia, kvm, qemu-devel, alex.williamson, kraxel,
	pbonzini, bjsdjshi

On 10/11/2016 04:28 AM, Kirti Wankhede wrote:
> +
> +Under per-physical device sysfs:
> +--------------------------------
> +
> +* mdev_supported_types:
> +    List of current supported mediated device types and its details are added
> +in this directory in following format:
> +
> +|- <parent phy device>
> +|--- Vendor-specific-attributes [optional]
> +|--- mdev_supported_types
> +|     |--- <type id>
> +|     |   |--- create
> +|     |   |--- name
> +|     |   |--- available_instances
> +|     |   |--- description /class
> +|     |   |--- [devices]
> +|     |--- <type id>
> +|     |   |--- create
> +|     |   |--- name
> +|     |   |--- available_instances
> +|     |   |--- description /class
> +|     |   |--- [devices]
> +|     |--- <type id>
> +|          |--- create
> +|          |--- name
> +|          |--- available_instances
> +|          |--- description /class
> +|          |--- [devices]
> +
> +[TBD : description or class is yet to be decided. This will change.]
> +
> +Under per mdev device:
> +----------------------
> +
> +|- <parent phy device>
> +|--- $MDEV_UUID
> +         |--- remove
> +         |--- {link to its type}
> +         |--- vendor-specific-attributes [optional]
> +

All mdev directories are placed directly under the physical device.

Looking at the sysfs directory of the physical device, you get:

        <parent phy device>
        |--- mdev_supported_types/
        |        |--- type1/
        |        |--- type2/
        |        |--- type3/
        |--- mdev1/
        |--- mdev2/



With an independent device between physical and mdev, and names
simplified, you will get:

        <parent phy device>
        |--- mdev/
        |        |--- supported_type1/
        |        |--- supported_type2/
        |        |--- supported_type3/
        |        |--- mdev1/
        |        |--- mdev2/

i.e. everything related to mdev is placed under one single directory -
the same as SR-IOV.  I'm not sure whether this is possible without
introducing an independent device (which you apparently dislike), but
placing so many mdev directories directly under the physical device
doesn't seem clean.
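The two layouts under discussion can be compared side by side by modeling them in a scratch directory (this is not real sysfs; directory names are illustrative):

```shell
#!/bin/sh
# Model both proposed sysfs layouts in a temp dir (not real sysfs).
base=$(mktemp -d)

# Layout 1: mdev instances directly under the parent device.
mkdir -p "$base/flat/parent/mdev_supported_types/type1" \
         "$base/flat/parent/mdev1" \
         "$base/flat/parent/mdev2"

# Layout 2: everything mdev-related grouped under one 'mdev/' directory,
# similar to how SR-IOV groups its attributes under the PF.
mkdir -p "$base/grouped/parent/mdev/supported_type1" \
         "$base/grouped/parent/mdev/mdev1" \
         "$base/grouped/parent/mdev/mdev2"

# With layout 2, a single listing shows all mdev state for the parent:
ls "$base/grouped/parent/mdev"
```

With layout 1, the mdev instance directories are interleaved with the parent device's other attributes; with layout 2 they are isolated in one subtree.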



--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-14  2:22     ` [Qemu-devel] " Jike Song
@ 2016-10-14  3:15       ` Kirti Wankhede
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-14  3:15 UTC (permalink / raw)
  To: Jike Song
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, bjsdjshi



On 10/14/2016 7:52 AM, Jike Song wrote:
> On 10/11/2016 04:28 AM, Kirti Wankhede wrote:
>> +
>> +Under per-physical device sysfs:
>> +--------------------------------
>> +
>> +* mdev_supported_types:
>> +    List of current supported mediated device types and its details are added
>> +in this directory in following format:
>> +
>> +|- <parent phy device>
>> +|--- Vendor-specific-attributes [optional]
>> +|--- mdev_supported_types
>> +|     |--- <type id>
>> +|     |   |--- create
>> +|     |   |--- name
>> +|     |   |--- available_instances
>> +|     |   |--- description /class
>> +|     |   |--- [devices]
>> +|     |--- <type id>
>> +|     |   |--- create
>> +|     |   |--- name
>> +|     |   |--- available_instances
>> +|     |   |--- description /class
>> +|     |   |--- [devices]
>> +|     |--- <type id>
>> +|          |--- create
>> +|          |--- name
>> +|          |--- available_instances
>> +|          |--- description /class
>> +|          |--- [devices]
>> +
>> +[TBD : description or class is yet to be decided. This will change.]
>> +
>> +Under per mdev device:
>> +----------------------
>> +
>> +|- <parent phy device>
>> +|--- $MDEV_UUID
>> +         |--- remove
>> +         |--- {link to its type}
>> +         |--- vendor-specific-attributes [optional]
>> +
> 
> All mdev directories are placed under physical device directly.
> 
> Looking at the sysfs directory of physical device, you get:
> 
>         <parent phy device>
>         |--- mdev_supported_types/
>         |        |--- type1/
>         |        |--- type2/
>         |        |--- type3/
>         |--- mdev1/
>         |--- mdev2/
> 
> 
> 
> With an independent device between physical and mdev, and names
> simplified, you will get:
> 
>         <parent phy device>
>         |--- mdev/
>         |        |--- supported_type1/
>         |        |--- supported_type2/
>         |        |--- supported_type3/
>         |        |--- mdev1/
>         |        |--- mdev2/
> 
> i.e. everything related to mdev are placed under one single directory -
> the same as SR-IOV.  I'm not sure if it is possible without
> introducing an independent device (which you apparently dislike), but
> placing so many mdev directories under physical doesn't seems clean.
> 
> 

I'll repeat the same example I gave in reply to Alex's question: the
parent-child relationship between devices is reflected in sysfs. There
are cases where a device has multiple children and all of them are
placed in the same parent directory:

80:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
Root Port 1a (rev 07)
80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
Root Port 2a (rev 07)
80:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
Root Port 3a in PCI Express Mode (rev 07)
80:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
0 (rev 07)
80:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
1 (rev 07)
80:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
2 (rev 07)
80:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
3 (rev 07)
80:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
4 (rev 07)
80:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
5 (rev 07)
80:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
6 (rev 07)
80:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
7 (rev 07)
80:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address
Map, VTd_Misc, System Management (rev 07)
80:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control
Status and Global Errors (rev 07)
80:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)
81:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network
Connection (rev 01)
81:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network
Connection (rev 01)
83:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI
Express Gen 3 (8.0 GT/s) Switch (rev ca)
84:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI
Express Gen 3 (8.0 GT/s) Switch (rev ca)
84:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI
Express Gen 3 (8.0 GT/s) Switch (rev ca)
85:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)
86:00.0 VGA compatible controller: NVIDIA Corporation Device 13f2 (rev a1)

In sysfs, those all appear in the same directory, under their parent root port:

# ls /sys/devices/pci0000\:80/ -l
total 0
drwxr-xr-x 8 root root    0 Oct 13 13:30 0000:80:01.0
drwxr-xr-x 7 root root    0 Oct 13 13:30 0000:80:02.0
drwxr-xr-x 6 root root    0 Oct 13 13:30 0000:80:03.0
drwxr-xr-x 6 root root    0 Oct 13 13:30 0000:80:04.0
drwxr-xr-x 6 root root    0 Oct 13 13:30 0000:80:04.1
drwxr-xr-x 6 root root    0 Oct 13 13:30 0000:80:04.2
drwxr-xr-x 6 root root    0 Oct 13 13:30 0000:80:04.3
drwxr-xr-x 6 root root    0 Oct 13 13:30 0000:80:04.4
drwxr-xr-x 6 root root    0 Oct 13 13:30 0000:80:04.5
drwxr-xr-x 6 root root    0 Oct 13 13:30 0000:80:04.6
drwxr-xr-x 6 root root    0 Oct 13 13:30 0000:80:04.7
drwxr-xr-x 3 root root    0 Oct 13 13:30 0000:80:05.0
drwxr-xr-x 3 root root    0 Oct 13 13:30 0000:80:05.2
drwxr-xr-x 3 root root    0 Oct 13 13:30 0000:80:05.4
lrwxrwxrwx 1 root root    0 Oct 13 13:30 firmware_node ->
../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:01
drwxr-xr-x 3 root root    0 Oct 13 13:30 pci_bus
drwxr-xr-x 2 root root    0 Oct 13 13:30 power
-rw-r--r-- 1 root root 4096 Oct 13 13:25 uevent


This is not only the case with PCI devices; i2c devices are another
example. The NVIDIA driver registers with the i2c bus, and when
nvidia.ko is loaded you can see i2c devices, which are children of the
GPU @ 0000:85:00.0, placed in the 0000:85:00.0 directory.

/sys/bus/pci/devices/0000:85:00.0/i2c-4
/sys/bus/pci/devices/0000:85:00.0/i2c-5

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-13 14:36                     ` [Qemu-devel] " Alex Williamson
  (?)
  (?)
@ 2016-10-14  3:31                     ` Kirti Wankhede
  2016-10-14  4:22                         ` [Qemu-devel] " Alex Williamson
  -1 siblings, 1 reply; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-14  3:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Daniel P. Berrange, pbonzini, kraxel, cjia, Song,
	Jike, kvm, qemu-devel, bjsdjshi, Laine Stump



On 10/13/2016 8:06 PM, Alex Williamson wrote:
> On Thu, 13 Oct 2016 14:52:09 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 10/13/2016 3:14 AM, Alex Williamson wrote:
>>> On Thu, 13 Oct 2016 00:32:48 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>   
>>>> On 10/12/2016 9:29 PM, Alex Williamson wrote:  
>>>>> On Wed, 12 Oct 2016 20:43:48 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>     
>>>>>> On 10/12/2016 7:22 AM, Tian, Kevin wrote:    
>>>>>>>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>>>>>>>> Sent: Wednesday, October 12, 2016 4:45 AM      
>>>>>>>>>> +* mdev_supported_types:
>>>>>>>>>> +    List of current supported mediated device types and its details are added
>>>>>>>>>> +in this directory in following format:
>>>>>>>>>> +
>>>>>>>>>> +|- <parent phy device>
>>>>>>>>>> +|--- Vendor-specific-attributes [optional]
>>>>>>>>>> +|--- mdev_supported_types
>>>>>>>>>> +|     |--- <type id>
>>>>>>>>>> +|     |   |--- create
>>>>>>>>>> +|     |   |--- name
>>>>>>>>>> +|     |   |--- available_instances
>>>>>>>>>> +|     |   |--- description /class
>>>>>>>>>> +|     |   |--- [devices]
>>>>>>>>>> +|     |--- <type id>
>>>>>>>>>> +|     |   |--- create
>>>>>>>>>> +|     |   |--- name
>>>>>>>>>> +|     |   |--- available_instances
>>>>>>>>>> +|     |   |--- description /class
>>>>>>>>>> +|     |   |--- [devices]
>>>>>>>>>> +|     |--- <type id>
>>>>>>>>>> +|          |--- create
>>>>>>>>>> +|          |--- name
>>>>>>>>>> +|          |--- available_instances
>>>>>>>>>> +|          |--- description /class
>>>>>>>>>> +|          |--- [devices]
>>>>>>>>>> +
>>>>>>>>>> +[TBD : description or class is yet to be decided. This will change.]      
>>>>>>>>>
>>>>>>>>> I thought that in previous discussions we had agreed to drop
>>>>>>>>> the <type id> concept and use the name as the unique identifier.
>>>>>>>>> When reporting these types in libvirt we won't want to report
>>>>>>>>> the type id values - we'll want the name strings to be unique.
>>>>>>>>>      
>>>>>>>>
>>>>>>>> The 'name' might not be unique but type_id will be. For example that Neo
>>>>>>>> pointed out in earlier discussion, virtual devices can come from two
>>>>>>>> different physical devices, end user would be presented with what they
>>>>>>>> had selected but there will be internal implementation differences. In
>>>>>>>> that case 'type_id' will be unique.
>>>>>>>>      
>>>>>>>
>>>>>>> Hi, Kirti, my understanding is that Neo agreed to use an unique type
>>>>>>> string (if you still called it <type id>), and then no need of additional
>>>>>>> 'name' field which can be put inside 'description' field. See below quote:
>>>>>>>       
>>>>>>
>>>>>> We had internal discussions about this within NVIDIA and found that
>>>>>> 'name' might not be unique where as 'type_id' would be unique. I'm
>>>>>> refering to Neo's mail after that, where Neo do pointed that out.
>>>>>>
>>>>>> https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg07714.html    
>>>>>
>>>>> Everyone not privy to those internal discussions, including me, seems to
>>>>> think we dropped type_id and that if a vendor does not have a stable
>>>>> name, they can compose some sort of stable type description based on the
>>>>> name+id, or even vendor+id, ex. NVIDIA-11.  So please share why we
>>>>> haven't managed to kill off type_id yet.  No matter what internal
>>>>> representation each vendor driver has of "type_id" it seems possible
>>>>> for it to come up with stable string to define a given configuration.    
>>>>
>>>>
>>>> The 'type_id' is unique and the 'name' is not; the name is just a
>>>> human-readable virtual device name. Because at this moment Intel
>>>> can't define a proper GPU class, we have to add a 'description' field
>>>> there as well to represent the features of this virtual device, once we
>>>> have all agreed with the GPU class and its mandatory attributes, the
>>>> 'description' field can be removed. Here is an example,
>>>> type_id/type_name = NVIDIA_11,
>>>> name=M60-M0Q,
>>>> description=2560x1600, 2 displays, 512MB"
>>>>
>>>> Neo's previous comment only applies to the situation where we will have
>>>> the GPU class or optional attributes defined and recognized by libvirt,
>>>> since that is not going to happen any time soon, we will have to have
>>>> the new 'description' field, and we don't want to have it mixed up with
>>>> 'name' field.
>>>>
>>>> We can definitely have something like name+id as Alex recommended to
>>>> remove the 'name' field, but it will just require libvirt to have more
>>>> logic to parse that string.  
>>>
>>> Let's use the mtty example driver provided in patch 5 so we can all
>>> more clearly see how the interfaces work.  I'll start from the
>>> beginning of my experience and work my way to the type/name thing.
>>>   
>>
>> Thanks for looking into it and getting a feel for it. I hope this helps
>> show that 'name' and 'type_id' are different.
>>
>>
>>> (please add a modules_install target to the Makefile)
>>>  
>>
>> This is an example and I feel it should not be installed in the
>> /lib/modules/../build path. It should be used to understand the
>> interface and the flow of the mdev device management life cycle. Users
>> can use insmod to load the driver:
>>
>> # insmod mtty.ko
> 
> It's not built by default, that's sufficient.  Providing a
> modules_install target makes it more accessible for testing and allows
> easier testing of module dependencies with modprobe.  insmod does not
> exercise the automatic module dependency loading.
> 
>>> # modprobe mtty
>>>
>>> Now what?  It seems like I need to have prior knowledge that this
>>> drivers supports mdev devices and I need to go hunt for them.  We need
>>> to create a class (ex. /sys/class/mdev/) where a user can find all the
>>> devices that participate in this mediated device infrastructure.  That
>>> would point me to /sys/devices/mtty.
>>>   
>>
>> You can find devices registered to the mdev framework by searching for
>> the 'mdev_supported_types' directory at the leaf nodes of devices in the
>> /sys/devices directory. Yes, we can have an 'mdev' class with links to
>> devices which are registered to the mdev framework in /sys/class/mdev/.
>>
>>
>>> # tree /sys/devices/mtty
>>> /sys/devices/mtty
>>> |-- mdev_supported_types
>>> |   `-- mtty1
>>> |       |-- available_instances (1)
>>> |       |-- create
>>> |       |-- devices
>>> |       `-- name ("Dual-port-serial")
>>> |-- mtty_dev
>>> |   `-- sample_mtty_dev ("This is phy device")
>>> |-- power
>>> |   |-- async
>>> |   |-- autosuspend_delay_ms
>>> |   |-- control
>>> |   |-- runtime_active_kids
>>> |   |-- runtime_active_time
>>> |   |-- runtime_enabled
>>> |   |-- runtime_status
>>> |   |-- runtime_suspended_time
>>> |   `-- runtime_usage
>>> `-- uevent
>>>
>>> Ok, but that was boring, we really need to have at least 2 supported
>>> types to validate the interface, so without changing the actual device
>>> backing, I pretended to have a single port vs dual port:
>>>
>>> /sys/devices/mtty
>>> |-- mdev_supported_types
>>> |   |-- mtty1
>>> |   |   |-- available_instances (24)
>>> |   |   |-- create
>>> |   |   |-- devices
>>> |   |   `-- name (Single-port-serial)
>>> |   `-- mtty2
>>> |       |-- available_instances (12)
>>> |       |-- create
>>> |       |-- devices
>>> |       `-- name (Dual-port-serial)
>>> [snip]
>>>
>>> I arbitrarily decided I have 24 ports; each single port uses 1 port
>>> and each dual port uses 2 ports.
>>>
>>> Before I start creating devices, what are we going to key the libvirt
>>> XML on?  Can we do anything to prevent vendors from colliding or do we
>>> have any way to suggest meaningful and unique type_ids?   
>>
>> Libvirt would have the parent and type_id in the XML. No two vendors
>> can own the same parent device, so I don't think vendors would collide
>> even with the same type_id, since the <parent, type_id> pair would
>> always be unique.
> 
> 
> We have a goal of supporting migration with mdev devices, Intel has
> already shown this is possible.  Tying the XML representation of an
> mdev device to a parent device is directly contradictory to that goal.
> libvirt needs a token which is unique across vendors to be able to
> instantiate an mdev device.  <parent, type_id> is unacceptable.
>  

Ok, I'll prefix the type_id with dev->driver->name in the mdev core
module. Vendor drivers should use the same format in their attribute
functions to identify the type.

>>  Presumably if
>>> we had a PCI device hosting this, we would be rooted at that parent
>>> device, ex. /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0.  Maybe
>>> the type_id should automatically be prefixed by the vendor module name,
>>> ex. mtty-1, i915-foo, nvidia-bar.  There's something missing for
>>> deterministically creating a "XYZ" device and knowing exactly what that
>>> means and finding a parent device that supports it.
>>>   
>>
>> We can prefix the type_id with the module name, i.e. using
>> dev->driver->name, but the <parent, type_id> pair is unique so I don't
>> see much benefit in doing that.
>>
>>
>>> Let's get to mdev creating...
>>>
>>> # uuidgen > mdev_supported_types/mtty2/create
>>> # tree /sys/devices/mtty
>>> /sys/devices/mtty
>>> |-- e68189be-700e-41f7-93a3-b5351e79c470
>>> |   |-- driver -> ../../../bus/mdev/drivers/vfio_mdev
>>> |   |-- iommu_group -> ../../../kernel/iommu_groups/63
>>> |   |-- mtty2 -> ../mdev_supported_types/mtty2
>>> |   |-- power
>>> |   |   |-- async
>>> |   |   |-- autosuspend_delay_ms
>>> |   |   |-- control
>>> |   |   |-- runtime_active_kids
>>> |   |   |-- runtime_active_time
>>> |   |   |-- runtime_enabled
>>> |   |   |-- runtime_status
>>> |   |   |-- runtime_suspended_time
>>> |   |   `-- runtime_usage
>>> |   |-- remove
>>> |   |-- subsystem -> ../../../bus/mdev
>>> |   |-- uevent
>>> |   `-- vendor
>>> |       `-- sample_mdev_dev ("This is MDEV e68189be-700e-41f7-93a3-b5351e79c470")
>>> |-- mdev_supported_types
>>> |   |-- mtty1
>>> |   |   |-- available_instances (22)
>>> |   |   |-- create
>>> |   |   |-- devices
>>> |   |   `-- name
>>> |   `-- mtty2
>>> |       |-- available_instances (11)
>>> |       |-- create
>>> |       |-- devices
>>> |       |   `-- e68189be-700e-41f7-93a3-b5351e79c470 -> ../../../e68189be-700e-41f7-93a3-b5351e79c470
>>> |       `-- name
>>>
>>> The mdev device was created directly under the parent, which seems like
>>> it's going to get messy to me (ie. imagine dropping a bunch of uuids
>>> into a PCI parent device's sysfs directory, how does a user know what
>>> they are?).
>>>   
>>
>> That is the way devices are placed in sysfs. For example, the devices below:
>>
>> 80:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
>> Root Port 1a (rev 07)
>> 80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
>> Root Port 2a (rev 07)
>> 80:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express
>> Root Port 3a in PCI Express Mode (rev 07)
>> 80:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 0 (rev 07)
>> 80:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 1 (rev 07)
>> 80:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 2 (rev 07)
>> 80:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 3 (rev 07)
>> 80:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 4 (rev 07)
>> 80:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 5 (rev 07)
>> 80:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 6 (rev 07)
>> 80:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel
>> 7 (rev 07)
>> 80:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address
>> Map, VTd_Misc, System Management (rev 07)
>> 80:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control
>> Status and Global Errors (rev 07)
>> 80:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)
>>
>> In sysfs, those all sit in the same parent folder as their root ports:
>>
>> # ls /sys/devices/pci0000\:80/ -l
>> total 0
>> drwxr-xr-x 8 root root    0 Oct 13 12:08 0000:80:01.0
>> drwxr-xr-x 7 root root    0 Oct 13 12:08 0000:80:02.0
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:03.0
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.0
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.1
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.2
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.3
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.4
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.5
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.6
>> drwxr-xr-x 6 root root    0 Oct 13 12:08 0000:80:04.7
>> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.0
>> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.2
>> drwxr-xr-x 3 root root    0 Oct 13 12:08 0000:80:05.4
>> lrwxrwxrwx 1 root root    0 Oct 13 13:25 firmware_node ->
>> ../LNXSYSTM:00/LNXSYBUS:00/PNP0A08:01
>> drwxr-xr-x 3 root root    0 Oct 13 12:08 pci_bus
>> drwxr-xr-x 2 root root    0 Oct 13 12:08 power
>> -rw-r--r-- 1 root root 4096 Oct 13 13:25 uevent
> 
> I agree, that's also messy, but a PCI device has a standard PCI address
> format, so just by looking at those you know it's a child PCI device.
> We can write code that understands how to parse that and understands
> what it is by the address.  On the other hand we're dropping uuids into
> a directory.  We can write code that understands how to parse it, but
> how do we know that it's a child device vs some other attribute for the
> parent?
> 


The mdev child will have an 'mdev_supported_type' link in its directory.
That should be sufficient to identify that it's a child device and not
some other attribute of the parent.


>>> Under the device we have "mtty2", shouldn't that be
>>> "mdev_supported_type", which then links to mtty2?  Otherwise a user
>>> needs to decode from the link what this attribute is.
>>>   
>>
>> I thought it should show the type, so that by looking at the 'ls'
>> output the user should be able to find the type_id.
> 
> The type_id should be shown by actually reading the link, not by the
> link name itself, the same way that the iommu_group link for a device
> isn't the group number, it links to the group number but uses a
> standard link name.
> 

Ok. I'll rename the link name to 'mdev_supported_type'

>>> Also here's an example of those vendor sysfs entries per device.  So
>>> long as the vendor never expects a tool like libvirt to manipulate
>>> attributes there, I can see how that could be pretty powerful.
>>>   
>>
>> Yes, it is good to have vendor-specific entries; libvirt might not
>> report or use them. They would be more useful for a system admin to
>> manually get extra information that libvirt doesn't report.
>>
>>
>>> Moving down to the mdev_supported_types, I've updated mtty so that it
>>> actually adjusts available instance, and we can now see a link under
>>> the devices for mtty2.
>>>
>>> Also worth noting is that a link for the device appears
>>> in /sys/bus/mdev/devices.
>>>
>>> BTW, specifying this device for QEMU vfio-pci is where the sysfsdev
>>> option comes into play:
>>>
>>> -device
>>> vfio-pci,sysfsdev=/sys/devices/mtty/e68189be-700e-41f7-93a3-b5351e79c470
>>>
>>> Which raises another question, we can tell through the vfio interfaces
>>> that this is exposed as a PCI device, by creating a container
>>> (open(/dev/vfio/vfio)), setting an iommu (ioctl(VFIO_SET_IOMMU)),
>>> adding the group to the container (ioctl(VFIO_GROUP_SET_CONTAINER)),
>>> getting the device (ioctl(VFIO_GROUP_GET_DEVICE_FD)), and finally
>>> getting the device info (ioctl(VFIO_DEVICE_GET_INFO)) and checking the
>>> flag bit that says the API is PCI.  That's a long path to go and has
>>> stumbling blocks like the type of iommu that's available for the given
>>> platform.  How do we make that manageable?   
>>
>> Do you want the device type to be expressed in sysfs? Then that should
>> be done from the vendor driver. The vfio_mdev module is now a shim
>> layer, so neither the mdev core nor the vfio_mdev module knows what
>> device type flag the vendor driver has set.
> 
> Right, the vendor driver would need to expose this, the mdev layers are
> device agnostic, they don't know or care which device API is being
> exposed.  The other question is whether it needs to be part of the
> initial implementation or can we assume pci for now and add something
> later.  I guess we already have our proof to the contrary with the
> IBM ccw device that libvirt can't simply assume pci.  I see that many
> devices in sysfs have a subsystem link, which seems rather appropriate,
> but we're not creating a real pci device, so linking to /sys/bus/pci
> or /sys/class/pci_bus both seem invalid.  Is that a dead end?  We could
> always expose vfio_device_info.flags, but that seems pretty ugly as
> well, plus the sysfs mdev interface is not vfio specific.  What if we
> had a "device_api" attribute which the vendor driver would show as
> "vfio-pci"?  Therefore the mdev interface is not tied to vfio, but we
> clearly show that a given type_id exports a vfio-pci
> interface.  Thanks,
> 

We can't use 'subsystem' for the mdev device. The kernel's device core
framework adds a subsystem link to the mdev device folder as:
 subsystem -> ../../../../../../../bus/mdev

We will have a mandatory attribute in 'supported_type_groups' which
should show "vfio-pci" or "vfio-platform" based on the device flag the
vendor driver is going to set for that type.

Thanks,
Kirti



* Re: [PATCH v8 4/6] docs: Add Documentation for Mediated devices
  2016-10-14  3:31                     ` Kirti Wankhede
@ 2016-10-14  4:22                         ` Alex Williamson
  0 siblings, 0 replies; 73+ messages in thread
From: Alex Williamson @ 2016-10-14  4:22 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Song, Jike, kvm, Tian, Kevin, qemu-devel, cjia, kraxel,
	Laine Stump, pbonzini, bjsdjshi

On Fri, 14 Oct 2016 09:01:01 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 10/13/2016 8:06 PM, Alex Williamson wrote:
> > On Thu, 13 Oct 2016 14:52:09 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 10/13/2016 3:14 AM, Alex Williamson wrote:  
> >>> Under the device we have "mtty2", shouldn't that be
> >>> "mdev_supported_type", which then links to mtty2?  Otherwise a user
> >>> needs to decode from the link what this attribute is.
> >>>     
> >>
> >> I thought it should show type, so that by looking at 'ls' output user
> >> should be able to find type_id.  
> > 
> > The type_id should be shown by actually reading the link, not by the
> > link name itself, the same way that the iommu_group link for a device
> > isn't the group number, it links to the group number but uses a
> > standard link name.
> >   
> 
> Ok. I'll rename the link name to 'mdev_supported_type'

Hmm, if we have a device, then clearly it's a supported type, we can
probably reduce this to 'mdev_type'.  Sorry for not catching that.

BTW, please include the linux-kernel <linux-kernel@vger.kernel.org>
mailing list on the CC in your next posting.  Thanks,

Alex


* Re: [PATCH v8 3/6] vfio iommu: Add support for mediated devices
  2016-10-12 10:31     ` [Qemu-devel] " Tian, Kevin
@ 2016-10-14 11:35       ` Kirti Wankhede
  -1 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-14 11:35 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Song, Jike, bjsdjshi



On 10/12/2016 4:01 PM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>> Sent: Tuesday, October 11, 2016 4:29 AM
>>
> [...]
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 2ba19424e4a1..ce6d6dcbd9a8 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -55,18 +55,26 @@ MODULE_PARM_DESC(disable_hugepages,
>>
>>  struct vfio_iommu {
>>  	struct list_head	domain_list;
>> +	struct vfio_domain	*local_domain;
> 
> Hi, Kirti, can you help explain the meaning of 'local" here? I have a hard time 
> to understand its intention... In your later change of vaddr_get_pfn, it's
> even more confusing where get_user_pages_remote is used on a 'local_mm':
> 

'local' in local_domain describes that the domain is for local page
tracking. 'local_mm' in vaddr_get_pfn() is a local variable in the
vaddr_get_pfn() function:
    struct mm_struct *local_mm = (mm ? mm : current->mm);


> +	if (mm) {
> +		down_read(&local_mm->mmap_sem);
> +		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
> +					!!(prot & IOMMU_WRITE), 0, page, NULL);
> +		up_read(&local_mm->mmap_sem);
> +	} else
> +		ret = get_user_pages_fast(vaddr, 1,
> +					  !!(prot & IOMMU_WRITE), page);
> 
> 
> [...]
>> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>> +			 int prot, unsigned long *pfn)
>>  {
>>  	struct page *page[1];
>>  	struct vm_area_struct *vma;
>> +	struct mm_struct *local_mm = (mm ? mm : current->mm);
> 
> it'd be clearer if you call this variable as 'mm' while the earlier input parameter
> as 'local_mm'.
> 

As I mentioned above, 'local' here refers to a local variable in this
function.

>>  	int ret = -EFAULT;
>>
>> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
>> +	if (mm) {
>> +		down_read(&local_mm->mmap_sem);
>> +		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
>> +					!!(prot & IOMMU_WRITE), 0, page, NULL);
>> +		up_read(&local_mm->mmap_sem);
>> +	} else
>> +		ret = get_user_pages_fast(vaddr, 1,
>> +					  !!(prot & IOMMU_WRITE), page);
>> +
>> +	if (ret == 1) {
>>  		*pfn = page_to_pfn(page[0]);
>>  		return 0;
>>  	}
>>
>> -	down_read(&current->mm->mmap_sem);
>> +	down_read(&local_mm->mmap_sem);
>>
>> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
>> +	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
>>
>>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> 
> [...]
>> +static long __vfio_pin_pages_local(struct vfio_domain *domain,
>> +				   unsigned long vaddr, int prot,
>> +				   unsigned long *pfn_base,
>> +				   bool do_accounting)
> 
> 'pages' -> 'page' since only one page is handled here.
> 
> [...]
>> +
>> +static void __vfio_unpin_pages_local(struct vfio_domain *domain,
>> +				     unsigned long pfn, int prot,
>> +				     bool do_accounting)
> 
> ditto
> 

Ok.

>> +{
>> +	put_pfn(pfn, prot);
>> +
>> +	if (do_accounting)
>> +		vfio_lock_acct(domain->local_addr_space->task, -1);
>> +}
>> +
>> +static int vfio_unpin_pfn(struct vfio_domain *domain,
>> +			  struct vfio_pfn *vpfn, bool do_accounting)
>> +{
>> +	__vfio_unpin_pages_local(domain, vpfn->pfn, vpfn->prot,
>> +				 do_accounting);
>> +
>> +	if (atomic_dec_and_test(&vpfn->ref_count))
>> +		vfio_remove_from_pfn_list(domain, vpfn);
>> +
>> +	return 1;
>> +}
>> +
>> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
>> +				       unsigned long *user_pfn,
>> +				       long npage, int prot,
>> +				       unsigned long *phys_pfn)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +	struct vfio_domain *domain;
>> +	int i, j, ret;
>> +	long retpage;
>> +	unsigned long remote_vaddr;
>> +	unsigned long *pfn = phys_pfn;
>> +	struct vfio_dma *dma;
>> +	bool do_accounting = false;
>> +
>> +	if (!iommu || !user_pfn || !phys_pfn)
>> +		return -EINVAL;
>> +
>> +	mutex_lock(&iommu->lock);
>> +
>> +	if (!iommu->local_domain) {
>> +		ret = -EINVAL;
>> +		goto pin_done;
>> +	}
>> +
>> +	domain = iommu->local_domain;
>> +
>> +	/*
>> +	 * If iommu capable domain exist in the container then all pages are
>> +	 * already pinned and accounted. Accouting should be done if there is no
>> +	 * iommu capable domain in the container.
>> +	 */
>> +	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
>> +
>> +	for (i = 0; i < npage; i++) {
>> +		struct vfio_pfn *p;
>> +		dma_addr_t iova;
>> +
>> +		iova = user_pfn[i] << PAGE_SHIFT;
>> +
>> +		dma = vfio_find_dma(iommu, iova, 0);
>> +		if (!dma) {
>> +			ret = -EINVAL;
>> +			goto pin_unwind;
>> +		}
>> +
>> +		remote_vaddr = dma->vaddr + iova - dma->iova;
> 
> again, why "remote"_vaddr on a 'local' function?
> 

It's not a 'local' function; it's the local_domain. __vfio_pin_pages_local()
pins pages for the local_domain. When this function is called from a
process other than the QEMU process, the vaddr from the QEMU process is
a remote_vaddr for the caller.


>> +
>> +		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
>> +						 &pfn[i], do_accounting);
>> +		if (retpage <= 0) {
>> +			WARN_ON(!retpage);
>> +			ret = (int)retpage;
>> +			goto pin_unwind;
>> +		}
>> +
>> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +		/* search if pfn exist */
>> +		p = vfio_find_pfn(domain, pfn[i]);
>> +		if (p) {
>> +			atomic_inc(&p->ref_count);
>> +			mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> +			continue;
>> +		}
>> +
>> +		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
>> +					   pfn[i], prot);
>> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +		if (ret) {
>> +			__vfio_unpin_pages_local(domain, pfn[i], prot,
>> +						 do_accounting);
>> +			goto pin_unwind;
>> +		}
>> +	}
>> +
>> +	ret = i;
>> +	goto pin_done;
>> +
>> +pin_unwind:
>> +	pfn[i] = 0;
>> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +	for (j = 0; j < i; j++) {
>> +		struct vfio_pfn *p;
>> +
>> +		p = vfio_find_pfn(domain, pfn[j]);
>> +		if (p)
>> +			vfio_unpin_pfn(domain, p, do_accounting);
>> +
>> +		pfn[j] = 0;
>> +	}
>> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +pin_done:
>> +	mutex_unlock(&iommu->lock);
>> +	return ret;
>> +}
>> +
>> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
>> +					 long npage)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +	struct vfio_domain *domain = NULL;
>> +	long unlocked = 0;
>> +	int i;
>> +
>> +	if (!iommu || !pfn)
>> +		return -EINVAL;
>> +
> 
> acquire iommu lock...
> 

Yes, Alex has pointed this out and I'm going to fix it in v9.

>> +	domain = iommu->local_domain;
>> +
>> +	for (i = 0; i < npage; i++) {
>> +		struct vfio_pfn *p;
>> +
>> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +		/* verify if pfn exist in pfn_list */
>> +		p = vfio_find_pfn(domain, pfn[i]);
>> +		if (p)
>> +			unlocked += vfio_unpin_pfn(domain, p, true);
> 
> Should we force update accounting here even when there is iommu capable
> domain? It's not consistent to earlier pin_pages.
> 

Yes, fixing this.

>> +
>> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> +	}
>>
>>  	return unlocked;
>>  }
>> @@ -341,6 +636,12 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct
>> vfio_dma *dma)
>>
>>  	if (!dma->size)
>>  		return;
>> +
>> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
>> +		return;
> 
> Is above check redundant to following dma->iommu_mapped?
> 

I'm going to remove dma->iommu_mapped and change the accounting code as
per Alex's comments and the problem that Alex pointed out.

Thanks,
Kirti

>> +
>> +	if (!dma->iommu_mapped)
>> +		return;
>>  	/*
>>  	 * We use the IOMMU to track the physical addresses, otherwise we'd
>>  	 * need a much more complicated tracking system.  Unfortunately that
> 
> Thanks
> Kevin
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [Qemu-devel] [PATCH v8 3/6] vfio iommu: Add support for mediated devices
@ 2016-10-14 11:35       ` Kirti Wankhede
  0 siblings, 0 replies; 73+ messages in thread
From: Kirti Wankhede @ 2016-10-14 11:35 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Song, Jike, bjsdjshi



On 10/12/2016 4:01 PM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>> Sent: Tuesday, October 11, 2016 4:29 AM
>>
> [...]
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 2ba19424e4a1..ce6d6dcbd9a8 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -55,18 +55,26 @@ MODULE_PARM_DESC(disable_hugepages,
>>
>>  struct vfio_iommu {
>>  	struct list_head	domain_list;
>> +	struct vfio_domain	*local_domain;
> 
> Hi, Kirti, can you help explain the meaning of 'local" here? I have a hard time 
> to understand its intention... In your later change of vaddr_get_pfn, it's
> even more confusing where get_user_pages_remote is used on a 'local_mm':
> 

'local' in local_domain is to describe that the domain for local page
tracking. 'local_mm' in vaddr_get_pfn() is local variable in
vaddr_get_pfn() function.
    struct mm_struct *local_mm = (mm ? mm : current->mm);


> +	if (mm) {
> +		down_read(&local_mm->mmap_sem);
> +		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
> +					!!(prot & IOMMU_WRITE), 0, page, NULL);
> +		up_read(&local_mm->mmap_sem);
> +	} else
> +		ret = get_user_pages_fast(vaddr, 1,
> +					  !!(prot & IOMMU_WRITE), page);
> 
> 
> [...]
>> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>> +			 int prot, unsigned long *pfn)
>>  {
>>  	struct page *page[1];
>>  	struct vm_area_struct *vma;
>> +	struct mm_struct *local_mm = (mm ? mm : current->mm);
> 
> it'd be clearer if you call this variable as 'mm' while the earlier input parameter
> as 'local_mm'.
> 

Like I mentioned above, 'local' here is for local variable in this
function.

>>  	int ret = -EFAULT;
>>
>> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
>> +	if (mm) {
>> +		down_read(&local_mm->mmap_sem);
>> +		ret = get_user_pages_remote(NULL, local_mm, vaddr, 1,
>> +					!!(prot & IOMMU_WRITE), 0, page, NULL);
>> +		up_read(&local_mm->mmap_sem);
>> +	} else
>> +		ret = get_user_pages_fast(vaddr, 1,
>> +					  !!(prot & IOMMU_WRITE), page);
>> +
>> +	if (ret == 1) {
>>  		*pfn = page_to_pfn(page[0]);
>>  		return 0;
>>  	}
>>
>> -	down_read(&current->mm->mmap_sem);
>> +	down_read(&local_mm->mmap_sem);
>>
>> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
>> +	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
>>
>>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> 
> [...]
>> +static long __vfio_pin_pages_local(struct vfio_domain *domain,
>> +				   unsigned long vaddr, int prot,
>> +				   unsigned long *pfn_base,
>> +				   bool do_accounting)
> 
> 'pages' -> 'page' since only one page is handled here.
> 
> [...]
>> +
>> +static void __vfio_unpin_pages_local(struct vfio_domain *domain,
>> +				     unsigned long pfn, int prot,
>> +				     bool do_accounting)
> 
> ditto
> 

Ok.

>> +{
>> +	put_pfn(pfn, prot);
>> +
>> +	if (do_accounting)
>> +		vfio_lock_acct(domain->local_addr_space->task, -1);
>> +}
>> +
>> +static int vfio_unpin_pfn(struct vfio_domain *domain,
>> +			  struct vfio_pfn *vpfn, bool do_accounting)
>> +{
>> +	__vfio_unpin_pages_local(domain, vpfn->pfn, vpfn->prot,
>> +				 do_accounting);
>> +
>> +	if (atomic_dec_and_test(&vpfn->ref_count))
>> +		vfio_remove_from_pfn_list(domain, vpfn);
>> +
>> +	return 1;
>> +}
>> +
>> +static long vfio_iommu_type1_pin_pages(void *iommu_data,
>> +				       unsigned long *user_pfn,
>> +				       long npage, int prot,
>> +				       unsigned long *phys_pfn)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +	struct vfio_domain *domain;
>> +	int i, j, ret;
>> +	long retpage;
>> +	unsigned long remote_vaddr;
>> +	unsigned long *pfn = phys_pfn;
>> +	struct vfio_dma *dma;
>> +	bool do_accounting = false;
>> +
>> +	if (!iommu || !user_pfn || !phys_pfn)
>> +		return -EINVAL;
>> +
>> +	mutex_lock(&iommu->lock);
>> +
>> +	if (!iommu->local_domain) {
>> +		ret = -EINVAL;
>> +		goto pin_done;
>> +	}
>> +
>> +	domain = iommu->local_domain;
>> +
>> +	/*
>> +	 * If iommu capable domain exist in the container then all pages are
>> +	 * already pinned and accounted. Accouting should be done if there is no
>> +	 * iommu capable domain in the container.
>> +	 */
>> +	do_accounting = !IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu);
>> +
>> +	for (i = 0; i < npage; i++) {
>> +		struct vfio_pfn *p;
>> +		dma_addr_t iova;
>> +
>> +		iova = user_pfn[i] << PAGE_SHIFT;
>> +
>> +		dma = vfio_find_dma(iommu, iova, 0);
>> +		if (!dma) {
>> +			ret = -EINVAL;
>> +			goto pin_unwind;
>> +		}
>> +
>> +		remote_vaddr = dma->vaddr + iova - dma->iova;
> 
> again, why "remote"_vaddr on a 'local' function?
> 

Not 'local' function, its local_domain. __vfio_pin_pages_local() pins
pages for local_domain. When this function is called from other process,
other than QEMU process, vaddr from QEMU process is remote_vaddr for
caller.


>> +
>> +		retpage = __vfio_pin_pages_local(domain, remote_vaddr, prot,
>> +						 &pfn[i], do_accounting);
>> +		if (retpage <= 0) {
>> +			WARN_ON(!retpage);
>> +			ret = (int)retpage;
>> +			goto pin_unwind;
>> +		}
>> +
>> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +		/* search if pfn exist */
>> +		p = vfio_find_pfn(domain, pfn[i]);
>> +		if (p) {
>> +			atomic_inc(&p->ref_count);
>> +			mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> +			continue;
>> +		}
>> +
>> +		ret = vfio_add_to_pfn_list(domain, remote_vaddr, iova,
>> +					   pfn[i], prot);
>> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +		if (ret) {
>> +			__vfio_unpin_pages_local(domain, pfn[i], prot,
>> +						 do_accounting);
>> +			goto pin_unwind;
>> +		}
>> +	}
>> +
>> +	ret = i;
>> +	goto pin_done;
>> +
>> +pin_unwind:
>> +	pfn[i] = 0;
>> +	mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +	for (j = 0; j < i; j++) {
>> +		struct vfio_pfn *p;
>> +
>> +		p = vfio_find_pfn(domain, pfn[j]);
>> +		if (p)
>> +			vfio_unpin_pfn(domain, p, do_accounting);
>> +
>> +		pfn[j] = 0;
>> +	}
>> +	mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +pin_done:
>> +	mutex_unlock(&iommu->lock);
>> +	return ret;
>> +}
>> +
>> +static long vfio_iommu_type1_unpin_pages(void *iommu_data, unsigned long *pfn,
>> +					 long npage)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +	struct vfio_domain *domain = NULL;
>> +	long unlocked = 0;
>> +	int i;
>> +
>> +	if (!iommu || !pfn)
>> +		return -EINVAL;
>> +
> 
> acquire iommu lock...
> 

Yes, Alex has pointed this out and I'm going to fix it in v9.

>> +	domain = iommu->local_domain;
>> +
>> +	for (i = 0; i < npage; i++) {
>> +		struct vfio_pfn *p;
>> +
>> +		mutex_lock(&domain->local_addr_space->pfn_list_lock);
>> +
>> +		/* verify if pfn exist in pfn_list */
>> +		p = vfio_find_pfn(domain, pfn[i]);
>> +		if (p)
>> +			unlocked += vfio_unpin_pfn(domain, p, true);
> 
> Should we force update accounting here even when there is iommu capable
> domain? It's not consistent to earlier pin_pages.
> 

Yes, fixing this.

>> +
>> +		mutex_unlock(&domain->local_addr_space->pfn_list_lock);
>> +	}
>>
>>  	return unlocked;
>>  }
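The unpin path mirrors the pin path: drop a reference under the list lock and release the entry when the count reaches zero, telling the caller whether a page was truly unpinned so locked-page accounting can be adjusted. A userspace sketch of that lifecycle, again with illustrative names and pthread locking standing in for the kernel mutex (not the patch's actual code):

```c
#include <pthread.h>
#include <stdlib.h>

/* Userspace model of an entry in the pinned-pfn list; names illustrative. */
struct pinned_pfn {
	unsigned long pfn;
	int ref_count;
	struct pinned_pfn *next;
};

struct pin_list {
	pthread_mutex_t lock;		/* stands in for pfn_list_lock */
	struct pinned_pfn *head;
};

/* Returns 1 if the final reference was dropped (the page is truly
 * unpinned and the caller should decrement its accounting), 0 if
 * references remain or the pfn was never pinned. */
static int unpin_one(struct pin_list *l, unsigned long pfn)
{
	int unlocked = 0;

	pthread_mutex_lock(&l->lock);
	for (struct pinned_pfn **pp = &l->head; *pp; pp = &(*pp)->next) {
		struct pinned_pfn *p = *pp;

		if (p->pfn != pfn)
			continue;
		if (--p->ref_count == 0) {
			*pp = p->next;	/* last ref: unlink and free */
			free(p);
			unlocked = 1;
		}
		break;
	}
	pthread_mutex_unlock(&l->lock);
	return unlocked;
}
```

A pfn pinned twice reports "unpinned" only on the second call, which is what lets the accounting stay balanced against the pin side.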
>> @@ -341,6 +636,12 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>>
>>  	if (!dma->size)
>>  		return;
>> +
>> +	if (!IS_IOMMU_CAPABLE_DOMAIN_IN_CONTAINER(iommu))
>> +		return;
> 
> Is the above check redundant with the following dma->iommu_mapped check?
> 

I'm going to remove dma->iommu_mapped and change the accounting code as
per Alex's comment and the problem that Alex pointed out.

Thanks,
Kirti

>> +
>> +	if (!dma->iommu_mapped)
>> +		return;
>>  	/*
>>  	 * We use the IOMMU to track the physical addresses, otherwise we'd
>>  	 * need a much more complicated tracking system.  Unfortunately that
> 
> Thanks
> Kevin
> 


* RE: [PATCH v8 3/6] vfio iommu: Add support for mediated devices
  2016-10-14 11:35       ` [Qemu-devel] " Kirti Wankhede
@ 2016-10-14 12:29         ` Tian, Kevin
  -1 siblings, 0 replies; 73+ messages in thread
From: Tian, Kevin @ 2016-10-14 12:29 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Song, Jike, bjsdjshi

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Friday, October 14, 2016 7:36 PM
> 
> 
> On 10/12/2016 4:01 PM, Tian, Kevin wrote:
> >> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> >> Sent: Tuesday, October 11, 2016 4:29 AM
> >>
> > [...]
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 2ba19424e4a1..ce6d6dcbd9a8 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -55,18 +55,26 @@ MODULE_PARM_DESC(disable_hugepages,
> >>
> >>  struct vfio_iommu {
> >>  	struct list_head	domain_list;
> >> +	struct vfio_domain	*local_domain;
> >
> > Hi, Kirti, can you help explain the meaning of 'local' here? I have a hard
> > time understanding its intention... In your later change of vaddr_get_pfn,
> > it's even more confusing where get_user_pages_remote is used on a 'local_mm':
> >
> 
> 'local' in local_domain indicates that the domain is used for local page
> tracking. 'local_mm' in vaddr_get_pfn() is a local variable in the
> vaddr_get_pfn() function.
>     struct mm_struct *local_mm = (mm ? mm : current->mm);
> 

'local page tracking' means tracking logic local to VFIO? Then when we say
'remote page tracking', who is remote? I would appreciate a code comment
describing this definition; otherwise it's easy to get confused when
'local' sometimes means who does the page tracking, while at other times
it just means a local variable. At least when I read this patch, the
immediate impression is that local_mm belongs to local_domain. :-)

Thanks
Kevin
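The naming confusion under discussion can be shown in miniature: the variable merely selects which mm to operate on, falling back to the caller's own mm when none is supplied, and is "local" only in the C-scoping sense, not because it belongs to the local_domain. A tiny userspace model of that selection, with every name hypothetical:

```c
#include <stddef.h>

/* Stand-in for the kernel's struct mm_struct; purely illustrative. */
struct mm_struct {
	int id;
};

/* Model of the kernel's per-task current->mm. */
static struct mm_struct current_mm_storage = { .id = 1 };
static struct mm_struct *current_mm = &current_mm_storage;

/* Hypothetical helper modeling the line Kirti quotes:
 *     struct mm_struct *local_mm = (mm ? mm : current->mm);
 * The result is a function-local mm pointer, nothing more. */
static struct mm_struct *pick_mm(struct mm_struct *mm)
{
	return mm ? mm : current_mm;	/* explicit mm wins, else caller's */
}
```

With an explicit mm (the get_user_pages_remote case) the caller's argument is used; with NULL the code operates on the current task's mm, so the "remote" in get_user_pages_remote refers to another task's address space, not to the tracking domain.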




Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-10 20:28 [PATCH v8 0/6] Add Mediated device support Kirti Wankhede
2016-10-10 20:28 ` [Qemu-devel] " Kirti Wankhede
2016-10-10 20:28 ` [PATCH v8 1/6] vfio: Mediated device Core driver Kirti Wankhede
2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
2016-10-10 21:00   ` Eric Blake
2016-10-10 21:00     ` Eric Blake
2016-10-11  3:51   ` Alex Williamson
2016-10-11  3:51     ` [Qemu-devel] " Alex Williamson
2016-10-11 20:13     ` Kirti Wankhede
2016-10-11 20:13       ` [Qemu-devel] " Kirti Wankhede
2016-10-12  8:39   ` Tian, Kevin
2016-10-12  8:39     ` [Qemu-devel] " Tian, Kevin
2016-10-10 20:28 ` [PATCH v8 2/6] vfio: VFIO based driver for Mediated devices Kirti Wankhede
2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
2016-10-11  3:55   ` Alex Williamson
2016-10-11  3:55     ` [Qemu-devel] " Alex Williamson
2016-10-11 20:24     ` Kirti Wankhede
2016-10-11 20:24       ` [Qemu-devel] " Kirti Wankhede
2016-10-10 20:28 ` [PATCH v8 3/6] vfio iommu: Add support for mediated devices Kirti Wankhede
2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
2016-10-11 22:06   ` Alex Williamson
2016-10-11 22:06     ` [Qemu-devel] " Alex Williamson
2016-10-12 10:38     ` Tian, Kevin
2016-10-12 10:38       ` [Qemu-devel] " Tian, Kevin
2016-10-13 14:34     ` Kirti Wankhede
2016-10-13 14:34       ` [Qemu-devel] " Kirti Wankhede
2016-10-13 17:12       ` Alex Williamson
2016-10-13 17:12         ` [Qemu-devel] " Alex Williamson
2016-10-12 10:31   ` Tian, Kevin
2016-10-12 10:31     ` [Qemu-devel] " Tian, Kevin
2016-10-14 11:35     ` Kirti Wankhede
2016-10-14 11:35       ` [Qemu-devel] " Kirti Wankhede
2016-10-14 12:29       ` Tian, Kevin
2016-10-14 12:29         ` [Qemu-devel] " Tian, Kevin
2016-10-10 20:28 ` [PATCH v8 4/6] docs: Add Documentation for Mediated devices Kirti Wankhede
2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
2016-10-11 14:14   ` Daniel P. Berrange
2016-10-11 20:44     ` Kirti Wankhede
2016-10-11 20:44       ` Kirti Wankhede
2016-10-12  1:52       ` Tian, Kevin
2016-10-12  1:52         ` [Qemu-devel] " Tian, Kevin
2016-10-12 15:13         ` Kirti Wankhede
2016-10-12 15:13           ` [Qemu-devel] " Kirti Wankhede
2016-10-12 15:59           ` Alex Williamson
2016-10-12 15:59             ` [Qemu-devel] " Alex Williamson
2016-10-12 19:02             ` Kirti Wankhede
2016-10-12 19:02               ` [Qemu-devel] " Kirti Wankhede
2016-10-12 21:44               ` Alex Williamson
2016-10-13  9:22                 ` Kirti Wankhede
2016-10-13 14:36                   ` Alex Williamson
2016-10-13 14:36                     ` [Qemu-devel] " Alex Williamson
2016-10-13 16:00                     ` Paolo Bonzini
2016-10-13 16:00                       ` [Qemu-devel] " Paolo Bonzini
2016-10-13 16:30                       ` Alex Williamson
2016-10-14  3:31                     ` Kirti Wankhede
2016-10-14  4:22                       ` Alex Williamson
2016-10-14  4:22                         ` [Qemu-devel] " Alex Williamson
2016-10-13  3:27               ` Tian, Kevin
2016-10-13  3:27                 ` [Qemu-devel] " Tian, Kevin
2016-10-14  2:22   ` Jike Song
2016-10-14  2:22     ` [Qemu-devel] " Jike Song
2016-10-14  3:15     ` Kirti Wankhede
2016-10-14  3:15       ` [Qemu-devel] " Kirti Wankhede
2016-10-10 20:28 ` [PATCH v8 5/6] Add simple sample driver for mediated device framework Kirti Wankhede
2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
2016-10-10 20:28 ` [PATCH v8 6/6] Add common functions for SET_IRQS and GET_REGION_INFO ioctls Kirti Wankhede
2016-10-10 20:28   ` [Qemu-devel] " Kirti Wankhede
2016-10-11 23:18   ` Alex Williamson
2016-10-11 23:18     ` [Qemu-devel] " Alex Williamson
2016-10-12 19:37     ` Kirti Wankhede
2016-10-12 19:37       ` [Qemu-devel] " Kirti Wankhede
2016-10-11  2:23 ` [PATCH v8 0/6] Add Mediated device support Jike Song
2016-10-11  2:23   ` [Qemu-devel] " Jike Song
