[RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support]

* [RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support]
@ 2016-05-24 19:58 ` Kirti Wankhede
  0 siblings, 0 replies; 92+ messages in thread
From: Kirti Wankhede @ 2016-05-24 19:58 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	bjsdjshi, Kirti Wankhede

This series adds mediated device support to the v4.6 Linux host kernel. Its
purpose is to provide a common interface for mediated device management that
can be used by different devices. The series introduces the Mdev core module,
which creates and manages mediated devices; a VFIO-based driver for the
mediated PCI devices created by the Mdev core module; and an update to the
VFIO type1 IOMMU module to support mediated devices.

What's new in v4?
- Renamed the 'vgpu' module to 'mdev', which stands for the generic term
  'Mediated device'.
- Moved the mdev directory under drivers/vfio, as this is an extension
  of the VFIO APIs for mediated devices.
- Updated the mdev driver so that multiple types of drivers can register
  on the mdev_bus_type bus.
- Updated the mdev core driver with mdev_put_device() and mdev_get_device()
  for mediated devices.


What's left to do?
The VFIO driver for vGPU devices does not yet support devices with MSI-X
enabled.

Please review.

Kirti Wankhede (3):
  Mediated device Core driver
  VFIO driver for mediated PCI device
  VFIO Type1 IOMMU: Add support for mediated devices

 drivers/vfio/Kconfig                |   1 +
 drivers/vfio/Makefile               |   1 +
 drivers/vfio/mdev/Kconfig           |  18 +
 drivers/vfio/mdev/Makefile          |   6 +
 drivers/vfio/mdev/mdev-core.c       | 462 +++++++++++++++++++++++++
 drivers/vfio/mdev/mdev-driver.c     | 139 ++++++++
 drivers/vfio/mdev/mdev-sysfs.c      | 312 +++++++++++++++++
 drivers/vfio/mdev/mdev_private.h    |  33 ++
 drivers/vfio/mdev/vfio_mpci.c       | 648 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
 drivers/vfio/vfio_iommu_type1.c     | 433 ++++++++++++++++++++++--
 include/linux/mdev.h                | 224 +++++++++++++
 include/linux/vfio.h                |  13 +
 14 files changed, 2259 insertions(+), 38 deletions(-)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev-core.c
 create mode 100644 drivers/vfio/mdev/mdev-driver.c
 create mode 100644 drivers/vfio/mdev/mdev-sysfs.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c
 create mode 100644 include/linux/mdev.h

-- 
2.7.0


* [RFC PATCH v4 1/3] Mediated device Core driver
  2016-05-24 19:58 ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-24 19:58   ` Kirti Wankhede
  -1 siblings, 0 replies; 92+ messages in thread
From: Kirti Wankhede @ 2016-05-24 19:58 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	bjsdjshi, Kirti Wankhede

Design for Mediated Device Driver:
The main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different
devices.

This module provides a generic interface to create a device, add it to the
mediated bus, add it to an IOMMU group, and then add it to a VFIO group.

Below is the high-level block diagram, with Nvidia, Intel and IBM devices
as examples, since these are the devices that are going to actively use
this module as of now.

 +---------------+
 |               |
 | +-----------+ |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |           | |                         |              |
 | |  mdev     | +------------------------>+              |<-> VFIO user
 | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
 | |  driver   | |                         |              |
 | |           | |                         +--------------+
 | |           | |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |           | |                         |              |
 | |           | +------------------------>+              |<-> VFIO user
 | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
 |               |                         |              |
 |  MDEV CORE    |                         +--------------+
 |   MODULE      |
 |   mdev.ko     |
 | +-----------+ |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         |  nvidia.ko   |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--------------+
 | | interface | |<------------------------+              |
 | |           | |                         |  i915.ko     |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | |           | |
 | |           | |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | +-----------+ |
 +---------------+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name:   driver name
  * @probe:  called when a new device is created
  * @remove: called when a device is removed
  * @match:  called when a new device or driver is added for this bus.
  *          Return 1 if the given device can be handled by the given
  *          driver and zero otherwise.
  * @driver: device driver structure
  *
  **/
struct mdev_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         int  (*match)  (struct device *dev);
         struct device_driver    driver;
};

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

A mediated device's driver for mdev should use this interface to register
with the core driver. With this, the mediated device's driver is
responsible for adding the mediated device to a VFIO group.
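
As an illustration of the match()/probe() flow described above, here is a
minimal user-space sketch. It is not kernel code: struct toy_device,
struct toy_mdev_driver, toy_register_driver() and toy_bus_add_device() are
hypothetical stand-ins for struct device, struct mdev_driver,
mdev_register_driver() and the bus core. The binding rules (a driver
without match() never matches, a failed probe() leaves the device for the
next driver) are assumptions modelled on mdevice_match() and on typical
driver-core behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct device; hypothetical, not the kernel type. */
struct toy_device {
	const char *kind;	/* hypothetical tag, e.g. "pci" or "ccw" */
	const char *bound_to;	/* name of the driver that claimed it, or NULL */
};

/* Same callback layout as struct mdev_driver, minus struct device_driver. */
struct toy_mdev_driver {
	const char *name;
	int  (*probe)(struct toy_device *dev);
	void (*remove)(struct toy_device *dev);
	int  (*match)(struct toy_device *dev);
};

#define TOY_MAX_DRIVERS 4
static struct toy_mdev_driver *toy_drivers[TOY_MAX_DRIVERS];
static size_t toy_ndrivers;

/* Models mdev_register_driver(): 0 on success, -1 on bad input or full table. */
static int toy_register_driver(struct toy_mdev_driver *drv)
{
	if (!drv || toy_ndrivers == TOY_MAX_DRIVERS)
		return -1;
	toy_drivers[toy_ndrivers++] = drv;
	return 0;
}

/* Models adding a device: bind to the first driver that matches and probes. */
static int toy_bus_add_device(struct toy_device *dev)
{
	size_t i;

	assert(dev != NULL);
	for (i = 0; i < toy_ndrivers; i++) {
		struct toy_mdev_driver *drv = toy_drivers[i];

		if (!drv->match || !drv->match(dev))
			continue;	/* no match callback means no match */
		if (drv->probe && drv->probe(dev))
			continue;	/* probe failed, try the next driver */
		dev->bound_to = drv->name;
		return 0;
	}
	return -1;			/* no driver claimed the device */
}

/* Two example drivers, loosely named after vfio_mpci.ko and vfio_mccw.ko. */
static int toy_match_pci(struct toy_device *dev)
{
	return strcmp(dev->kind, "pci") == 0;
}

static int toy_match_ccw(struct toy_device *dev)
{
	return strcmp(dev->kind, "ccw") == 0;
}

static int toy_probe_ok(struct toy_device *dev)
{
	(void)dev;
	return 0;
}

static struct toy_mdev_driver toy_mpci = {
	.name = "toy_mpci", .probe = toy_probe_ok, .match = toy_match_pci,
};

static struct toy_mdev_driver toy_mccw = {
	.name = "toy_mccw", .probe = toy_probe_ok, .match = toy_match_ccw,
};
```

With both toy drivers registered, a device tagged "pci" would bind to
toy_mpci and one tagged "ccw" to toy_mccw, while a device with any other
tag stays unbound. This is the reason match() exists: it lets several
driver types share one bus.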

2. Physical device driver interface
This interface provides the vendor driver with a set of APIs to manage
physical-device-related work in its own driver. The APIs are:
- supported_config: to provide the list of configurations supported by the
		    vendor driver.
- create: to allocate basic resources in vendor driver for a mediated
	  device.
- destroy: to free resources in vendor driver when mediated device is
	   destroyed.
- start: to initiate mediated device initialization process from vendor
	 driver when VM boots and before QEMU starts.
- shutdown: to tear down mediated device resources during VM teardown.
- read: read emulation callback.
- write: write emulation callback.
- set_irqs: to send the interrupt configuration information that QEMU sets.
- get_region_info: to provide region size and its flags for the mediated
		   device.
- validate_map_request: to validate a remap pfn request.

This registration interface should be used by vendor drivers to register
each physical device with the mdev core driver.
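
The registration and optional-callback behaviour described above can be
sketched in user space as follows. This is a simplified model, not the
patch's code: struct toy_phy_ops, toy_register_device(), toy_create_mdev()
and toy_destroy_mdev() are hypothetical names, the callback signatures are
reduced to a parent pointer and an instance number, and the fixed-size
table replaces the kernel list. It assumes that a missing callback counts
as success and that a non-zero return from destroy() means the vendor
driver refused the removal.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Two of the optional vendor callbacks, with simplified signatures. */
struct toy_phy_ops {
	int (*create)(void *parent, int instance);
	int (*destroy)(void *parent, int instance);
};

struct toy_phy_device {
	void *parent;			/* stands in for struct device * */
	const struct toy_phy_ops *ops;
};

#define TOY_MAX_PHY 4
static struct toy_phy_device toy_phys[TOY_MAX_PHY];
static size_t toy_nphys;

/* Look up a registered physical device by its parent pointer. */
static struct toy_phy_device *toy_find_phy(void *parent)
{
	size_t i;

	assert(toy_nphys <= TOY_MAX_PHY);
	for (i = 0; i < toy_nphys; i++)
		if (toy_phys[i].parent == parent)
			return &toy_phys[i];
	return NULL;
}

/* Models mdev_register_device(): reject bad input and duplicates. */
static int toy_register_device(void *parent, const struct toy_phy_ops *ops)
{
	if (!parent || !ops)
		return -EINVAL;
	if (toy_find_phy(parent))
		return -EEXIST;
	if (toy_nphys == TOY_MAX_PHY)
		return -ENOMEM;		/* table full (sketch-only limit) */
	toy_phys[toy_nphys].parent = parent;
	toy_phys[toy_nphys].ops = ops;
	toy_nphys++;
	return 0;
}

/* Models the create path: call the vendor callback only if it is set. */
static int toy_create_mdev(void *parent, int instance)
{
	struct toy_phy_device *phy = toy_find_phy(parent);

	if (!phy)
		return -EINVAL;
	if (phy->ops->create)
		return phy->ops->create(parent, instance);
	return 0;
}

/* Models the destroy path: a non-zero result means removal was refused. */
static int toy_destroy_mdev(void *parent, int instance)
{
	struct toy_phy_device *phy = toy_find_phy(parent);

	if (!phy)
		return -EINVAL;
	if (phy->ops->destroy)
		return phy->ops->destroy(parent, instance);
	return 0;
}

/* Example vendor driver: supports only instances 0 and 1, no destroy(). */
static int toy_vendor_create(void *parent, int instance)
{
	(void)parent;
	return instance < 2 ? 0 : -ENOSPC;
}

static const struct toy_phy_ops toy_vendor_ops = {
	.create = toy_vendor_create,
};
```

Registering the same parent twice yields -EEXIST, mirroring the duplicate
check in mdev_register_device(), and creating a mediated device on an
unregistered parent yields -EINVAL, as create_mdev_device() does.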

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I88f4482f7608f40550a152c5f882b64271287c62
---
 drivers/vfio/Kconfig             |   1 +
 drivers/vfio/Makefile            |   1 +
 drivers/vfio/mdev/Kconfig        |  11 +
 drivers/vfio/mdev/Makefile       |   5 +
 drivers/vfio/mdev/mdev-core.c    | 462 +++++++++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev-driver.c  | 139 ++++++++++++
 drivers/vfio/mdev/mdev-sysfs.c   | 312 ++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_private.h |  33 +++
 include/linux/mdev.h             | 224 +++++++++++++++++++
 9 files changed, 1188 insertions(+)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev-core.c
 create mode 100644 drivers/vfio/mdev/mdev-driver.c
 create mode 100644 drivers/vfio/mdev/mdev-sysfs.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 include/linux/mdev.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index da6e2ce77495..23eced02aaf6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
 
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
+source "drivers/vfio/mdev/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f63fea..7c70753e54ab 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
+obj-$(CONFIG_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
new file mode 100644
index 000000000000..951e2bb06a3f
--- /dev/null
+++ b/drivers/vfio/mdev/Kconfig
@@ -0,0 +1,11 @@
+
+config MDEV
+    tristate "Mediated device driver framework"
+    depends on VFIO
+    default n
+    help
+        MDEV provides a framework to virtualize devices without SR-IOV
+        capability. See Documentation/mdev.txt for more details.
+
+        If you don't know what to do here, say N.
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
new file mode 100644
index 000000000000..4adb069febce
--- /dev/null
+++ b/drivers/vfio/mdev/Makefile
@@ -0,0 +1,5 @@
+
+mdev-y := mdev-core.o mdev-sysfs.o mdev-driver.o
+
+obj-$(CONFIG_MDEV) += mdev.o
+
diff --git a/drivers/vfio/mdev/mdev-core.c b/drivers/vfio/mdev/mdev-core.c
new file mode 100644
index 000000000000..af070d73735f
--- /dev/null
+++ b/drivers/vfio/mdev/mdev-core.c
@@ -0,0 +1,462 @@
+/*
+ * Mediated device Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION		"0.1"
+#define DRIVER_AUTHOR		"NVIDIA Corporation"
+#define DRIVER_DESC		"Mediated device Core Driver"
+
+#define MDEV_CLASS_NAME		"mdev"
+
+/*
+ * Global Structures
+ */
+
+static struct devices_list {
+	struct list_head    dev_list;
+	struct mutex        list_lock;
+} mdevices, phy_devices;
+
+/*
+ * Functions
+ */
+
+static int mdev_add_attribute_group(struct device *dev,
+				    const struct attribute_group **groups)
+{
+	return sysfs_create_groups(&dev->kobj, groups);
+}
+
+static void mdev_remove_attribute_group(struct device *dev,
+					const struct attribute_group **groups)
+{
+	sysfs_remove_groups(&dev->kobj, groups);
+}
+
+static struct mdev_device *find_mdev_device(uuid_le uuid, int instance)
+{
+	struct mdev_device *vdev = NULL, *v;
+
+	mutex_lock(&mdevices.list_lock);
+	list_for_each_entry(v, &mdevices.dev_list, next) {
+		if ((uuid_le_cmp(v->uuid, uuid) == 0) &&
+		    (v->instance == instance)) {
+			vdev = v;
+			break;
+		}
+	}
+	mutex_unlock(&mdevices.list_lock);
+	return vdev;
+}
+
+static struct mdev_device *find_next_mdev_device(struct phy_device *phy_dev)
+{
+	struct mdev_device *mdev = NULL, *p;
+
+	mutex_lock(&mdevices.list_lock);
+	list_for_each_entry(p, &mdevices.dev_list, next) {
+		if (p->phy_dev == phy_dev) {
+			mdev = p;
+			break;
+		}
+	}
+	mutex_unlock(&mdevices.list_lock);
+	return mdev;
+}
+
+static struct phy_device *find_physical_device(struct device *dev)
+{
+	struct phy_device *pdev = NULL, *p;
+
+	mutex_lock(&phy_devices.list_lock);
+	list_for_each_entry(p, &phy_devices.dev_list, next) {
+		if (p->dev == dev) {
+			pdev = p;
+			break;
+		}
+	}
+	mutex_unlock(&phy_devices.list_lock);
+	return pdev;
+}
+
+static void mdev_destroy_device(struct mdev_device *mdevice)
+{
+	struct phy_device *phy_dev = mdevice->phy_dev;
+
+	if (phy_dev) {
+		mutex_lock(&phy_devices.list_lock);
+
+		/*
+		* If vendor driver doesn't return success that means vendor
+		* driver doesn't support hot-unplug
+		*/
+		if (phy_dev->ops->destroy) {
+			if (phy_dev->ops->destroy(phy_dev->dev, mdevice->uuid,
+						  mdevice->instance)) {
+				mutex_unlock(&phy_devices.list_lock);
+				return;
+			}
+		}
+
+		mdev_remove_attribute_group(&mdevice->dev,
+					    phy_dev->ops->mdev_attr_groups);
+		mdevice->phy_dev = NULL;
+		mutex_unlock(&phy_devices.list_lock);
+	}
+
+	mdev_put_device(mdevice);
+	device_unregister(&mdevice->dev);
+}
+
+/*
+ * Find mediated device from given iommu_group and increment refcount of
+ * mediated device. Caller should call mdev_put_device() when the use of
+ * mdev_device is done.
+ */
+struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
+{
+	struct mdev_device *mdev = NULL, *p;
+
+	mutex_lock(&mdevices.list_lock);
+	list_for_each_entry(p, &mdevices.dev_list, next) {
+		if (!p->group)
+			continue;
+
+		if (iommu_group_id(p->group) == iommu_group_id(group)) {
+			mdev = mdev_get_device(p);
+			break;
+		}
+	}
+	mutex_unlock(&mdevices.list_lock);
+	return mdev;
+}
+EXPORT_SYMBOL_GPL(mdev_get_device_by_group);
+
+/*
+ * mdev_register_device : Register a device
+ * @dev: device structure representing physical device.
+ * @ops: Physical device operation structure to be registered.
+ *
+ * Add device to list of registered physical devices.
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_device(struct device *dev, const struct phy_device_ops *ops)
+{
+	int ret = 0;
+	struct phy_device *phy_dev, *pdev;
+
+	if (!dev || !ops)
+		return -EINVAL;
+
+	/* Check for duplicate */
+	pdev = find_physical_device(dev);
+	if (pdev)
+		return -EEXIST;
+
+	phy_dev = kzalloc(sizeof(*phy_dev), GFP_KERNEL);
+	if (!phy_dev)
+		return -ENOMEM;
+
+	phy_dev->dev = dev;
+	phy_dev->ops = ops;
+
+	mutex_lock(&phy_devices.list_lock);
+	ret = mdev_create_sysfs_files(dev);
+	if (ret)
+		goto add_sysfs_error;
+
+	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
+	if (ret)
+		goto add_group_error;
+
+	list_add(&phy_dev->next, &phy_devices.dev_list);
+	dev_info(dev, "MDEV: Registered\n");
+	mutex_unlock(&phy_devices.list_lock);
+
+	return 0;
+
+add_group_error:
+	mdev_remove_sysfs_files(dev);
+add_sysfs_error:
+	mutex_unlock(&phy_devices.list_lock);
+	kfree(phy_dev);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_register_device);
+
+/*
+ * mdev_unregister_device : Unregister a physical device
+ * @dev: device structure representing physical device.
+ *
+ * Remove device from list of registered physical devices. Gives a chance to
+ * free existing mediated devices for the given physical device.
+ */
+
+void mdev_unregister_device(struct device *dev)
+{
+	struct phy_device *phy_dev;
+	struct mdev_device *vdev = NULL;
+
+	phy_dev = find_physical_device(dev);
+
+	if (!phy_dev)
+		return;
+
+	dev_info(dev, "MDEV: Unregistering\n");
+
+	while ((vdev = find_next_mdev_device(phy_dev)))
+		mdev_destroy_device(vdev);
+
+	mutex_lock(&phy_devices.list_lock);
+	list_del(&phy_dev->next);
+	mutex_unlock(&phy_devices.list_lock);
+
+	mdev_remove_attribute_group(dev,
+				    phy_dev->ops->dev_attr_groups);
+
+	mdev_remove_sysfs_files(dev);
+	kfree(phy_dev);
+}
+EXPORT_SYMBOL(mdev_unregister_device);
+
+/*
+ * Functions required for mdev-sysfs
+ */
+
+static struct mdev_device *mdev_device_alloc(uuid_le uuid, int instance)
+{
+	struct mdev_device *mdevice = NULL;
+
+	mdevice = kzalloc(sizeof(*mdevice), GFP_KERNEL);
+	if (!mdevice)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&mdevice->kref);
+	memcpy(&mdevice->uuid, &uuid, sizeof(uuid_le));
+	mdevice->instance = instance;
+	mutex_init(&mdevice->ops_lock);
+
+	return mdevice;
+}
+
+static void mdev_device_release(struct device *dev)
+{
+	struct mdev_device *mdevice = to_mdev_device(dev);
+
+	if (!mdevice)
+		return;
+
+	dev_info(&mdevice->dev, "MDEV: destroying\n");
+
+	mutex_lock(&mdevices.list_lock);
+	list_del(&mdevice->next);
+	mutex_unlock(&mdevices.list_lock);
+
+	kfree(mdevice);
+}
+
+int create_mdev_device(struct device *dev, uuid_le uuid, uint32_t instance,
+		       char *mdev_params)
+{
+	int retval = 0;
+	struct mdev_device *mdevice = NULL;
+	struct phy_device *phy_dev;
+
+	phy_dev = find_physical_device(dev);
+	if (!phy_dev)
+		return -EINVAL;
+
+	mdevice = mdev_device_alloc(uuid, instance);
+	if (IS_ERR(mdevice)) {
+		retval = PTR_ERR(mdevice);
+		return retval;
+	}
+
+	mdevice->dev.parent  = dev;
+	mdevice->dev.bus     = &mdev_bus_type;
+	mdevice->dev.release = mdev_device_release;
+	dev_set_name(&mdevice->dev, "%pUb-%d", uuid.b, instance);
+
+	mutex_lock(&mdevices.list_lock);
+	list_add(&mdevice->next, &mdevices.dev_list);
+	mutex_unlock(&mdevices.list_lock);
+
+	retval = device_register(&mdevice->dev);
+	if (retval) {
+		mdev_put_device(mdevice);
+		return retval;
+	}
+
+	mutex_lock(&phy_devices.list_lock);
+	if (phy_dev->ops->create) {
+		retval = phy_dev->ops->create(dev, mdevice->uuid,
+					      instance, mdev_params);
+		if (retval)
+			goto create_failed;
+	}
+
+	retval = mdev_add_attribute_group(&mdevice->dev,
+					  phy_dev->ops->mdev_attr_groups);
+	if (retval)
+		goto create_failed;
+
+	mdevice->phy_dev = phy_dev;
+	mutex_unlock(&phy_devices.list_lock);
+	mdev_get_device(mdevice);
+	dev_info(&mdevice->dev, "MDEV: created\n");
+
+	return retval;
+
+create_failed:
+	mutex_unlock(&phy_devices.list_lock);
+	device_unregister(&mdevice->dev);
+	return retval;
+}
+
+int destroy_mdev_device(uuid_le uuid, uint32_t instance)
+{
+	struct mdev_device *vdev;
+
+	vdev = find_mdev_device(uuid, instance);
+
+	if (!vdev)
+		return -EINVAL;
+
+	mdev_destroy_device(vdev);
+	return 0;
+}
+
+void get_mdev_supported_types(struct device *dev, char *str)
+{
+	struct phy_device *phy_dev;
+
+	phy_dev = find_physical_device(dev);
+
+	if (phy_dev) {
+		mutex_lock(&phy_devices.list_lock);
+		if (phy_dev->ops->supported_config)
+			phy_dev->ops->supported_config(phy_dev->dev, str);
+		mutex_unlock(&phy_devices.list_lock);
+	}
+}
+
+int mdev_start_callback(uuid_le uuid, uint32_t instance)
+{
+	int ret = 0;
+	struct mdev_device *mdevice;
+	struct phy_device *phy_dev;
+
+	mdevice = find_mdev_device(uuid, instance);
+
+	if (!mdevice)
+		return -EINVAL;
+
+	phy_dev = mdevice->phy_dev;
+
+	mutex_lock(&phy_devices.list_lock);
+	if (phy_dev->ops->start)
+		ret = phy_dev->ops->start(mdevice->uuid);
+	mutex_unlock(&phy_devices.list_lock);
+
+	if (ret < 0)
+		pr_err("mdev_start failed  %d\n", ret);
+	else
+		kobject_uevent(&mdevice->dev.kobj, KOBJ_ONLINE);
+
+	return ret;
+}
+
+int mdev_shutdown_callback(uuid_le uuid, uint32_t instance)
+{
+	int ret = 0;
+	struct mdev_device *mdevice;
+	struct phy_device *phy_dev;
+
+	mdevice = find_mdev_device(uuid, instance);
+
+	if (!mdevice)
+		return -EINVAL;
+
+	phy_dev = mdevice->phy_dev;
+
+	mutex_lock(&phy_devices.list_lock);
+	if (phy_dev->ops->shutdown)
+		ret = phy_dev->ops->shutdown(mdevice->uuid);
+	mutex_unlock(&phy_devices.list_lock);
+
+	if (ret < 0)
+		pr_err("mdev_shutdown failed %d\n", ret);
+	else
+		kobject_uevent(&mdevice->dev.kobj, KOBJ_OFFLINE);
+
+	return ret;
+}
+
+static struct class mdev_class = {
+	.name		= MDEV_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= mdev_class_attrs,
+};
+
+static int __init mdev_init(void)
+{
+	int rc = 0;
+
+	mutex_init(&mdevices.list_lock);
+	INIT_LIST_HEAD(&mdevices.dev_list);
+	mutex_init(&phy_devices.list_lock);
+	INIT_LIST_HEAD(&phy_devices.dev_list);
+
+	rc = class_register(&mdev_class);
+	if (rc < 0) {
+		pr_err("Failed to register mdev class\n");
+		return rc;
+	}
+
+	rc = mdev_bus_register();
+	if (rc < 0) {
+		pr_err("Failed to register mdev bus\n");
+		class_unregister(&mdev_class);
+		return rc;
+	}
+
+	return rc;
+}
+
+static void __exit mdev_exit(void)
+{
+	mdev_bus_unregister();
+	class_unregister(&mdev_class);
+}
+
+module_init(mdev_init)
+module_exit(mdev_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/mdev/mdev-driver.c b/drivers/vfio/mdev/mdev-driver.c
new file mode 100644
index 000000000000..bc8a169782bc
--- /dev/null
+++ b/drivers/vfio/mdev/mdev-driver.c
@@ -0,0 +1,139 @@
+/*
+ * MDEV driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+static int mdevice_attach_iommu(struct mdev_device *mdevice)
+{
+	int retval = 0;
+	struct iommu_group *group = NULL;
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		dev_err(&mdevice->dev, "MDEV: failed to allocate group!\n");
+		return PTR_ERR(group);
+	}
+
+	retval = iommu_group_add_device(group, &mdevice->dev);
+	if (retval) {
+		dev_err(&mdevice->dev, "MDEV: failed to add dev to group!\n");
+		goto attach_fail;
+	}
+
+	mdevice->group = group;
+
+	dev_info(&mdevice->dev, "MDEV: group_id = %d\n",
+				 iommu_group_id(group));
+attach_fail:
+	iommu_group_put(group);
+	return retval;
+}
+
+static void mdevice_detach_iommu(struct mdev_device *mdevice)
+{
+	iommu_group_remove_device(&mdevice->dev);
+	dev_info(&mdevice->dev, "MDEV: detaching iommu\n");
+}
+
+static int mdevice_probe(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdevice = to_mdev_device(dev);
+	int status = 0;
+
+	status = mdevice_attach_iommu(mdevice);
+	if (status) {
+		dev_err(dev, "Failed to attach IOMMU\n");
+		return status;
+	}
+
+	if (drv && drv->probe)
+		status = drv->probe(dev);
+
+	return status;
+}
+
+static int mdevice_remove(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdevice = to_mdev_device(dev);
+
+	if (drv && drv->remove)
+		drv->remove(dev);
+
+	mdevice_detach_iommu(mdevice);
+
+	return 0;
+}
+
+static int mdevice_match(struct device *dev, struct device_driver *drv)
+{
+	int ret = 0;
+	struct mdev_driver *mdrv = to_mdev_driver(drv);
+
+	if (mdrv && mdrv->match)
+		ret = mdrv->match(dev);
+
+	return ret;
+}
+
+struct bus_type mdev_bus_type = {
+	.name		= "mdev",
+	.match		= mdevice_match,
+	.probe		= mdevice_probe,
+	.remove		= mdevice_remove,
+};
+EXPORT_SYMBOL_GPL(mdev_bus_type);
+
+/**
+ * mdev_register_driver - register a new MDEV driver
+ * @drv: the driver to register
+ * @owner: owner module of driver to register
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &mdev_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_register_driver);
+
+/**
+ * mdev_unregister_driver - unregister MDEV driver
+ * @drv: the driver to unregister
+ *
+ */
+void mdev_unregister_driver(struct mdev_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_unregister_driver);
+
+int mdev_bus_register(void)
+{
+	return bus_register(&mdev_bus_type);
+}
+
+void mdev_bus_unregister(void)
+{
+	bus_unregister(&mdev_bus_type);
+}
diff --git a/drivers/vfio/mdev/mdev-sysfs.c b/drivers/vfio/mdev/mdev-sysfs.c
new file mode 100644
index 000000000000..79d351a7a502
--- /dev/null
+++ b/drivers/vfio/mdev/mdev-sysfs.c
@@ -0,0 +1,312 @@
+/*
+ * File attributes for Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+/* Prototypes */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static DEVICE_ATTR_RO(mdev_supported_types);
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_create);
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_destroy);
+
+/* Static functions */
+
+#define UUID_CHAR_LENGTH	36
+#define UUID_BYTE_LENGTH	16
+
+#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
+
+static inline bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < UUID_CHAR_LENGTH)
+		return -EINVAL;
+
+	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			pr_err("%s err\n", __func__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+/* Functions */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	char *str;
+	ssize_t n;
+
+	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	get_mdev_supported_types(dev, str);
+
+	n = sprintf(buf, "%s\n", str);
+	kfree(str);
+
+	return n;
+}
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	char *str, *pstr;
+	char *uuid_str, *instance_str, *mdev_params = NULL;
+	uuid_le uuid;
+	uint32_t instance;
+	int ret = 0;
+
+	pstr = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_create: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (!str) {
+		pr_err("mdev_create: mdev instance not present %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	instance_str = strsep(&str, ":");
+	if (!instance_str) {
+		pr_err("mdev_create: Empty instance string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	ret = kstrtouint(instance_str, 0, &instance);
+	if (ret) {
+		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
+		goto create_error;
+	}
+
+	if (!str) {
+		pr_err("mdev_create: mdev params not specified %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	mdev_params = kstrdup(str, GFP_KERNEL);
+
+	if (!mdev_params) {
+		ret = -ENOMEM;
+		goto create_error;
+	}
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		pr_err("mdev_create: UUID parse error %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (create_mdev_device(dev, uuid, instance, mdev_params) < 0) {
+		pr_err("mdev_create: Failed to create mdev device\n");
+		ret = -EINVAL;
+		goto create_error;
+	}
+	ret = count;
+
+create_error:
+	kfree(mdev_params);
+	kfree(pstr);
+	return ret;
+}
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	char *uuid_str, *str, *pstr;
+	uuid_le uuid;
+	unsigned int instance;
+	int ret;
+
+	str = pstr = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	if (str == NULL) {
+		pr_err("mdev_destroy: instance not specified %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	ret = kstrtouint(str, 0, &instance);
+	if (ret) {
+		pr_err("mdev_destroy: instance parsing error %s\n", buf);
+		goto destroy_error;
+	}
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		pr_err("mdev_destroy: UUID parse error  %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	ret = destroy_mdev_device(uuid, instance);
+	if (ret < 0)
+		goto destroy_error;
+
+	ret = count;
+
+destroy_error:
+	kfree(pstr);
+	return ret;
+}
+
+ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
+			 const char *buf, size_t count)
+{
+	char *uuid_str;
+	uuid_le uuid;
+	int ret = 0;
+
+	uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		pr_err("mdev_start: UUID parse error  %s\n", buf);
+		ret = -EINVAL;
+		goto start_error;
+	}
+
+	ret = mdev_start_callback(uuid, 0);
+	if (ret < 0)
+		goto start_error;
+
+	ret = count;
+
+start_error:
+	kfree(uuid_str);
+	return ret;
+}
+
+ssize_t mdev_shutdown_store(struct class *class, struct class_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *uuid_str;
+	uuid_le uuid;
+	int ret = 0;
+
+	uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		pr_err("mdev_shutdown: UUID parse error %s\n", buf);
+		ret = -EINVAL;
+		goto shutdown_error;
+	}
+
+	ret = mdev_shutdown_callback(uuid, 0);
+	if (ret < 0)
+		goto shutdown_error;
+
+	ret = count;
+
+shutdown_error:
+	kfree(uuid_str);
+	return ret;
+}
+
+struct class_attribute mdev_class_attrs[] = {
+	__ATTR_WO(mdev_start),
+	__ATTR_WO(mdev_shutdown),
+	__ATTR_NULL
+};
+
+int mdev_create_sysfs_files(struct device *dev)
+{
+	int retval;
+
+	retval = sysfs_create_file(&dev->kobj,
+				   &dev_attr_mdev_supported_types.attr);
+	if (retval) {
+		pr_err("Failed to create mdev_supported_types sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	if (retval) {
+		pr_err("Failed to create mdev_create sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+	if (retval) {
+		pr_err("Failed to create mdev_destroy sysfs entry\n");
+		return retval;
+	}
+
+	return 0;
+}
+
+void mdev_remove_sysfs_files(struct device *dev)
+{
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+}
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
new file mode 100644
index 000000000000..a472310c7749
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -0,0 +1,33 @@
+/*
+ * Mediated device internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_PRIVATE_H
+#define MDEV_PRIVATE_H
+
+int  mdev_bus_register(void);
+void mdev_bus_unregister(void);
+
+/* Function prototypes for mdev_sysfs */
+
+extern struct class_attribute mdev_class_attrs[];
+
+int  mdev_create_sysfs_files(struct device *dev);
+void mdev_remove_sysfs_files(struct device *dev);
+
+int  create_mdev_device(struct device *dev, uuid_le uuid, uint32_t instance,
+			char *mdev_params);
+int  destroy_mdev_device(uuid_le uuid, uint32_t instance);
+void get_mdev_supported_types(struct device *dev, char *str);
+int  mdev_start_callback(uuid_le uuid, uint32_t instance);
+int  mdev_shutdown_callback(uuid_le uuid, uint32_t instance);
+
+#endif /* MDEV_PRIVATE_H */
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
new file mode 100644
index 000000000000..d9633acd85f2
--- /dev/null
+++ b/include/linux/mdev.h
@@ -0,0 +1,224 @@
+/*
+ * Mediated device definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_H
+#define MDEV_H
+
+/* Common Data structures */
+
+struct pci_region_info {
+	uint64_t start;
+	uint64_t size;
+	uint32_t flags;		/*!< VFIO region info flags */
+};
+
+enum mdev_emul_space {
+	EMUL_CONFIG_SPACE,	/*!< PCI configuration space */
+	EMUL_IO,		/*!< I/O register space */
+	EMUL_MMIO		/*!< Memory-mapped I/O space */
+};
+
+struct phy_device;
+
+/*
+ * Mediated device
+ */
+
+struct mdev_device {
+	struct kref		kref;
+	struct device		dev;
+	struct phy_device	*phy_dev;
+	struct iommu_group	*group;
+	void			*iommu_data;
+	uuid_le			uuid;
+	uint32_t		instance;
+	void			*driver_data;
+	struct mutex		ops_lock;
+	struct list_head	next;
+};
+
+
+/**
+ * struct phy_device_ops - Structure to be registered for each physical device
+ * to register the device to mdev module.
+ *
+ * @owner:		The module owner.
+ * @dev_attr_groups:	Default attributes of the physical device.
+ * @mdev_attr_groups:	Default attributes of the mediated device.
+ * @supported_config:	Called to get information about supported types.
+ *			@dev : device structure of physical device.
+ *			@config: should return string listing supported config
+ *			Returns integer: success (0) or error (< 0)
+ * @create:		Called to allocate basic resources in physical device's
+ *			driver for a particular mediated device
+ *			@dev: physical pci device structure on which mediated
+ *			      device should be created
+ *			@uuid: UUID of the VM the device is intended for
+ *			@instance: mediated instance in that VM
+ *			@mdev_params: extra parameters required by physical
+ *			device's driver.
+ *			Returns integer: success (0) or error (< 0)
+ * @destroy:		Called to free resources in physical device's driver for
+ *			a mediated device instance of that VM.
+ *			@dev: physical device structure to which this mediated
+ *			      device points.
+ *			@uuid: UUID of the VM that owns the mediated device
+ *			@instance: mdev instance in that VM
+ *			Returns integer: success (0) or error (< 0)
+ *			If VM is running and destroy() is called, the mdev is
+ *			being hot-unplugged. Return error if VM is running and
+ *			driver doesn't support mediated device hotplug.
+ * @start:		Called to initiate mediated device initialization
+ *			process in physical device's driver when VM boots before
+ *			qemu starts.
+ *			@uuid: VM's UUID which is booting.
+ *			Returns integer: success (0) or error (< 0)
+ * @shutdown:		Called to teardown mediated device related resources for
+ *			the VM
+ *			@uuid: VM's UUID which is shutting down.
+ *			Returns integer: success (0) or error (< 0)
+ * @read:		Read emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: read buffer
+ *			@count: number of bytes to read
+ *			@address_space: specifies for which address
+ *			space the request is: pci_config_space, IO
+ *			register space or MMIO space.
+ *			@pos: offset from base address.
+ *			Returns number of bytes read on success or error.
+ * @write:		Write emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: write buffer
+ *			@count: number of bytes to be written
+ *			@address_space: specifies for which address space the
+ *			request is: pci_config_space, IO register space or MMIO
+ *			space.
+ *			@pos: offset from base address.
+ *			Returns number of bytes written on success or error.
+ * @set_irqs:		Called to send interrupt configuration information
+ *			that VMM sets.
+ *			@mdev: mediated device structure
+ *			@flags, index, start, count and *data : same as that of
+ *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
+ * @get_region_info:	Called to get BAR size and flags of mediated device.
+ *			@mdev: mediated device structure
+ *			@region_index: VFIO region index
+ *			@region_info: output, returns size and flags of
+ *				      requested region.
+ *			Returns integer: success (0) or error (< 0)
+ * @validate_map_request: Validate remap pfn request
+ *			@mdev: mediated device structure
+ *			@virtaddr: target user address to start at
+ *			@pfn: physical address of kernel memory, vendor driver
+ *			      can change if required.
+ *			@size: size of map area, vendor driver can change the
+ *			       size of map area if desired.
+ *			@prot: page protection flags for this mapping, vendor
+ *			       driver can change, if required.
+ *			Returns integer: success (0) or error (< 0)
+ *
+ * Physical devices that support mediated devices should be registered with
+ * the mdev module using a phy_device_ops structure.
+ */
+
+struct phy_device_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **mdev_attr_groups;
+
+	int	(*supported_config)(struct device *dev, char *config);
+	int     (*create)(struct device *dev, uuid_le uuid,
+			  uint32_t instance, char *mdev_params);
+	int     (*destroy)(struct device *dev, uuid_le uuid,
+			   uint32_t instance);
+	int     (*start)(uuid_le uuid);
+	int     (*shutdown)(uuid_le uuid);
+	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
+			enum mdev_emul_space address_space, loff_t pos);
+	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
+			 enum mdev_emul_space address_space, loff_t pos);
+	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
+			    unsigned int index, unsigned int start,
+			    unsigned int count, void *data);
+	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
+				 struct pci_region_info *region_info);
+	int	(*validate_map_request)(struct mdev_device *vdev,
+					unsigned long virtaddr,
+					unsigned long *pfn, unsigned long *size,
+					pgprot_t *prot);
+};
+
+/*
+ * Physical Device
+ */
+struct phy_device {
+	struct device                   *dev;
+	const struct phy_device_ops     *ops;
+	struct list_head                next;
+};
+
+/**
+ * struct mdev_driver - Mediated device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @match: called when new device or driver is added for this bus. Return 1 if
+ *	   given device can be handled by given driver and zero otherwise.
+ * @driver: device driver structure
+ *
+ **/
+struct mdev_driver {
+	const char *name;
+	int  (*probe)(struct device *dev);
+	void (*remove)(struct device *dev);
+	int  (*match)(struct device *dev);
+	struct device_driver driver;
+};
+
+static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
+}
+
+static inline struct mdev_device *to_mdev_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
+}
+
+static inline struct mdev_device *mdev_get_device(struct mdev_device *vdev)
+{
+	return (vdev && get_device(&vdev->dev)) ? vdev : NULL;
+}
+
+static inline  void mdev_put_device(struct mdev_device *vdev)
+{
+	if (vdev)
+		put_device(&vdev->dev);
+}
+
+extern struct bus_type mdev_bus_type;
+
+#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
+
+extern int  mdev_register_device(struct device *dev,
+				 const struct phy_device_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+extern int mdev_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
+				uint32_t len, uint32_t flags);
+
+extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
+
+#endif /* MDEV_H */
-- 
2.7.0



* [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver
@ 2016-05-24 19:58   ` Kirti Wankhede
  0 siblings, 0 replies; 92+ messages in thread
From: Kirti Wankhede @ 2016-05-24 19:58 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	bjsdjshi, Kirti Wankhede

Design for Mediated Device Driver:
The main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different
devices.

This module provides a generic interface to create a mediated device, add it
to the mediated bus, add it to an IOMMU group and then add it to a VFIO group.
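
Device creation is driven from sysfs: mdev_create_store() in this patch
splits its input on ':' into a UUID, an instance number and vendor
parameters. The following is a userspace sketch of that split, not the
kernel code. The names parse_mdev_create and struct mdev_create_req, the
36-character UUID length check and the leading-digit check are assumptions
made for illustration; the kernel path uses strsep(), kstrtouint() and
uuid_parse() instead.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace result type; the kernel code uses uuid_le. */
struct mdev_create_req {
	char uuid[37];		/* 36-character UUID string plus NUL */
	unsigned int instance;
	const char *params;	/* points into the caller's buffer */
};

/*
 * Split "UUID:instance:params" in place.
 * Returns 0 on success or -EINVAL when a field is missing or malformed.
 */
int parse_mdev_create(char *buf, struct mdev_create_req *req)
{
	char *inst, *params, *end;
	unsigned long val;

	inst = strchr(buf, ':');
	if (!inst)
		return -EINVAL;		/* instance not present */
	*inst++ = '\0';

	params = strchr(inst, ':');
	if (!params)
		return -EINVAL;		/* params not specified */
	*params++ = '\0';

	if (strlen(buf) != 36)
		return -EINVAL;		/* assumed canonical UUID length */
	strcpy(req->uuid, buf);

	/* Reject signs and whitespace, which strtoul() would accept. */
	if (!isdigit((unsigned char)*inst))
		return -EINVAL;

	errno = 0;
	val = strtoul(inst, &end, 0);
	if (errno || *end != '\0' || val > 0xffffffffUL)
		return -EINVAL;		/* instance parsing error */

	req->instance = (unsigned int)val;
	req->params = params;
	return 0;
}
```

The destroy path takes the shorter form "UUID:instance", so the same split
applies with the params field left out.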

Below is the high-level block diagram, with Nvidia, Intel and IBM devices
as examples, since these are the devices which are going to actively use
this module as of now.

 +---------------+
 |               |
 | +-----------+ |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |           | |                         |              |
 | |  mdev     | +------------------------>+              |<-> VFIO user
 | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
 | |  driver   | |                         |              |
 | |           | |                         +--------------+
 | |           | |  mdev_register_driver() +--------------+
 | |           | +<------------------------+ __init()     |
 | |           | |                         |              |
 | |           | +------------------------>+              |<-> VFIO user
 | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
 |               |                         |              |
 |  MDEV CORE    |                         +--------------+
 |   MODULE      |
 |   mdev.ko     |
 | +-----------+ |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         |  nvidia.ko   |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--------------+
 | | interface | +<------------------------+              |
 | |           | |                         |  i915.ko     |<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | |           | |
 | |           | |  mdev_register_device() +--------------+
 | |           | +<------------------------+              |
 | |           | |                         | ccw_device.ko|<-> physical
 | |           | +------------------------>+              |    device
 | |           | |        callback         +--------------+
 | +-----------+ |
 +---------------+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
 * struct mdev_driver - Mediated device's driver
 * @name: driver name
 * @probe: called when new device created
 * @remove: called when device removed
 * @match: called when new device or driver is added for this bus.
 *	   Return 1 if given device can be handled by given driver and
 *	   zero otherwise.
 * @driver: device driver structure
 *
 **/
struct mdev_driver {
	const char *name;
	int  (*probe)(struct device *dev);
	void (*remove)(struct device *dev);
	int  (*match)(struct device *dev);
	struct device_driver driver;
};

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

A mediated device's bus driver should use this interface to register with
the core driver. Once registered, that driver is responsible for adding the
mediated device to a VFIO group.
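
The match/probe contract above can be sketched as a small userspace model.
This is not kernel code: struct model_device, struct model_driver,
model_bus_match, model_bus_bind and the example_* callbacks are hypothetical
stand-ins, and the -1 returned for "no match" is an arbitrary choice for the
model. The one behaviour taken from the patch is that a driver without a
match callback matches nothing, as in mdevice_match().

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct device and struct mdev_driver. */
struct model_device {
	const char *name;
	int bound;
};

struct model_driver {
	const char *name;
	int  (*probe)(struct model_device *dev);
	void (*remove)(struct model_device *dev);
	int  (*match)(struct model_device *dev);
};

/* Mirrors mdevice_match(): no match callback means no match. */
int model_bus_match(struct model_device *dev, struct model_driver *drv)
{
	if (drv && drv->match)
		return drv->match(dev);
	return 0;
}

/*
 * Bind dev to drv when match() returns non-zero. probe is optional.
 * Returns the probe status, or -1 when the driver does not match.
 */
int model_bus_bind(struct model_device *dev, struct model_driver *drv)
{
	int status = 0;

	if (!model_bus_match(dev, drv))
		return -1;
	if (drv->probe)
		status = drv->probe(dev);
	if (!status)
		dev->bound = 1;
	return status;
}

/* Example callbacks: claim every device and probe successfully. */
int example_match(struct model_device *dev)
{
	(void)dev;
	return 1;
}

int example_probe(struct model_device *dev)
{
	(void)dev;
	return 0;
}
```

In the real module the driver core calls the bus match and probe hooks;
the model only shows the order of the checks.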

2. Physical device driver interface
This interface gives the vendor driver a set of APIs for managing physical
device related work in its own driver. The APIs are:
- supported_config: provide supported configuration list by the vendor
		    driver
- create: to allocate basic resources in vendor driver for a mediated
	  device.
- destroy: to free resources in vendor driver when mediated device is
	   destroyed.
- start: to initiate mediated device initialization process from vendor
	 driver when VM boots and before QEMU starts.
- shutdown: to teardown mediated device resources during VM teardown.
- read: read emulation callback.
- write: write emulation callback.
- set_irqs: send interrupt configuration information that QEMU sets.
- get_region_info: to provide region size and its flags for the mediated
		   device.
- validate_map_request: to validate remap pfn request.

This registration interface should be used by vendor drivers to register
each physical device to mdev core driver.
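
How the core treats these callbacks can be sketched in userspace. This is a
reduced model, not the kernel interface: struct model_ops,
struct model_phy_device, the model_* functions, the vendor_* callbacks and
VENDOR_MAX_INSTANCES are all hypothetical. It follows two behaviours visible
in this patch: registration rejects a missing ops table, and an absent
callback is treated as success.

```c
#include <errno.h>
#include <stddef.h>

#define VENDOR_MAX_INSTANCES 4	/* hypothetical vendor limit */

/* Hypothetical, reduced stand-in for struct phy_device_ops. */
struct model_ops {
	int (*create)(unsigned int instance, const char *params);
	int (*destroy)(unsigned int instance);
};

struct model_phy_device {
	const struct model_ops *ops;
};

/* Like mdev_register_device(), refuse a missing device or ops table. */
int model_register(struct model_phy_device *pdev, const struct model_ops *ops)
{
	if (!pdev || !ops)
		return -EINVAL;
	pdev->ops = ops;
	return 0;
}

/* An absent create callback counts as success. */
int model_create(struct model_phy_device *pdev, unsigned int instance,
		 const char *params)
{
	if (pdev->ops->create)
		return pdev->ops->create(instance, params);
	return 0;
}

/* A non-zero return from destroy means the device must stay. */
int model_destroy(struct model_phy_device *pdev, unsigned int instance)
{
	if (pdev->ops->destroy)
		return pdev->ops->destroy(instance);
	return 0;
}

/* Example vendor callbacks. */
int vendor_create(unsigned int instance, const char *params)
{
	(void)params;
	return instance < VENDOR_MAX_INSTANCES ? 0 : -ENOSPC;
}

int vendor_destroy(unsigned int instance)
{
	(void)instance;
	return -EBUSY;	/* models a driver without hot-unplug support */
}

const struct model_ops vendor_ops = {
	.create = vendor_create,
	.destroy = vendor_destroy,
};
```

Keeping every callback optional lets a vendor driver fill in only the
operations its hardware needs, at the cost of a NULL check in the core
before each call.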

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I88f4482f7608f40550a152c5f882b64271287c62
---
 drivers/vfio/Kconfig             |   1 +
 drivers/vfio/Makefile            |   1 +
 drivers/vfio/mdev/Kconfig        |  11 +
 drivers/vfio/mdev/Makefile       |   5 +
 drivers/vfio/mdev/mdev-core.c    | 462 +++++++++++++++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev-driver.c  | 139 ++++++++++++
 drivers/vfio/mdev/mdev-sysfs.c   | 312 ++++++++++++++++++++++++++
 drivers/vfio/mdev/mdev_private.h |  33 +++
 include/linux/mdev.h             | 224 +++++++++++++++++++
 9 files changed, 1188 insertions(+)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev-core.c
 create mode 100644 drivers/vfio/mdev/mdev-driver.c
 create mode 100644 drivers/vfio/mdev/mdev-sysfs.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 include/linux/mdev.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index da6e2ce77495..23eced02aaf6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
 
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
+source "drivers/vfio/mdev/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f63fea..7c70753e54ab 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
+obj-$(CONFIG_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
new file mode 100644
index 000000000000..951e2bb06a3f
--- /dev/null
+++ b/drivers/vfio/mdev/Kconfig
@@ -0,0 +1,11 @@
+
+config MDEV
+    tristate "Mediated device driver framework"
+    depends on VFIO
+    default n
+    help
+        MDEV provides a framework to virtualize devices without SR-IOV
+        capability. See Documentation/mdev.txt for more details.
+
+        If you don't know what to do here, say N.
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
new file mode 100644
index 000000000000..4adb069febce
--- /dev/null
+++ b/drivers/vfio/mdev/Makefile
@@ -0,0 +1,5 @@
+
+mdev-y := mdev-core.o mdev-sysfs.o mdev-driver.o
+
+obj-$(CONFIG_MDEV) += mdev.o
+
diff --git a/drivers/vfio/mdev/mdev-core.c b/drivers/vfio/mdev/mdev-core.c
new file mode 100644
index 000000000000..af070d73735f
--- /dev/null
+++ b/drivers/vfio/mdev/mdev-core.c
@@ -0,0 +1,462 @@
+/*
+ * Mediated device Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION		"0.1"
+#define DRIVER_AUTHOR		"NVIDIA Corporation"
+#define DRIVER_DESC		"Mediated device Core Driver"
+
+#define MDEV_CLASS_NAME		"mdev"
+
+/*
+ * Global Structures
+ */
+
+static struct devices_list {
+	struct list_head    dev_list;
+	struct mutex        list_lock;
+} mdevices, phy_devices;
+
+/*
+ * Functions
+ */
+
+static int mdev_add_attribute_group(struct device *dev,
+				    const struct attribute_group **groups)
+{
+	return sysfs_create_groups(&dev->kobj, groups);
+}
+
+static void mdev_remove_attribute_group(struct device *dev,
+					const struct attribute_group **groups)
+{
+	sysfs_remove_groups(&dev->kobj, groups);
+}
+
+static struct mdev_device *find_mdev_device(uuid_le uuid, int instance)
+{
+	struct mdev_device *vdev = NULL, *v;
+
+	mutex_lock(&mdevices.list_lock);
+	list_for_each_entry(v, &mdevices.dev_list, next) {
+		if ((uuid_le_cmp(v->uuid, uuid) == 0) &&
+		    (v->instance == instance)) {
+			vdev = v;
+			break;
+		}
+	}
+	mutex_unlock(&mdevices.list_lock);
+	return vdev;
+}
+
+static struct mdev_device *find_next_mdev_device(struct phy_device *phy_dev)
+{
+	struct mdev_device *mdev = NULL, *p;
+
+	mutex_lock(&mdevices.list_lock);
+	list_for_each_entry(p, &mdevices.dev_list, next) {
+		if (p->phy_dev == phy_dev) {
+			mdev = p;
+			break;
+		}
+	}
+	mutex_unlock(&mdevices.list_lock);
+	return mdev;
+}
+
+static struct phy_device *find_physical_device(struct device *dev)
+{
+	struct phy_device *pdev = NULL, *p;
+
+	mutex_lock(&phy_devices.list_lock);
+	list_for_each_entry(p, &phy_devices.dev_list, next) {
+		if (p->dev == dev) {
+			pdev = p;
+			break;
+		}
+	}
+	mutex_unlock(&phy_devices.list_lock);
+	return pdev;
+}
+
+static void mdev_destroy_device(struct mdev_device *mdevice)
+{
+	struct phy_device *phy_dev = mdevice->phy_dev;
+
+	if (phy_dev) {
+		mutex_lock(&phy_devices.list_lock);
+
+		/*
+		 * If vendor driver doesn't return success that means vendor
+		 * driver doesn't support hot-unplug
+		 */
+		if (phy_dev->ops->destroy) {
+			if (phy_dev->ops->destroy(phy_dev->dev, mdevice->uuid,
+						  mdevice->instance)) {
+				mutex_unlock(&phy_devices.list_lock);
+				return;
+			}
+		}
+
+		mdev_remove_attribute_group(&mdevice->dev,
+					    phy_dev->ops->mdev_attr_groups);
+		mdevice->phy_dev = NULL;
+		mutex_unlock(&phy_devices.list_lock);
+	}
+
+	mdev_put_device(mdevice);
+	device_unregister(&mdevice->dev);
+}
+
+/*
+ * Find mediated device from given iommu_group and increment refcount of
+ * mediated device. Caller should call mdev_put_device() when the use of
+ * mdev_device is done.
+ */
+struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
+{
+	struct mdev_device *mdev = NULL, *p;
+
+	mutex_lock(&mdevices.list_lock);
+	list_for_each_entry(p, &mdevices.dev_list, next) {
+		if (!p->group)
+			continue;
+
+		if (iommu_group_id(p->group) == iommu_group_id(group)) {
+			mdev = mdev_get_device(p);
+			break;
+		}
+	}
+	mutex_unlock(&mdevices.list_lock);
+	return mdev;
+}
+EXPORT_SYMBOL_GPL(mdev_get_device_by_group);
+
+/*
+ * mdev_register_device : Register a device
+ * @dev: device structure representing physical device.
+ * @ops: Physical device operation structure to be registered.
+ *
+ * Add device to list of registered physical devices.
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_device(struct device *dev, const struct phy_device_ops *ops)
+{
+	int ret = 0;
+	struct phy_device *phy_dev, *pdev;
+
+	if (!dev || !ops)
+		return -EINVAL;
+
+	/* Check for duplicate */
+	pdev = find_physical_device(dev);
+	if (pdev)
+		return -EEXIST;
+
+	phy_dev = kzalloc(sizeof(*phy_dev), GFP_KERNEL);
+	if (!phy_dev)
+		return -ENOMEM;
+
+	phy_dev->dev = dev;
+	phy_dev->ops = ops;
+
+	mutex_lock(&phy_devices.list_lock);
+	ret = mdev_create_sysfs_files(dev);
+	if (ret)
+		goto add_sysfs_error;
+
+	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
+	if (ret)
+		goto add_group_error;
+
+	list_add(&phy_dev->next, &phy_devices.dev_list);
+	dev_info(dev, "MDEV: Registered\n");
+	mutex_unlock(&phy_devices.list_lock);
+
+	return 0;
+
+add_group_error:
+	mdev_remove_sysfs_files(dev);
+add_sysfs_error:
+	mutex_unlock(&phy_devices.list_lock);
+	kfree(phy_dev);
+	return ret;
+}
+EXPORT_SYMBOL(mdev_register_device);
+
+/*
+ * mdev_unregister_device : Unregister a physical device
+ * @dev: device structure representing physical device.
+ *
+ * Remove device from list of registered physical devices. Gives a chance to
+ * free existing mediated devices for the given physical device.
+ */
+
+void mdev_unregister_device(struct device *dev)
+{
+	struct phy_device *phy_dev;
+	struct mdev_device *vdev = NULL;
+
+	phy_dev = find_physical_device(dev);
+
+	if (!phy_dev)
+		return;
+
+	dev_info(dev, "MDEV: Unregistering\n");
+
+	while ((vdev = find_next_mdev_device(phy_dev)))
+		mdev_destroy_device(vdev);
+
+	mutex_lock(&phy_devices.list_lock);
+	list_del(&phy_dev->next);
+	mutex_unlock(&phy_devices.list_lock);
+
+	mdev_remove_attribute_group(dev,
+				    phy_dev->ops->dev_attr_groups);
+
+	mdev_remove_sysfs_files(dev);
+	kfree(phy_dev);
+}
+EXPORT_SYMBOL(mdev_unregister_device);
+
+/*
+ * Functions required for mdev-sysfs
+ */
+
+static struct mdev_device *mdev_device_alloc(uuid_le uuid, int instance)
+{
+	struct mdev_device *mdevice = NULL;
+
+	mdevice = kzalloc(sizeof(*mdevice), GFP_KERNEL);
+	if (!mdevice)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&mdevice->kref);
+	memcpy(&mdevice->uuid, &uuid, sizeof(uuid_le));
+	mdevice->instance = instance;
+	mutex_init(&mdevice->ops_lock);
+
+	return mdevice;
+}
+
+static void mdev_device_release(struct device *dev)
+{
+	struct mdev_device *mdevice = to_mdev_device(dev);
+
+	if (!mdevice)
+		return;
+
+	dev_info(&mdevice->dev, "MDEV: destroying\n");
+
+	mutex_lock(&mdevices.list_lock);
+	list_del(&mdevice->next);
+	mutex_unlock(&mdevices.list_lock);
+
+	kfree(mdevice);
+}
+
+int create_mdev_device(struct device *dev, uuid_le uuid, uint32_t instance,
+		       char *mdev_params)
+{
+	int retval = 0;
+	struct mdev_device *mdevice = NULL;
+	struct phy_device *phy_dev;
+
+	phy_dev = find_physical_device(dev);
+	if (!phy_dev)
+		return -EINVAL;
+
+	mdevice = mdev_device_alloc(uuid, instance);
+	if (IS_ERR(mdevice)) {
+		retval = PTR_ERR(mdevice);
+		return retval;
+	}
+
+	mdevice->dev.parent  = dev;
+	mdevice->dev.bus     = &mdev_bus_type;
+	mdevice->dev.release = mdev_device_release;
+	dev_set_name(&mdevice->dev, "%pUb-%d", uuid.b, instance);
+
+	mutex_lock(&mdevices.list_lock);
+	list_add(&mdevice->next, &mdevices.dev_list);
+	mutex_unlock(&mdevices.list_lock);
+
+	retval = device_register(&mdevice->dev);
+	if (retval) {
+		mdev_put_device(mdevice);
+		return retval;
+	}
+
+	mutex_lock(&phy_devices.list_lock);
+	if (phy_dev->ops->create) {
+		retval = phy_dev->ops->create(dev, mdevice->uuid,
+					      instance, mdev_params);
+		if (retval)
+			goto create_failed;
+	}
+
+	retval = mdev_add_attribute_group(&mdevice->dev,
+					  phy_dev->ops->mdev_attr_groups);
+	if (retval)
+		goto create_failed;
+
+	mdevice->phy_dev = phy_dev;
+	mutex_unlock(&phy_devices.list_lock);
+	mdev_get_device(mdevice);
+	dev_info(&mdevice->dev, "MDEV: created\n");
+
+	return retval;
+
+create_failed:
+	mutex_unlock(&phy_devices.list_lock);
+	device_unregister(&mdevice->dev);
+	return retval;
+}
+
+int destroy_mdev_device(uuid_le uuid, uint32_t instance)
+{
+	struct mdev_device *vdev;
+
+	vdev = find_mdev_device(uuid, instance);
+
+	if (!vdev)
+		return -EINVAL;
+
+	mdev_destroy_device(vdev);
+	return 0;
+}
+
+void get_mdev_supported_types(struct device *dev, char *str)
+{
+	struct phy_device *phy_dev;
+
+	phy_dev = find_physical_device(dev);
+
+	if (phy_dev) {
+		mutex_lock(&phy_devices.list_lock);
+		if (phy_dev->ops->supported_config)
+			phy_dev->ops->supported_config(phy_dev->dev, str);
+		mutex_unlock(&phy_devices.list_lock);
+	}
+}
+
+int mdev_start_callback(uuid_le uuid, uint32_t instance)
+{
+	int ret = 0;
+	struct mdev_device *mdevice;
+	struct phy_device *phy_dev;
+
+	mdevice = find_mdev_device(uuid, instance);
+
+	if (!mdevice)
+		return -EINVAL;
+
+	phy_dev = mdevice->phy_dev;
+
+	mutex_lock(&phy_devices.list_lock);
+	if (phy_dev->ops->start)
+		ret = phy_dev->ops->start(mdevice->uuid);
+	mutex_unlock(&phy_devices.list_lock);
+
+	if (ret < 0)
+		pr_err("mdev_start failed  %d\n", ret);
+	else
+		kobject_uevent(&mdevice->dev.kobj, KOBJ_ONLINE);
+
+	return ret;
+}
+
+int mdev_shutdown_callback(uuid_le uuid, uint32_t instance)
+{
+	int ret = 0;
+	struct mdev_device *mdevice;
+	struct phy_device *phy_dev;
+
+	mdevice = find_mdev_device(uuid, instance);
+
+	if (!mdevice)
+		return -EINVAL;
+
+	phy_dev = mdevice->phy_dev;
+
+	mutex_lock(&phy_devices.list_lock);
+	if (phy_dev->ops->shutdown)
+		ret = phy_dev->ops->shutdown(mdevice->uuid);
+	mutex_unlock(&phy_devices.list_lock);
+
+	if (ret < 0)
+		pr_err("mdev_shutdown failed %d\n", ret);
+	else
+		kobject_uevent(&mdevice->dev.kobj, KOBJ_OFFLINE);
+
+	return ret;
+}
+
+static struct class mdev_class = {
+	.name		= MDEV_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= mdev_class_attrs,
+};
+
+static int __init mdev_init(void)
+{
+	int rc = 0;
+
+	mutex_init(&mdevices.list_lock);
+	INIT_LIST_HEAD(&mdevices.dev_list);
+	mutex_init(&phy_devices.list_lock);
+	INIT_LIST_HEAD(&phy_devices.dev_list);
+
+	rc = class_register(&mdev_class);
+	if (rc < 0) {
+		pr_err("Failed to register mdev class\n");
+		return rc;
+	}
+
+	rc = mdev_bus_register();
+	if (rc < 0) {
+		pr_err("Failed to register mdev bus\n");
+		class_unregister(&mdev_class);
+		return rc;
+	}
+
+	return rc;
+}
+
+static void __exit mdev_exit(void)
+{
+	mdev_bus_unregister();
+	class_unregister(&mdev_class);
+}
+
+module_init(mdev_init)
+module_exit(mdev_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/mdev/mdev-driver.c b/drivers/vfio/mdev/mdev-driver.c
new file mode 100644
index 000000000000..bc8a169782bc
--- /dev/null
+++ b/drivers/vfio/mdev/mdev-driver.c
@@ -0,0 +1,139 @@
+/*
+ * MDEV driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+static int mdevice_attach_iommu(struct mdev_device *mdevice)
+{
+	int retval = 0;
+	struct iommu_group *group = NULL;
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		dev_err(&mdevice->dev, "MDEV: failed to allocate group!\n");
+		return PTR_ERR(group);
+	}
+
+	retval = iommu_group_add_device(group, &mdevice->dev);
+	if (retval) {
+		dev_err(&mdevice->dev, "MDEV: failed to add dev to group!\n");
+		goto attach_fail;
+	}
+
+	mdevice->group = group;
+
+	dev_info(&mdevice->dev, "MDEV: group_id = %d\n",
+				 iommu_group_id(group));
+attach_fail:
+	iommu_group_put(group);
+	return retval;
+}
+
+static void mdevice_detach_iommu(struct mdev_device *mdevice)
+{
+	iommu_group_remove_device(&mdevice->dev);
+	dev_info(&mdevice->dev, "MDEV: detaching iommu\n");
+}
+
+static int mdevice_probe(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdevice = to_mdev_device(dev);
+	int status = 0;
+
+	status = mdevice_attach_iommu(mdevice);
+	if (status) {
+		dev_err(dev, "Failed to attach IOMMU\n");
+		return status;
+	}
+
+	if (drv && drv->probe)
+		status = drv->probe(dev);
+
+	return status;
+}
+
+static int mdevice_remove(struct device *dev)
+{
+	struct mdev_driver *drv = to_mdev_driver(dev->driver);
+	struct mdev_device *mdevice = to_mdev_device(dev);
+
+	if (drv && drv->remove)
+		drv->remove(dev);
+
+	mdevice_detach_iommu(mdevice);
+
+	return 0;
+}
+
+static int mdevice_match(struct device *dev, struct device_driver *drv)
+{
+	int ret = 0;
+	struct mdev_driver *mdrv = to_mdev_driver(drv);
+
+	if (mdrv && mdrv->match)
+		ret = mdrv->match(dev);
+
+	return ret;
+}
+
+struct bus_type mdev_bus_type = {
+	.name		= "mdev",
+	.match		= mdevice_match,
+	.probe		= mdevice_probe,
+	.remove		= mdevice_remove,
+};
+EXPORT_SYMBOL_GPL(mdev_bus_type);
+
+/**
+ * mdev_register_driver - register a new MDEV driver
+ * @drv: the driver to register
+ * @owner: owner module of the driver to register
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &mdev_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_register_driver);
+
+/**
+ * mdev_unregister_driver - unregister MDEV driver
+ * @drv: the driver to unregister
+ *
+ */
+void mdev_unregister_driver(struct mdev_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(mdev_unregister_driver);
+
+int mdev_bus_register(void)
+{
+	return bus_register(&mdev_bus_type);
+}
+
+void mdev_bus_unregister(void)
+{
+	bus_unregister(&mdev_bus_type);
+}
diff --git a/drivers/vfio/mdev/mdev-sysfs.c b/drivers/vfio/mdev/mdev-sysfs.c
new file mode 100644
index 000000000000..79d351a7a502
--- /dev/null
+++ b/drivers/vfio/mdev/mdev-sysfs.c
@@ -0,0 +1,312 @@
+/*
+ * File attributes for Mediated devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+/* Prototypes */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static DEVICE_ATTR_RO(mdev_supported_types);
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_create);
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count);
+static DEVICE_ATTR_WO(mdev_destroy);
+
+/* Static functions */
+
+#define UUID_CHAR_LENGTH	36
+#define UUID_BYTE_LENGTH	16
+
+#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
+
+static inline bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < UUID_CHAR_LENGTH)
+		return -EINVAL;
+
+	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			pr_err("%s: invalid hex digit in UUID\n", __func__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+/* Functions */
+static ssize_t mdev_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	char *str;
+	ssize_t n;
+
+	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	get_mdev_supported_types(dev, str);
+
+	n = sprintf(buf, "%s\n", str);
+	kfree(str);
+
+	return n;
+}
+
+static ssize_t mdev_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	char *str, *pstr;
+	char *uuid_str, *instance_str, *mdev_params = NULL;
+	uuid_le uuid;
+	uint32_t instance;
+	int ret = 0;
+
+	pstr = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_create: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (!str) {
+		pr_err("mdev_create: mdev instance not present %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	instance_str = strsep(&str, ":");
+	if (!instance_str) {
+		pr_err("mdev_create: Empty instance string %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	ret = kstrtouint(instance_str, 0, &instance);
+	if (ret) {
+		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
+		goto create_error;
+	}
+
+	if (!str) {
+		pr_err("mdev_create: mdev params not specified %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	mdev_params = kstrdup(str, GFP_KERNEL);
+
+	if (!mdev_params) {
+		ret = -ENOMEM;
+		goto create_error;
+	}
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		pr_err("mdev_create: UUID parse error %s\n", buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (create_mdev_device(dev, uuid, instance, mdev_params) < 0) {
+		pr_err("mdev_create: Failed to create mdev device\n");
+		ret = -EINVAL;
+		goto create_error;
+	}
+	ret = count;
+
+create_error:
+	kfree(mdev_params);
+	kfree(pstr);
+	return ret;
+}
+
+static ssize_t mdev_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	char *uuid_str, *str, *pstr;
+	uuid_le uuid;
+	unsigned int instance;
+	int ret;
+
+	str = pstr = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	uuid_str = strsep(&str, ":");
+	if (!uuid_str) {
+		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	if (str == NULL) {
+		pr_err("mdev_destroy: instance not specified %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	ret = kstrtouint(str, 0, &instance);
+	if (ret) {
+		pr_err("mdev_destroy: instance parsing error %s\n", buf);
+		goto destroy_error;
+	}
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		pr_err("mdev_destroy: UUID parse error  %s\n", buf);
+		ret = -EINVAL;
+		goto destroy_error;
+	}
+
+	ret = destroy_mdev_device(uuid, instance);
+	if (ret < 0)
+		goto destroy_error;
+
+	ret = count;
+
+destroy_error:
+	kfree(pstr);
+	return ret;
+}
+
+ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
+			 const char *buf, size_t count)
+{
+	char *uuid_str;
+	uuid_le uuid;
+	int ret = 0;
+
+	uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		pr_err("mdev_start: UUID parse error  %s\n", buf);
+		ret = -EINVAL;
+		goto start_error;
+	}
+
+	ret = mdev_start_callback(uuid, 0);
+	if (ret < 0)
+		goto start_error;
+
+	ret = count;
+
+start_error:
+	kfree(uuid_str);
+	return ret;
+}
+
+ssize_t mdev_shutdown_store(struct class *class, struct class_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *uuid_str;
+	uuid_le uuid;
+	int ret = 0;
+
+	uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		pr_err("mdev_shutdown: UUID parse error %s\n", buf);
+		ret = -EINVAL;
+		goto shutdown_error;
+	}
+
+	ret = mdev_shutdown_callback(uuid, 0);
+	if (ret < 0)
+		goto shutdown_error;
+
+	ret = count;
+
+shutdown_error:
+	kfree(uuid_str);
+	return ret;
+}
+
+struct class_attribute mdev_class_attrs[] = {
+	__ATTR_WO(mdev_start),
+	__ATTR_WO(mdev_shutdown),
+	__ATTR_NULL
+};
+
+int mdev_create_sysfs_files(struct device *dev)
+{
+	int retval;
+
+	retval = sysfs_create_file(&dev->kobj,
+				   &dev_attr_mdev_supported_types.attr);
+	if (retval) {
+		pr_err("Failed to create mdev_supported_types sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	if (retval) {
+		pr_err("Failed to create mdev_create sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+	if (retval) {
+		pr_err("Failed to create mdev_destroy sysfs entry\n");
+		return retval;
+	}
+
+	return 0;
+}
+
+void mdev_remove_sysfs_files(struct device *dev)
+{
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
+	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
+}
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
new file mode 100644
index 000000000000..a472310c7749
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -0,0 +1,33 @@
+/*
+ * Mediated device internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_PRIVATE_H
+#define MDEV_PRIVATE_H
+
+int  mdev_bus_register(void);
+void mdev_bus_unregister(void);
+
+/* Function prototypes for mdev_sysfs */
+
+extern struct class_attribute mdev_class_attrs[];
+
+int  mdev_create_sysfs_files(struct device *dev);
+void mdev_remove_sysfs_files(struct device *dev);
+
+int  create_mdev_device(struct device *dev, uuid_le uuid, uint32_t instance,
+			char *mdev_params);
+int  destroy_mdev_device(uuid_le uuid, uint32_t instance);
+void get_mdev_supported_types(struct device *dev, char *str);
+int  mdev_start_callback(uuid_le uuid, uint32_t instance);
+int  mdev_shutdown_callback(uuid_le uuid, uint32_t instance);
+
+#endif /* MDEV_PRIVATE_H */
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
new file mode 100644
index 000000000000..d9633acd85f2
--- /dev/null
+++ b/include/linux/mdev.h
@@ -0,0 +1,224 @@
+/*
+ * Mediated device definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef MDEV_H
+#define MDEV_H
+
+/* Common Data structures */
+
+struct pci_region_info {
+	uint64_t start;
+	uint64_t size;
+	uint32_t flags;		/*!< VFIO region info flags */
+};
+
+enum mdev_emul_space {
+	EMUL_CONFIG_SPACE,	/*!< PCI configuration space */
+	EMUL_IO,		/*!< I/O register space */
+	EMUL_MMIO		/*!< Memory-mapped I/O space */
+};
+
+struct phy_device;
+
+/*
+ * Mediated device
+ */
+
+struct mdev_device {
+	struct kref		kref;
+	struct device		dev;
+	struct phy_device	*phy_dev;
+	struct iommu_group	*group;
+	void			*iommu_data;
+	uuid_le			uuid;
+	uint32_t		instance;
+	void			*driver_data;
+	struct mutex		ops_lock;
+	struct list_head	next;
+};
+
+
+/**
+ * struct phy_device_ops - Structure to be registered for each physical device
+ * to register the device to mdev module.
+ *
+ * @owner:		The module owner.
+ * @dev_attr_groups:	Default attributes of the physical device.
+ * @mdev_attr_groups:	Default attributes of the mediated device.
+ * @supported_config:	Called to get information about supported types.
+ *			@dev : device structure of physical device.
+ *			@config: should return string listing supported config
+ *			Returns integer: success (0) or error (< 0)
+ * @create:		Called to allocate basic resources in physical device's
+ *			driver for a particular mediated device
+ *			@dev: physical PCI device structure on which mediated
+ *			      device should be created
+ *			@uuid: UUID of the VM the device is intended for
+ *			@instance: mediated instance in that VM
+ *			@mdev_params: extra parameters required by physical
+ *			device's driver.
+ *			Returns integer: success (0) or error (< 0)
+ * @destroy:		Called to free resources in physical device's driver for
+ *			a mediated device instance of that VM.
+ *			@dev: physical device structure that this mediated
+ *			      device points to.
+ *			@uuid: UUID of the VM that owns the mediated device
+ *			@instance: mdev instance in that VM
+ *			Returns integer: success (0) or error (< 0)
+ *			If destroy() is called while the VM is running, the mdev
+ *			is being hot-unplugged. Return an error if the VM is
+ *			running and the driver doesn't support mdev hotplug.
+ * @start:		Called to initiate the mediated device initialization
+ *			process in the physical device's driver when the VM
+ *			boots, before QEMU starts.
+ *			@uuid: UUID of the VM that is booting.
+ *			Returns integer: success (0) or error (< 0)
+ * @shutdown:		Called to tear down mediated device related resources
+ *			for the VM
+ *			@uuid: UUID of the VM that is shutting down.
+ *			Returns integer: success (0) or error (< 0)
+ * @read:		Read emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: read buffer
+ *			@count: number of bytes to read
+ *			@address_space: specifies for which address
+ *			space the request is: pci_config_space, IO
+ *			register space or MMIO space.
+ *			@pos: offset from base address.
+ *			Returns number of bytes read on success or error.
+ * @write:		Write emulation callback
+ *			@mdev: mediated device structure
+ *			@buf: write buffer
+ *			@count: number of bytes to be written
+ *			@address_space: specifies for which address space the
+ *			request is: pci_config_space, IO register space or MMIO
+ *			space.
+ *			@pos: offset from base address.
+ *			Returns number of bytes written on success or error.
+ * @set_irqs:		Called to pass on the interrupt configuration
+ *			information that the VMM sets.
+ *			@mdev: mediated device structure
+ *			@flags, index, start, count and *data : same as that of
+ *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
+ * @get_region_info:	Called to get BAR size and flags of mediated device.
+ *			@mdev: mediated device structure
+ *			@region_index: VFIO region index
+ *			@region_info: output, returns size and flags of
+ *				      requested region.
+ *			Returns integer: success (0) or error (< 0)
+ * @validate_map_request: Validate remap pfn request
+ *			@mdev: mediated device structure
+ *			@virtaddr: target user address to start at
+ *			@pfn: physical address of kernel memory, vendor driver
+ *			      can change if required.
+ *			@size: size of map area, vendor driver can change the
+ *			       size of map area if desired.
+ *			@prot: page protection flags for this mapping, vendor
+ *			       driver can change, if required.
+ *			Returns integer: success (0) or error (< 0)
+ *
+ * A physical device that supports mediated devices should be registered with
+ * the mdev module using a phy_device_ops structure.
+ */
+
+struct phy_device_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **mdev_attr_groups;
+
+	int	(*supported_config)(struct device *dev, char *config);
+	int     (*create)(struct device *dev, uuid_le uuid,
+			  uint32_t instance, char *mdev_params);
+	int     (*destroy)(struct device *dev, uuid_le uuid,
+			   uint32_t instance);
+	int     (*start)(uuid_le uuid);
+	int     (*shutdown)(uuid_le uuid);
+	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
+			enum mdev_emul_space address_space, loff_t pos);
+	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
+			 enum mdev_emul_space address_space, loff_t pos);
+	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
+			    unsigned int index, unsigned int start,
+			    unsigned int count, void *data);
+	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
+				 struct pci_region_info *region_info);
+	int	(*validate_map_request)(struct mdev_device *vdev,
+					unsigned long virtaddr,
+					unsigned long *pfn, unsigned long *size,
+					pgprot_t *prot);
+};
+
+/*
+ * Physical Device
+ */
+struct phy_device {
+	struct device                   *dev;
+	const struct phy_device_ops     *ops;
+	struct list_head                next;
+};
+
+/**
+ * struct mdev_driver - Mediated device driver
+ * @name: driver name
+ * @probe: called when a new device is created
+ * @remove: called when a device is removed
+ * @match: called when new device or driver is added for this bus. Return 1 if
+ *	   given device can be handled by given driver and zero otherwise.
+ * @driver: device driver structure
+ *
+ **/
+struct mdev_driver {
+	const char *name;
+	int  (*probe)(struct device *dev);
+	void (*remove)(struct device *dev);
+	int  (*match)(struct device *dev);
+	struct device_driver driver;
+};
+
+static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
+}
+
+static inline struct mdev_device *to_mdev_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
+}
+
+static inline struct mdev_device *mdev_get_device(struct mdev_device *vdev)
+{
+	return (vdev && get_device(&vdev->dev)) ? vdev : NULL;
+}
+
+static inline  void mdev_put_device(struct mdev_device *vdev)
+{
+	if (vdev)
+		put_device(&vdev->dev);
+}
+
+extern struct bus_type mdev_bus_type;
+
+#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
+
+extern int  mdev_register_device(struct device *dev,
+				 const struct phy_device_ops *ops);
+extern void mdev_unregister_device(struct device *dev);
+
+extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
+extern void mdev_unregister_driver(struct mdev_driver *drv);
+
+extern int mdev_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
+				uint32_t len, uint32_t flags);
+
+extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
+
+#endif /* MDEV_H */
-- 
2.7.0
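
Note for readers of mdev-sysfs.c above: mdev_create_store() takes a string of
the form UUID:instance:params, and uuid_parse() turns the UUID part into 16 raw
bytes. The following is a minimal userspace sketch of that parsing step, not
the kernel code itself. The sketch_* names are hypothetical, and
sketch_hex_to_bin() is an assumed stand-in for the kernel's hex_to_bin().

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

#define UUID_CHAR_LENGTH 36
#define UUID_BYTE_LENGTH 16

/* Hypothetical userspace stand-in for hex_to_bin(); -1 on a non-hex char. */
static int sketch_hex_to_bin(char ch)
{
	if (ch >= '0' && ch <= '9')
		return ch - '0';
	ch = (char)tolower((unsigned char)ch);
	if (ch >= 'a' && ch <= 'f')
		return ch - 'a' + 10;
	return -1;
}

/* Same separator set as is_uuid_sep() in the patch. */
static int sketch_is_uuid_sep(char sep)
{
	return sep == '\n' || sep == '-' || sep == ':' || sep == '\0';
}

/* Parse a 36-character UUID string into 16 bytes; 0 on success, -1 on error. */
static int sketch_uuid_parse(const char *str, uint8_t out[UUID_BYTE_LENGTH])
{
	int i;

	if (strlen(str) < UUID_CHAR_LENGTH)
		return -1;

	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
		int hi = sketch_hex_to_bin(str[0]);
		int lo = sketch_hex_to_bin(str[1]);

		if (hi < 0 || lo < 0)
			return -1;

		/* Two hex digits form one byte, stored in string order. */
		out[i] = (uint8_t)((hi << 4) | lo);
		str += 2;
		if (sketch_is_uuid_sep(*str))
			str++;
	}
	return 0;
}
```

Bytes are stored in the order they appear in the string, and any input shorter
than 36 characters is rejected before parsing starts.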


* [RFC PATCH v4 2/3] VFIO driver for mediated PCI device
  2016-05-24 19:58 ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-24 19:58   ` Kirti Wankhede
From: Kirti Wankhede @ 2016-05-24 19:58 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	bjsdjshi, Kirti Wankhede

The MPCI VFIO driver registers with the MDEV core driver. The MDEV core driver
creates a mediated device and calls the probe routine of the MPCI VFIO driver,
which adds the mediated device to the VFIO core module.
The main aim of this module is to manage all VFIO APIs for each mediated PCI
device. It does the following:
- Gets region information from the vendor driver.
- Traps and emulates the PCI config space and BAR regions.
- Sends interrupt configuration information to the vendor driver.
- mmaps mappable regions with invalidated mappings, and faults on access to
  remap the pfn.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I48a34af88a9a905ec1f0f7528383c5db76c2e14d
---
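
Note on the read/write path: the driver splits the file offset it receives into
a region index (upper bits) and an offset inside that region (lower bits). A
minimal userspace sketch of that split follows. It assumes the vfio-pci
convention of a 40-bit offset shift; the SKETCH_* and sketch_* names are
hypothetical and only mirror the VFIO_PCI_OFFSET_* macros used in the diff.

```c
#include <stdint.h>

/* Assumed to mirror vfio-pci: the low 40 bits are the in-region offset. */
#define SKETCH_OFFSET_SHIFT 40
#define SKETCH_OFFSET_TO_INDEX(off) ((uint64_t)(off) >> SKETCH_OFFSET_SHIFT)
#define SKETCH_INDEX_TO_OFFSET(idx) ((uint64_t)(idx) << SKETCH_OFFSET_SHIFT)
#define SKETCH_OFFSET_MASK (((uint64_t)1 << SKETCH_OFFSET_SHIFT) - 1)

struct sketch_region_access {
	unsigned int index;	/* VFIO region index (BARs, ROM, config, VGA) */
	uint64_t offset;	/* byte offset inside that region */
};

/* Split a file position into (region index, offset within region). */
static struct sketch_region_access sketch_decode(uint64_t file_pos)
{
	struct sketch_region_access a;

	a.index = (unsigned int)SKETCH_OFFSET_TO_INDEX(file_pos);
	a.offset = file_pos & SKETCH_OFFSET_MASK;
	return a;
}
```

With this encoding, a userspace access at SKETCH_INDEX_TO_OFFSET(7) + 0x10
decodes to region index 7 and offset 0x10 within that region.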
 drivers/vfio/mdev/Kconfig           |   7 +
 drivers/vfio/mdev/Makefile          |   1 +
 drivers/vfio/mdev/vfio_mpci.c       | 648 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
 include/linux/vfio.h                |   7 +
 6 files changed, 664 insertions(+), 6 deletions(-)
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c
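
Note on mdev_read_base(): it derives the guest-programmed base address of each
memory BAR from the virtual config space. A minimal userspace sketch of that
decoding follows, assuming the standard PCI layout (BAR0 at offset 0x10, one
4-byte slot per BAR, a 64-bit BAR keeping its upper half in the next slot).
The sketch_* and SKETCH_* names are hypothetical; this is not the driver code.

```c
#include <stdint.h>

#define SKETCH_PCI_BASE_ADDRESS_0	0x10
#define SKETCH_PCI_BAR_MEM_TYPE_MASK	0x06u
#define SKETCH_PCI_BAR_MEM_TYPE_64	0x04u
#define SKETCH_PCI_BAR_MEM_MASK		(~0x0fu)

/* Read a little-endian 32-bit value from config space bytes. */
static uint32_t sketch_cfg_read32(const uint8_t *cfg, unsigned int pos)
{
	return (uint32_t)cfg[pos] |
	       ((uint32_t)cfg[pos + 1] << 8) |
	       ((uint32_t)cfg[pos + 2] << 16) |
	       ((uint32_t)cfg[pos + 3] << 24);
}

/* Return the base address programmed into memory BAR slot 'bar' (0..5). */
static uint64_t sketch_bar_base(const uint8_t *cfg, unsigned int bar)
{
	unsigned int pos = SKETCH_PCI_BASE_ADDRESS_0 + 4 * bar;
	uint32_t lo = sketch_cfg_read32(cfg, pos);
	uint64_t base = lo & SKETCH_PCI_BAR_MEM_MASK;

	/* A 64-bit BAR stores bits 63:32 in the following slot. */
	if ((lo & SKETCH_PCI_BAR_MEM_TYPE_MASK) == SKETCH_PCI_BAR_MEM_TYPE_64)
		base |= (uint64_t)sketch_cfg_read32(cfg, pos + 4) << 32;

	return base;
}
```

Computing the position from the slot number keeps the lookup correct even when
some BARs are not implemented.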

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index 951e2bb06a3f..8d9e78aaa80f 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,3 +9,10 @@ config MDEV
 
         If you don't know what do here, say N.
 
+config VFIO_MPCI
+    tristate "VFIO support for Mediated PCI devices"
+    depends on VFIO && PCI && MDEV
+    default n
+    help
+        VFIO based driver for mediated PCI devices.
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 4adb069febce..8ab38c57df21 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,4 +2,5 @@
 mdev-y := mdev-core.o mdev-sysfs.o mdev-driver.o
 
 obj-$(CONFIG_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
 
diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
new file mode 100644
index 000000000000..ef9d757ec511
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mpci.c
@@ -0,0 +1,648 @@
+/*
+ * VFIO based Mediated PCI device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC     "VFIO based Mediated PCI device driver"
+
+struct vfio_mdevice {
+	struct iommu_group *group;
+	struct mdev_device *mdevice;
+	int		    refcnt;
+	struct pci_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
+	u8		    *vconfig;
+	struct mutex	    vfio_mdev_lock;
+};
+
+static int get_virtual_bar_info(struct mdev_device *mdevice,
+				struct pci_region_info *vfio_region_info,
+				int index)
+{
+	int ret = -EINVAL;
+	struct phy_device *phy_dev = mdevice->phy_dev;
+
+	if (dev_is_pci(phy_dev->dev) && phy_dev->ops->get_region_info) {
+		mutex_lock(&mdevice->ops_lock);
+		ret = phy_dev->ops->get_region_info(mdevice, index,
+						    vfio_region_info);
+		mutex_unlock(&mdevice->ops_lock);
+	}
+	return ret;
+}
+
+static int mdev_read_base(struct vfio_mdevice *vdev)
+{
+	int index, pos;
+	u32 start_lo, start_hi;
+	u32 mem_type;
+
+	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+		if (!vdev->vfio_region_info[index].size)
+			continue;
+
+		/* BAR slots are fixed 4-byte registers, so index selects pos */
+		pos = PCI_BASE_ADDRESS_0 + 4 * index;
+
+		start_lo = (*(u32 *)(vdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_MASK;
+		mem_type = (*(u32 *)(vdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+		switch (mem_type) {
+		case PCI_BASE_ADDRESS_MEM_TYPE_64:
+			/* upper half lives in the next BAR slot */
+			start_hi = (*(u32 *)(vdev->vconfig + pos + 4));
+			break;
+		case PCI_BASE_ADDRESS_MEM_TYPE_32:
+		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+			/* 1M mem BAR treated as 32-bit BAR */
+		default:
+			/* mem unknown type treated as 32-bit BAR */
+			start_hi = 0;
+			break;
+		}
+		vdev->vfio_region_info[index].start = ((u64)start_hi << 32) |
+							start_lo;
+	}
+	return 0;
+}
+
+static int vfio_mpci_open(void *device_data)
+{
+	int ret = 0;
+	struct vfio_mdevice *vdev = device_data;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vdev->vfio_mdev_lock);
+	if (!vdev->refcnt) {
+		u8 *vconfig;
+		int index;
+		struct pci_region_info *cfg_reg;
+
+		for (index = VFIO_PCI_BAR0_REGION_INDEX;
+		     index < VFIO_PCI_NUM_REGIONS; index++) {
+			ret = get_virtual_bar_info(vdev->mdevice,
+						&vdev->vfio_region_info[index],
+						index);
+			if (ret)
+				goto open_error;
+		}
+		cfg_reg = &vdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX];
+		if (!cfg_reg->size) {
+			ret = -EINVAL;
+			goto open_error;
+		}
+
+		/* kzalloc() returns NULL on failure, never an ERR_PTR */
+		vconfig = kzalloc(cfg_reg->size, GFP_KERNEL);
+		if (!vconfig) {
+			ret = -ENOMEM;
+			goto open_error;
+		}
+
+		vdev->vconfig = vconfig;
+	}
+
+	vdev->refcnt++;
+open_error:
+
+	mutex_unlock(&vdev->vfio_mdev_lock);
+	if (ret)
+		module_put(THIS_MODULE);
+
+	return ret;
+}
+
+static void vfio_mpci_close(void *device_data)
+{
+	struct vfio_mdevice *vdev = device_data;
+
+	mutex_lock(&vdev->vfio_mdev_lock);
+	vdev->refcnt--;
+	if (!vdev->refcnt) {
+		memset(&vdev->vfio_region_info, 0,
+			sizeof(vdev->vfio_region_info));
+		kfree(vdev->vconfig);
+	}
+	mutex_unlock(&vdev->vfio_mdev_lock);
+}
+
+static int mdev_get_irq_count(struct vfio_mdevice *vdev, int irq_type)
+{
+	/* Don't support MSIX for now */
+	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
+		return -1;
+
+	return 1;
+}
+
+static long vfio_mpci_unlocked_ioctl(void *device_data,
+				     unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_mdevice *vdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vdev->vfio_region_info[info.index].size;
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = vdev->vfio_region_info[info.index].flags;
+			break;
+		case VFIO_PCI_VGA_REGION_INDEX:
+		case VFIO_PCI_ROM_REGION_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	{
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
+		case VFIO_PCI_REQ_IRQ_INDEX:
+			break;
+			/* pass thru to return error */
+		case VFIO_PCI_MSIX_IRQ_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+		info.count = mdev_get_irq_count(vdev, info.index);
+
+		if (info.count == -1)
+			return -EINVAL;
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+					VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+
+	case VFIO_DEVICE_SET_IRQS:
+	{
+		struct vfio_irq_set hdr;
+		struct mdev_device *mdevice = vdev->mdevice;
+		struct phy_device *phy_dev = vdev->mdevice->phy_dev;
+		u8 *data = NULL;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+		    VFIO_IRQ_SET_ACTION_TYPE_MASK))
+			return -EINVAL;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			size_t size;
+			int max = mdev_get_irq_count(vdev, hdr.index);
+
+			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+				size = sizeof(uint8_t);
+			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+				size = sizeof(int32_t);
+			else
+				return -EINVAL;
+
+			if (hdr.argsz - minsz < hdr.count * size ||
+			    hdr.start >= max || hdr.start + hdr.count > max)
+				return -EINVAL;
+
+			data = memdup_user((void __user *)(arg + minsz),
+					    hdr.count * size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+
+		}
+
+		if (phy_dev->ops->set_irqs) {
+			mutex_lock(&mdevice->ops_lock);
+			ret = phy_dev->ops->set_irqs(mdevice, hdr.flags,
+						     hdr.index, hdr.start,
+						     hdr.count, data);
+			mutex_unlock(&mdevice->ops_lock);
+		}
+
+		kfree(data);
+		return ret;
+	}
+
+	default:
+		return -EINVAL;
+	}
+	return ret;
+}
+
+ssize_t mdev_dev_config_rw(struct vfio_mdevice *vdev, char __user *buf,
+			   size_t count, loff_t *ppos, bool iswrite)
+{
+	struct mdev_device *mdevice = vdev->mdevice;
+	struct phy_device *phy_dev = mdevice->phy_dev;
+	int size = vdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
+	int ret = 0;
+	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= size || pos + count > size) {
+		pr_err("%s pos 0x%llx out of range\n", __func__, pos);
+		ret = -EFAULT;
+		goto config_rw_exit;
+	}
+
+	if (iswrite) {
+		char *user_data;
+
+		user_data = memdup_user(buf, count);
+		if (IS_ERR(user_data)) {
+			ret = PTR_ERR(user_data);
+			goto config_rw_exit;
+		}
+
+		if (phy_dev->ops->write) {
+			mutex_lock(&mdevice->ops_lock);
+			ret = phy_dev->ops->write(mdevice, user_data, count,
+						  EMUL_CONFIG_SPACE, pos);
+			mutex_unlock(&mdevice->ops_lock);
+		}
+
+		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);
+		kfree(user_data);
+	} else {
+		char *ret_data = kzalloc(count, GFP_KERNEL);
+
+		if (!ret_data) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (phy_dev->ops->read) {
+			mutex_lock(&mdevice->ops_lock);
+			ret = phy_dev->ops->read(mdevice, ret_data, count,
+						 EMUL_CONFIG_SPACE, pos);
+			mutex_unlock(&mdevice->ops_lock);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+				kfree(ret_data);
+				goto config_rw_exit;
+			}
+
+			memcpy((void *)(vdev->vconfig + pos),
+				(void *)ret_data, count);
+		}
+		kfree(ret_data);
+	}
+config_rw_exit:
+	return ret;
+}
+
+ssize_t mdev_dev_bar_rw(struct vfio_mdevice *vdev, char __user *buf,
+			size_t count, loff_t *ppos, bool iswrite)
+{
+	struct mdev_device *mdevice = vdev->mdevice;
+	struct phy_device *phy_dev = mdevice->phy_dev;
+	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t pos;
+	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	int ret = 0;
+
+	if (!vdev->vfio_region_info[bar_index].start) {
+		ret = mdev_read_base(vdev);
+		if (ret)
+			goto bar_rw_exit;
+	}
+
+	if (offset >= vdev->vfio_region_info[bar_index].size) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	pos = vdev->vfio_region_info[bar_index].start + offset;
+	if (iswrite) {
+		char *user_data;
+
+		user_data = memdup_user(buf, count);
+		if (IS_ERR(user_data)) {
+			ret = PTR_ERR(user_data);
+			goto bar_rw_exit;
+		}
+
+		if (phy_dev->ops->write) {
+			mutex_lock(&mdevice->ops_lock);
+			ret = phy_dev->ops->write(mdevice,  user_data, count,
+						  EMUL_MMIO, pos);
+			mutex_unlock(&mdevice->ops_lock);
+		}
+
+		kfree(user_data);
+	} else {
+		char *ret_data = kzalloc(count, GFP_KERNEL);
+
+		if (!ret_data) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		if (phy_dev->ops->read) {
+			mutex_lock(&mdevice->ops_lock);
+			ret = phy_dev->ops->read(mdevice, ret_data, count,
+						 EMUL_MMIO, pos);
+			mutex_unlock(&mdevice->ops_lock);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret))
+				ret = -EFAULT;
+		}
+		kfree(ret_data);
+	}
+
+bar_rw_exit:
+	return ret;
+}
+
+
+static ssize_t mdev_dev_rw(void *device_data, char __user *buf,
+			   size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_mdevice *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		return mdev_dev_config_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+		return mdev_dev_bar_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_ROM_REGION_INDEX:
+	case VFIO_PCI_VGA_REGION_INDEX:
+		break;
+	}
+
+	return -EINVAL;
+}
+
+
+static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = mdev_dev_rw(device_data, buf, count, ppos, false);
+
+	return ret;
+}
+
+static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = mdev_dev_rw(device_data, (char *)buf, count, ppos, true);
+
+	return ret;
+}
+
+static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	int ret = 0;
+	struct vfio_mdevice *vdev = vma->vm_private_data;
+	struct mdev_device *mdevice;
+	struct phy_device *phy_dev;
+	u64 virtaddr = (u64)vmf->virtual_address;
+	u64 offset, phyaddr;
+	unsigned long req_size, pgoff;
+	pgprot_t pg_prot;
+
+	if (!vdev || !vdev->mdevice)
+		return -EINVAL;
+
+	mdevice = vdev->mdevice;
+	phy_dev  = mdevice->phy_dev;
+
+	offset   = vma->vm_pgoff << PAGE_SHIFT;
+	phyaddr  = virtaddr - vma->vm_start + offset;
+	pgoff    = phyaddr >> PAGE_SHIFT;
+	req_size = vma->vm_end - virtaddr;
+	pg_prot  = vma->vm_page_prot;
+
+	if (phy_dev->ops->validate_map_request) {
+		mutex_lock(&mdevice->ops_lock);
+		ret = phy_dev->ops->validate_map_request(mdevice, virtaddr,
+							 &pgoff, &req_size,
+							 &pg_prot);
+		mutex_unlock(&mdevice->ops_lock);
+		if (ret)
+			return ret;
+
+		if (!req_size)
+			return -EINVAL;
+	}
+
+	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
+
+	return ret ? VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct mdev_dev_mmio_ops = {
+	.fault = mdev_dev_mmio_fault,
+};
+
+
+static int vfio_mpci_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	unsigned int index;
+	struct vfio_mdevice *vdev = device_data;
+	struct mdev_device *mdevice = vdev->mdevice;
+	struct pci_dev *pdev;
+	unsigned long pgoff;
+	loff_t offset;
+
+	if (!dev_is_pci(mdevice->phy_dev->dev))
+		return -EINVAL;
+
+	pdev = to_pci_dev(mdevice->phy_dev->dev);
+
+	offset = vma->vm_pgoff << PAGE_SHIFT;
+
+	index = VFIO_PCI_OFFSET_TO_INDEX(offset);
+
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+	vma->vm_private_data = vdev;
+	vma->vm_ops = &mdev_dev_mmio_ops;
+
+	return 0;
+}
+
+static const struct vfio_device_ops vfio_mpci_dev_ops = {
+	.name		= "vfio-mpci",
+	.open		= vfio_mpci_open,
+	.release	= vfio_mpci_close,
+	.ioctl		= vfio_mpci_unlocked_ioctl,
+	.read		= vfio_mpci_read,
+	.write		= vfio_mpci_write,
+	.mmap		= vfio_mpci_mmap,
+};
+
+int vfio_mpci_probe(struct device *dev)
+{
+	struct vfio_mdevice *vdev;
+	struct mdev_device *mdevice = to_mdev_device(dev);
+	int ret = 0;
+
+	if (mdevice == NULL)
+		return -EINVAL;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev)
+		return -ENOMEM;
+
+	vdev->mdevice = mdevice;
+	vdev->group = mdevice->group;
+	mutex_init(&vdev->vfio_mdev_lock);
+
+	ret = vfio_add_group_dev(dev, &vfio_mpci_dev_ops, vdev);
+	if (ret)
+		kfree(vdev);
+
+	return ret;
+}
+
+void vfio_mpci_remove(struct device *dev)
+{
+	struct vfio_mdevice *vdev;
+
+	vdev = vfio_del_group_dev(dev);
+	kfree(vdev);
+}
+
+int vfio_mpci_match(struct device *dev)
+{
+	if (dev_is_pci(dev->parent))
+		return 1;
+
+	return 0;
+}
+
+struct mdev_driver vfio_mpci_driver = {
+	.name	= "vfio_mpci",
+	.probe	= vfio_mpci_probe,
+	.remove	= vfio_mpci_remove,
+	.match	= vfio_mpci_match,
+};
+
+static int __init vfio_mpci_init(void)
+{
+	return mdev_register_driver(&vfio_mpci_driver, THIS_MODULE);
+}
+
+static void __exit vfio_mpci_exit(void)
+{
+	mdev_unregister_driver(&vfio_mpci_driver);
+}
+
+module_init(vfio_mpci_init)
+module_exit(vfio_mpci_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 8a7d546d18a0..04a450908ffb 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -19,12 +19,6 @@
 #ifndef VFIO_PCI_PRIVATE_H
 #define VFIO_PCI_PRIVATE_H
 
-#define VFIO_PCI_OFFSET_SHIFT   40
-
-#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
-
 /* Special capability IDs predefined access */
 #define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
 #define PCI_CAP_ID_INVALID_VIRT		0xFE	/* default virt access */
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 5ffd1d9ad4bd..5b912be9d9c3 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>
 #include <linux/io.h>
 #include <linux/vgaarb.h>
+#include <linux/vfio.h>
 
 #include "vfio_pci_private.h"
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b1cd34..431b824b0d3e 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -18,6 +18,13 @@
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
 
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
  *
-- 
2.7.0



* [Qemu-devel] [RFC PATCH v4 2/3] VFIO driver for mediated PCI device
@ 2016-05-24 19:58   ` Kirti Wankhede
From: Kirti Wankhede @ 2016-05-24 19:58 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	bjsdjshi, Kirti Wankhede

The VFIO driver registers with the MDEV core driver. The MDEV core driver
creates a mediated device and calls the probe routine of the MPCI VFIO
driver, which then adds the mediated device to the VFIO core module.
The main aim of this module is to manage all VFIO APIs for each mediated
PCI device. It does the following:
- Gets region information from the vendor driver.
- Traps and emulates the PCI config space and BAR regions.
- Sends interrupt configuration information to the vendor driver.
- mmaps a mappable region with the mapping initially invalid, so that an
  access faults and the fault handler remaps the pfn.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I48a34af88a9a905ec1f0f7528383c5db76c2e14d
---
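Note for reviewers: the read/write path in this patch splits the file
offset into a region index (upper bits) and an offset within that region
(lower 40 bits), using the VFIO_PCI_OFFSET_* macros that this patch moves
into include/linux/vfio.h. The following is a minimal user-space sketch of
that encoding, not code from the patch. The helper names and the
SKETCH_CONFIG_REGION value are illustrative assumptions; only the shift of
40 is taken from the macros above.

```c
#include <assert.h>
#include <stdint.h>

/* Same shift as VFIO_PCI_OFFSET_SHIFT in the patch. */
#define SKETCH_OFFSET_SHIFT	40
#define SKETCH_OFFSET_MASK	((UINT64_C(1) << SKETCH_OFFSET_SHIFT) - 1)

/* Assumed region index for PCI config space (illustrative only). */
#define SKETCH_CONFIG_REGION	7u

/* Upper bits of the file offset select the region. */
static inline unsigned int offset_to_index(uint64_t off)
{
	return (unsigned int)(off >> SKETCH_OFFSET_SHIFT);
}

/* Base file offset at which a region starts. */
static inline uint64_t index_to_offset(unsigned int index)
{
	return (uint64_t)index << SKETCH_OFFSET_SHIFT;
}

/* Lower 40 bits give the position inside the selected region. */
static inline uint64_t offset_in_region(uint64_t off)
{
	return off & SKETCH_OFFSET_MASK;
}
```

For example, a pread() at index_to_offset(SKETCH_CONFIG_REGION) + 0x10
would be dispatched to the config-space handler with position 0x10.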
 drivers/vfio/mdev/Kconfig           |   7 +
 drivers/vfio/mdev/Makefile          |   1 +
 drivers/vfio/mdev/vfio_mpci.c       | 648 ++++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
 include/linux/vfio.h                |   7 +
 6 files changed, 664 insertions(+), 6 deletions(-)
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index 951e2bb06a3f..8d9e78aaa80f 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,3 +9,10 @@ config MDEV
 
         If you don't know what do here, say N.
 
+config VFIO_MPCI
+    tristate "VFIO support for Mediated PCI devices"
+    depends on VFIO && PCI && MDEV
+    default n
+    help
+        VFIO based driver for mediated PCI devices.
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 4adb069febce..8ab38c57df21 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,4 +2,5 @@
 mdev-y := mdev-core.o mdev-sysfs.o mdev-driver.o
 
 obj-$(CONFIG_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
 
diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
new file mode 100644
index 000000000000..ef9d757ec511
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mpci.c
@@ -0,0 +1,648 @@
+/*
+ * VFIO based Mediated PCI device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/mdev.h>
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC     "VFIO based Mediated PCI device driver"
+
+struct vfio_mdevice {
+	struct iommu_group *group;
+	struct mdev_device *mdevice;
+	int		    refcnt;
+	struct pci_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
+	u8		    *vconfig;
+	struct mutex	    vfio_mdev_lock;
+};
+
+static int get_virtual_bar_info(struct mdev_device *mdevice,
+				struct pci_region_info *vfio_region_info,
+				int index)
+{
+	int ret = -EINVAL;
+	struct phy_device *phy_dev = mdevice->phy_dev;
+
+	if (dev_is_pci(phy_dev->dev) && phy_dev->ops->get_region_info) {
+		mutex_lock(&mdevice->ops_lock);
+		ret = phy_dev->ops->get_region_info(mdevice, index,
+						    vfio_region_info);
+		mutex_unlock(&mdevice->ops_lock);
+	}
+	return ret;
+}
+
+static int mdev_read_base(struct vfio_mdevice *vdev)
+{
+	int index, pos;
+	u32 start_lo, start_hi;
+	u32 mem_type;
+
+	pos = PCI_BASE_ADDRESS_0;
+
+	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+		if (!vdev->vfio_region_info[index].size)
+			continue;
+
+		start_lo = (*(u32 *)(vdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_MASK;
+		mem_type = (*(u32 *)(vdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+		switch (mem_type) {
+		case PCI_BASE_ADDRESS_MEM_TYPE_64:
+			start_hi = (*(u32 *)(vdev->vconfig + pos + 4));
+			pos += 4;
+			break;
+		case PCI_BASE_ADDRESS_MEM_TYPE_32:
+		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+			/* 1M mem BAR treated as 32-bit BAR */
+		default:
+			/* mem unknown type treated as 32-bit BAR */
+			start_hi = 0;
+			break;
+		}
+		pos += 4;
+		vdev->vfio_region_info[index].start = ((u64)start_hi << 32) |
+							start_lo;
+	}
+	return 0;
+}
+
+static int vfio_mpci_open(void *device_data)
+{
+	int ret = 0;
+	struct vfio_mdevice *vdev = device_data;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vdev->vfio_mdev_lock);
+	if (!vdev->refcnt) {
+		u8 *vconfig;
+		int index;
+		struct pci_region_info *cfg_reg;
+
+		for (index = VFIO_PCI_BAR0_REGION_INDEX;
+		     index < VFIO_PCI_NUM_REGIONS; index++) {
+			ret = get_virtual_bar_info(vdev->mdevice,
+						&vdev->vfio_region_info[index],
+						index);
+			if (ret)
+				goto open_error;
+		}
+		cfg_reg = &vdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX];
+		ret = cfg_reg->size ? 0 : -EINVAL;
+		if (ret)
+			goto open_error;
+		vconfig = kzalloc(cfg_reg->size, GFP_KERNEL);
+		if (!vconfig) {
+			ret = -ENOMEM;
+			goto open_error;
+		}
+
+		vdev->vconfig = vconfig;
+	}
+
+	vdev->refcnt++;
+open_error:
+
+	mutex_unlock(&vdev->vfio_mdev_lock);
+	if (ret)
+		module_put(THIS_MODULE);
+
+	return ret;
+}
+
+static void vfio_mpci_close(void *device_data)
+{
+	struct vfio_mdevice *vdev = device_data;
+
+	mutex_lock(&vdev->vfio_mdev_lock);
+	vdev->refcnt--;
+	if (!vdev->refcnt) {
+		memset(&vdev->vfio_region_info, 0,
+			sizeof(vdev->vfio_region_info));
+		kfree(vdev->vconfig);
+	}
+	mutex_unlock(&vdev->vfio_mdev_lock);
+}
+
+static int mdev_get_irq_count(struct vfio_mdevice *vdev, int irq_type)
+{
+	/* Don't support MSIX for now */
+	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
+		return -1;
+
+	return 1;
+}
+
+static long vfio_mpci_unlocked_ioctl(void *device_data,
+				     unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_mdevice *vdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vdev->vfio_region_info[info.index].size;
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = vdev->vfio_region_info[info.index].flags;
+			break;
+		case VFIO_PCI_VGA_REGION_INDEX:
+		case VFIO_PCI_ROM_REGION_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+
+	}
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	{
+		struct vfio_irq_info info;
+
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
+		case VFIO_PCI_REQ_IRQ_INDEX:
+			break;
+			/* pass thru to return error */
+		case VFIO_PCI_MSIX_IRQ_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		info.count = VFIO_PCI_NUM_IRQS;
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+		info.count = mdev_get_irq_count(vdev, info.index);
+
+		if (info.count == -1)
+			return -EINVAL;
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+					VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+
+	case VFIO_DEVICE_SET_IRQS:
+	{
+		struct vfio_irq_set hdr;
+		struct mdev_device *mdevice = vdev->mdevice;
+		struct phy_device *phy_dev = vdev->mdevice->phy_dev;
+		u8 *data = NULL;
+		int ret = 0;
+
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+		    VFIO_IRQ_SET_ACTION_TYPE_MASK))
+			return -EINVAL;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			size_t size;
+			int max = mdev_get_irq_count(vdev, hdr.index);
+
+			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+				size = sizeof(uint8_t);
+			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+				size = sizeof(int32_t);
+			else
+				return -EINVAL;
+
+			if (hdr.argsz - minsz < hdr.count * size ||
+			    hdr.start >= max || hdr.start + hdr.count > max)
+				return -EINVAL;
+
+			data = memdup_user((void __user *)(arg + minsz),
+					    hdr.count * size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+
+		}
+
+		if (phy_dev->ops->set_irqs) {
+			mutex_lock(&mdevice->ops_lock);
+			ret = phy_dev->ops->set_irqs(mdevice, hdr.flags,
+						     hdr.index, hdr.start,
+						     hdr.count, data);
+			mutex_unlock(&mdevice->ops_lock);
+		}
+
+		kfree(data);
+		return ret;
+	}
+
+	default:
+		return -EINVAL;
+	}
+	return ret;
+}
+
+ssize_t mdev_dev_config_rw(struct vfio_mdevice *vdev, char __user *buf,
+			   size_t count, loff_t *ppos, bool iswrite)
+{
+	struct mdev_device *mdevice = vdev->mdevice;
+	struct phy_device *phy_dev = mdevice->phy_dev;
+	int size = vdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
+	int ret = 0;
+	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos < 0 || pos >= size ||
+	    pos + count > size) {
+		pr_err("%s pos 0x%llx out of range\n", __func__, pos);
+		ret = -EFAULT;
+		goto config_rw_exit;
+	}
+
+	if (iswrite) {
+		char *user_data;
+
+		user_data = memdup_user(buf, count);
+		if (IS_ERR(user_data)) {
+			ret = PTR_ERR(user_data);
+			goto config_rw_exit;
+		}
+
+		if (phy_dev->ops->write) {
+			mutex_lock(&mdevice->ops_lock);
+			ret = phy_dev->ops->write(mdevice, user_data, count,
+						  EMUL_CONFIG_SPACE, pos);
+			mutex_unlock(&mdevice->ops_lock);
+		}
+
+		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);
+		kfree(user_data);
+	} else {
+		char *ret_data = kzalloc(count, GFP_KERNEL);
+
+		if (!ret_data) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (phy_dev->ops->read) {
+			mutex_lock(&mdevice->ops_lock);
+			ret = phy_dev->ops->read(mdevice, ret_data, count,
+						 EMUL_CONFIG_SPACE, pos);
+			mutex_unlock(&mdevice->ops_lock);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+				kfree(ret_data);
+				goto config_rw_exit;
+			}
+
+			memcpy((void *)(vdev->vconfig + pos),
+				(void *)ret_data, count);
+		}
+		kfree(ret_data);
+	}
+config_rw_exit:
+	return ret;
+}
+
+ssize_t mdev_dev_bar_rw(struct vfio_mdevice *vdev, char __user *buf,
+			size_t count, loff_t *ppos, bool iswrite)
+{
+	struct mdev_device *mdevice = vdev->mdevice;
+	struct phy_device *phy_dev = mdevice->phy_dev;
+	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t pos;
+	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	int ret = 0;
+
+	if (!vdev->vfio_region_info[bar_index].start) {
+		ret = mdev_read_base(vdev);
+		if (ret)
+			goto bar_rw_exit;
+	}
+
+	if (offset >= vdev->vfio_region_info[bar_index].size) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	pos = vdev->vfio_region_info[bar_index].start + offset;
+	if (iswrite) {
+		char *user_data;
+
+		user_data = memdup_user(buf, count);
+		if (IS_ERR(user_data)) {
+			ret = PTR_ERR(user_data);
+			goto bar_rw_exit;
+		}
+
+		if (phy_dev->ops->write) {
+			mutex_lock(&mdevice->ops_lock);
+			ret = phy_dev->ops->write(mdevice,  user_data, count,
+						  EMUL_MMIO, pos);
+			mutex_unlock(&mdevice->ops_lock);
+		}
+
+		kfree(user_data);
+	} else {
+		char *ret_data = kzalloc(count, GFP_KERNEL);
+
+		if (!ret_data) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (phy_dev->ops->read) {
+			mutex_lock(&mdevice->ops_lock);
+			ret = phy_dev->ops->read(mdevice, ret_data, count,
+						 EMUL_MMIO, pos);
+			mutex_unlock(&mdevice->ops_lock);
+		}
+
+		if (ret > 0) {
+			if (copy_to_user(buf, ret_data, ret))
+				ret = -EFAULT;
+		}
+		kfree(ret_data);
+	}
+
+bar_rw_exit:
+	return ret;
+}
+
+
+static ssize_t mdev_dev_rw(void *device_data, char __user *buf,
+			   size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_mdevice *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		return mdev_dev_config_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+		return mdev_dev_bar_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_ROM_REGION_INDEX:
+	case VFIO_PCI_VGA_REGION_INDEX:
+		break;
+	}
+
+	return -EINVAL;
+}
+
+
+static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = mdev_dev_rw(device_data, buf, count, ppos, false);
+
+	return ret;
+}
+
+static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = mdev_dev_rw(device_data, (char *)buf, count, ppos, true);
+
+	return ret;
+}
+
+static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	int ret = 0;
+	struct vfio_mdevice *vdev = vma->vm_private_data;
+	struct mdev_device *mdevice;
+	struct phy_device *phy_dev;
+	u64 virtaddr = (u64)vmf->virtual_address;
+	u64 offset, phyaddr;
+	unsigned long req_size, pgoff;
+	pgprot_t pg_prot;
+
+	if (!vdev || !vdev->mdevice)
+		return -EINVAL;
+
+	mdevice = vdev->mdevice;
+	phy_dev  = mdevice->phy_dev;
+
+	offset   = vma->vm_pgoff << PAGE_SHIFT;
+	phyaddr  = virtaddr - vma->vm_start + offset;
+	pgoff    = phyaddr >> PAGE_SHIFT;
+	req_size = vma->vm_end - virtaddr;
+	pg_prot  = vma->vm_page_prot;
+
+	if (phy_dev->ops->validate_map_request) {
+		mutex_lock(&mdevice->ops_lock);
+		ret = phy_dev->ops->validate_map_request(mdevice, virtaddr,
+							 &pgoff, &req_size,
+							 &pg_prot);
+		mutex_unlock(&mdevice->ops_lock);
+		if (ret)
+			return ret;
+
+		if (!req_size)
+			return -EINVAL;
+	}
+
+	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
+
+	return ret ? VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct mdev_dev_mmio_ops = {
+	.fault = mdev_dev_mmio_fault,
+};
+
+
+static int vfio_mpci_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	unsigned int index;
+	struct vfio_mdevice *vdev = device_data;
+	struct mdev_device *mdevice = vdev->mdevice;
+	struct pci_dev *pdev;
+	unsigned long pgoff;
+	loff_t offset;
+
+	if (!dev_is_pci(mdevice->phy_dev->dev))
+		return -EINVAL;
+
+	pdev = to_pci_dev(mdevice->phy_dev->dev);
+
+	offset = vma->vm_pgoff << PAGE_SHIFT;
+
+	index = VFIO_PCI_OFFSET_TO_INDEX(offset);
+
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+	vma->vm_private_data = vdev;
+	vma->vm_ops = &mdev_dev_mmio_ops;
+
+	return 0;
+}
+
+static const struct vfio_device_ops vfio_mpci_dev_ops = {
+	.name		= "vfio-mpci",
+	.open		= vfio_mpci_open,
+	.release	= vfio_mpci_close,
+	.ioctl		= vfio_mpci_unlocked_ioctl,
+	.read		= vfio_mpci_read,
+	.write		= vfio_mpci_write,
+	.mmap		= vfio_mpci_mmap,
+};
+
+int vfio_mpci_probe(struct device *dev)
+{
+	struct vfio_mdevice *vdev;
+	struct mdev_device *mdevice = to_mdev_device(dev);
+	int ret = 0;
+
+	if (mdevice == NULL)
+		return -EINVAL;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev)
+		return -ENOMEM;
+
+	vdev->mdevice = mdevice;
+	vdev->group = mdevice->group;
+	mutex_init(&vdev->vfio_mdev_lock);
+
+	ret = vfio_add_group_dev(dev, &vfio_mpci_dev_ops, vdev);
+	if (ret)
+		kfree(vdev);
+
+	return ret;
+}
+
+void vfio_mpci_remove(struct device *dev)
+{
+	struct vfio_mdevice *vdev;
+
+	vdev = vfio_del_group_dev(dev);
+	kfree(vdev);
+}
+
+int vfio_mpci_match(struct device *dev)
+{
+	if (dev_is_pci(dev->parent))
+		return 1;
+
+	return 0;
+}
+
+struct mdev_driver vfio_mpci_driver = {
+	.name	= "vfio_mpci",
+	.probe	= vfio_mpci_probe,
+	.remove	= vfio_mpci_remove,
+	.match	= vfio_mpci_match,
+};
+
+static int __init vfio_mpci_init(void)
+{
+	return mdev_register_driver(&vfio_mpci_driver, THIS_MODULE);
+}
+
+static void __exit vfio_mpci_exit(void)
+{
+	mdev_unregister_driver(&vfio_mpci_driver);
+}
+
+module_init(vfio_mpci_init)
+module_exit(vfio_mpci_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 8a7d546d18a0..04a450908ffb 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -19,12 +19,6 @@
 #ifndef VFIO_PCI_PRIVATE_H
 #define VFIO_PCI_PRIVATE_H
 
-#define VFIO_PCI_OFFSET_SHIFT   40
-
-#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
-#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
-
 /* Special capability IDs predefined access */
 #define PCI_CAP_ID_INVALID		0xFF	/* default raw access */
 #define PCI_CAP_ID_INVALID_VIRT		0xFE	/* default virt access */
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 5ffd1d9ad4bd..5b912be9d9c3 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -18,6 +18,7 @@
 #include <linux/uaccess.h>
 #include <linux/io.h>
 #include <linux/vgaarb.h>
+#include <linux/vfio.h>
 
 #include "vfio_pci_private.h"
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b1cd34..431b824b0d3e 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -18,6 +18,13 @@
 #include <linux/poll.h>
 #include <uapi/linux/vfio.h>
 
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+
 /**
  * struct vfio_device_ops - VFIO bus driver device callbacks
  *
-- 
2.7.0


* [RFC PATCH v4 3/3] VFIO Type1 IOMMU: Add support for mediated devices
  2016-05-24 19:58 ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-24 19:58   ` Kirti Wankhede
From: Kirti Wankhede @ 2016-05-24 19:58 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	bjsdjshi, Kirti Wankhede

The VFIO Type1 IOMMU driver is designed for devices that are IOMMU
capable. A mediated device only uses the IOMMU Type1 API; the underlying
hardware can be managed by an IOMMU domain.

This change exports functions to pin and unpin pages for mediated devices.
It tracks the pinned pages of the mediated domain. This data is used to
verify unpin requests and to unpin any remaining pages from detach_group().

The aims of this change are:
- To reuse most of the IOMMU driver code for mediated devices
- To support directly assigned devices and mediated devices with a single
  module

This version keeps the mediated domain structure out of domain_list.

Tested by assigning the following combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- Two GPU pass through

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I9c262abc9c68fd6abf52d91a636bf0cc631593a0
---
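Note for reviewers: the pinned-page bookkeeping described above can be
sketched in user space as below. This is only an illustration of the
idea (count references per host pfn, reject unpin requests for pfns that
were never pinned, release leftovers on detach). It deliberately uses a
singly linked list instead of the rb-tree used in the patch, and all
names (pfn_tracker, tracker_pin, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* One tracked host pfn with its pin count. */
struct pfn_entry {
	unsigned long pfn;
	unsigned int ref_count;
	struct pfn_entry *next;
};

struct pfn_tracker {
	struct pfn_entry *head;
};

/* Look up a pfn; returns NULL if it is not currently pinned. */
static struct pfn_entry *tracker_find(struct pfn_tracker *t, unsigned long pfn)
{
	struct pfn_entry *e;

	for (e = t->head; e; e = e->next)
		if (e->pfn == pfn)
			return e;
	return NULL;
}

/* Pin: bump the count if already tracked, else add a new entry. */
static int tracker_pin(struct pfn_tracker *t, unsigned long pfn)
{
	struct pfn_entry *e = tracker_find(t, pfn);

	if (e) {
		e->ref_count++;
		return 0;
	}
	e = calloc(1, sizeof(*e));
	if (!e)
		return -1;
	e->pfn = pfn;
	e->ref_count = 1;
	e->next = t->head;
	t->head = e;
	return 0;
}

/* Unpin: fails with -1 when the pfn was never pinned. */
static int tracker_unpin(struct pfn_tracker *t, unsigned long pfn)
{
	struct pfn_entry **link = &t->head;

	while (*link && (*link)->pfn != pfn)
		link = &(*link)->next;
	if (!*link)
		return -1;
	if (--(*link)->ref_count == 0) {
		struct pfn_entry *dead = *link;

		*link = dead->next;
		free(dead);
	}
	return 0;
}

/* Detach: drop every remaining entry; returns how many were left. */
static unsigned long tracker_drain(struct pfn_tracker *t)
{
	unsigned long n = 0;

	while (t->head) {
		struct pfn_entry *dead = t->head;

		t->head = dead->next;
		free(dead);
		n++;
	}
	return n;
}
```

A caller would pin on each map request, unpin on each unmap request, and
call tracker_drain() once when the group is detached.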
 drivers/vfio/vfio_iommu_type1.c | 433 +++++++++++++++++++++++++++++++++++++---
 include/linux/vfio.h            |   6 +
 2 files changed, 407 insertions(+), 32 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e93cedb..5cc7dc0288a3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/workqueue.h>
+#include <linux/mdev.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -55,6 +56,7 @@ MODULE_PARM_DESC(disable_hugepages,
 
 struct vfio_iommu {
 	struct list_head	domain_list;
+	struct vfio_domain	*mediated_domain;
 	struct mutex		lock;
 	struct rb_root		dma_list;
 	bool			v2;
@@ -67,6 +69,13 @@ struct vfio_domain {
 	struct list_head	group_list;
 	int			prot;		/* IOMMU_CACHE */
 	bool			fgsp;		/* Fine-grained super pages */
+
+	/* Domain for mediated device which is without physical IOMMU */
+	bool			mediated_device;
+
+	struct mm_struct	*mm;
+	struct rb_root		pfn_list;	/* pinned Host pfn list */
+	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
 };
 
 struct vfio_dma {
@@ -79,10 +88,23 @@ struct vfio_dma {
 
 struct vfio_group {
 	struct iommu_group	*iommu_group;
+	struct mdev_device	*mdevice;
 	struct list_head	next;
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_pfn {
+	struct rb_node		node;
+	unsigned long		vaddr;		/* virtual addr */
+	dma_addr_t		iova;		/* IOVA */
+	unsigned long		npage;		/* number of pages */
+	unsigned long		pfn;		/* Host pfn */
+	size_t			prot;
+	atomic_t		ref_count;
+};
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +152,64 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
 	rb_erase(&old->node, &iommu->dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
+				      unsigned long pfn)
+{
+	struct rb_node *node;
+	struct vfio_pfn *vpfn, *ret = NULL;
+
+	mutex_lock(&domain->pfn_list_lock);
+	node = domain->pfn_list.rb_node;
+
+	while (node) {
+		vpfn = rb_entry(node, struct vfio_pfn, node);
+
+		if (pfn < vpfn->pfn)
+			node = node->rb_left;
+		else if (pfn > vpfn->pfn)
+			node = node->rb_right;
+		else {
+			ret = vpfn;
+			break;
+		}
+	}
+
+	mutex_unlock(&domain->pfn_list_lock);
+	return ret;
+}
+
+static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
+{
+	struct rb_node **link, *parent = NULL;
+	struct vfio_pfn *vpfn;
+
+	mutex_lock(&domain->pfn_list_lock);
+	link = &domain->pfn_list.rb_node;
+	while (*link) {
+		parent = *link;
+		vpfn = rb_entry(parent, struct vfio_pfn, node);
+
+		if (new->pfn < vpfn->pfn)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &domain->pfn_list);
+	mutex_unlock(&domain->pfn_list_lock);
+}
+
+/* call by holding domain->pfn_list_lock */
+static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
+{
+	rb_erase(&old->node, &domain->pfn_list);
+}
+
 struct vwork {
 	struct mm_struct	*mm;
 	long			npage;
@@ -228,20 +308,29 @@ static int put_pfn(unsigned long pfn, int prot)
 	return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+			 int prot, unsigned long *pfn)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct mm_struct *local_mm = mm;
 	int ret = -EFAULT;
 
-	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+	if (!local_mm && !current->mm)
+		return -ENODEV;
+
+	if (!local_mm)
+		local_mm = current->mm;
+
+	down_read(&local_mm->mmap_sem);
+	if (get_user_pages_remote(NULL, local_mm, vaddr, 1,
+				!!(prot & IOMMU_WRITE), 0, page, NULL) == 1) {
 		*pfn = page_to_pfn(page[0]);
-		return 0;
+		ret = 0;
+		goto done_pfn;
 	}
 
-	down_read(&current->mm->mmap_sem);
-
-	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
 		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
@@ -249,7 +338,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
 			ret = 0;
 	}
 
-	up_read(&current->mm->mmap_sem);
+done_pfn:
+	up_read(&local_mm->mmap_sem);
 
 	return ret;
 }
@@ -259,18 +349,19 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages(unsigned long vaddr, long npage,
-			   int prot, unsigned long *pfn_base)
+static long vfio_pin_pages_internal(struct vfio_domain *domain,
+				    unsigned long vaddr, long npage,
+				    int prot, unsigned long *pfn_base)
 {
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	bool lock_cap = capable(CAP_IPC_LOCK);
 	long ret, i;
 	bool rsvd;
 
-	if (!current->mm)
+	if (!domain)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
+	ret = vaddr_get_pfn(domain->mm, vaddr, prot, pfn_base);
 	if (ret)
 		return ret;
 
@@ -293,7 +384,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
 		unsigned long pfn = 0;
 
-		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		ret = vaddr_get_pfn(domain->mm, vaddr, prot, &pfn);
 		if (ret)
 			break;
 
@@ -318,20 +409,165 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	return i;
 }
 
-static long vfio_unpin_pages(unsigned long pfn, long npage,
-			     int prot, bool do_accounting)
+static long vfio_unpin_pages_internal(struct vfio_domain *domain,
+				      unsigned long pfn, long npage, int prot,
+				      bool do_accounting)
 {
 	unsigned long unlocked = 0;
 	long i;
 
+	if (!domain)
+		return -ENODEV;
+
 	for (i = 0; i < npage; i++)
 		unlocked += put_pfn(pfn++, prot);
 
 	if (do_accounting)
 		vfio_lock_acct(-unlocked);
+	return unlocked;
+}
+
+/*
+ * Pin a set of guest PFNs and return their associated host PFNs for API
+ * supported domain only.
+ * @vaddr [in]: array of guest PFNs
+ * @npage [in]: count of array elements
+ * @prot [in] : protection flags
+ * @pfn_base[out] : array of host PFNs
+ */
+long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
+		   int prot, dma_addr_t *pfn_base)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	int i = 0, ret = 0;
+	long retpage;
+	unsigned long remote_vaddr = 0;
+	dma_addr_t *pfn = pfn_base;
+	struct vfio_dma *dma;
+
+	if (!iommu || !vaddr || !pfn_base)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (!iommu->mediated_domain) {
+		ret = -EINVAL;
+		goto pin_done;
+	}
+
+	domain = iommu->mediated_domain;
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p, *lpfn;
+		unsigned long tpfn;
+		dma_addr_t iova;
+		long pg_cnt = 1;
+
+		iova = vaddr[i] << PAGE_SHIFT;
+
+		dma = vfio_find_dma(iommu, iova, 0 /*  size */);
+		if (!dma) {
+			ret = -EINVAL;
+			goto pin_done;
+		}
+
+		remote_vaddr = dma->vaddr + iova - dma->iova;
+
+		retpage = vfio_pin_pages_internal(domain, remote_vaddr,
+						  pg_cnt, prot, &tpfn);
+		if (retpage <= 0) {
+			WARN_ON(!retpage);
+			ret = (int)retpage;
+			goto pin_done;
+		}
+
+		pfn[i] = tpfn;
+
+		/* search if pfn exist */
+		p = vfio_find_pfn(domain, tpfn);
+		if (p) {
+			atomic_inc(&p->ref_count);
+			continue;
+		}
+
+		/* add to pfn_list */
+		lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
+		if (!lpfn) {
+			ret = -ENOMEM;
+			goto pin_done;
+		}
+		lpfn->vaddr = remote_vaddr;
+		lpfn->iova = iova;
+		lpfn->pfn = pfn[i];
+		lpfn->npage = 1;
+		lpfn->prot = prot;
+		atomic_inc(&lpfn->ref_count);
+		vfio_link_pfn(domain, lpfn);
+	}
+
+	ret = i;
+
+pin_done:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_pin_pages);
+
+static int vfio_unpin_pfn(struct vfio_domain *domain,
+			  struct vfio_pfn *vpfn, bool do_accounting)
+{
+	int ret;
+
+	ret = vfio_unpin_pages_internal(domain, vpfn->pfn, vpfn->npage,
+					vpfn->prot, do_accounting);
+
+	if (ret > 0 && atomic_dec_and_test(&vpfn->ref_count)) {
+		vfio_unlink_pfn(domain, vpfn);
+		kfree(vpfn);
+	}
+
+	return ret;
+}
+
+/*
+ * Unpin set of host PFNs for API supported domain only.
+ * @pfn	[in] : array of host PFNs to be unpinned.
+ * @npage [in] :count of elements in array, that is number of pages.
+ * @prot [in] : protection flags
+ */
+long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
+		     int prot)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	long unlocked = 0;
+	int i;
+
+	if (!iommu || !pfn)
+		return -EINVAL;
+
+	if (!iommu->mediated_domain)
+		return -EINVAL;
+
+	domain = iommu->mediated_domain;
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+
+		/* verify if pfn exist in pfn_list */
+		p = vfio_find_pfn(domain, *(pfn + i));
+		if (!p)
+			continue;
+
+		mutex_lock(&domain->pfn_list_lock);
+		unlocked += vfio_unpin_pfn(domain, p, true);
+		mutex_unlock(&domain->pfn_list_lock);
+	}
 
 	return unlocked;
 }
+EXPORT_SYMBOL(vfio_unpin_pages);
 
 static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
@@ -341,6 +577,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 
 	if (!dma->size)
 		return;
+
+	if (list_empty(&iommu->domain_list))
+		return;
+
 	/*
 	 * We use the IOMMU to track the physical addresses, otherwise we'd
 	 * need a much more complicated tracking system.  Unfortunately that
@@ -382,9 +622,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		if (WARN_ON(!unmapped))
 			break;
 
-		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
-					     unmapped >> PAGE_SHIFT,
-					     dma->prot, false);
+		unlocked += vfio_unpin_pages_internal(domain,
+						phys >> PAGE_SHIFT,
+						unmapped >> PAGE_SHIFT,
+						dma->prot, false);
 		iova += unmapped;
 
 		cond_resched();
@@ -517,6 +758,9 @@ static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
 	long i;
 	int ret;
 
+	if (domain->mediated_device)
+		return -EINVAL;
+
 	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
 		ret = iommu_map(domain->domain, iova,
 				(phys_addr_t)pfn << PAGE_SHIFT,
@@ -537,6 +781,9 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
 	struct vfio_domain *d;
 	int ret;
 
+	if (list_empty(&iommu->domain_list))
+		return 0;
+
 	list_for_each_entry(d, &iommu->domain_list, next) {
 		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
 				npage << PAGE_SHIFT, prot | d->prot);
@@ -569,6 +816,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	uint64_t mask;
 	struct vfio_dma *dma;
 	unsigned long pfn;
+	struct vfio_domain *domain = NULL;
 
 	/* Verify that none of our __u64 fields overflow */
 	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
@@ -611,10 +859,21 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	/* Insert zero-sized and grow as we map chunks of it */
 	vfio_link_dma(iommu, dma);
 
+	/*
+	 * Skip pin and map if the domain list is empty
+	 */
+	if (list_empty(&iommu->domain_list)) {
+		dma->size = size;
+		goto map_done;
+	}
+
+	domain = list_first_entry(&iommu->domain_list,
+				  struct vfio_domain, next);
+
 	while (size) {
 		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages(vaddr + dma->size,
-				       size >> PAGE_SHIFT, prot, &pfn);
+		npage = vfio_pin_pages_internal(domain, vaddr + dma->size,
+						size >> PAGE_SHIFT, prot, &pfn);
 		if (npage <= 0) {
 			WARN_ON(!npage);
 			ret = (int)npage;
@@ -624,7 +883,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		/* Map it! */
 		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
 		if (ret) {
-			vfio_unpin_pages(pfn, npage, prot, true);
+			vfio_unpin_pages_internal(domain, pfn, npage,
+						  prot, true);
 			break;
 		}
 
@@ -635,6 +895,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (ret)
 		vfio_remove_dma(iommu, dma);
 
+map_done:
 	mutex_unlock(&iommu->lock);
 	return ret;
 }
@@ -658,6 +919,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 	struct rb_node *n;
 	int ret;
 
+	if (domain->mediated_device)
+		return 0;
+
 	/* Arbitrarily pick the first domain in the list for lookups */
 	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
 	n = rb_first(&iommu->dma_list);
@@ -716,6 +980,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	struct page *pages;
 	int ret, order = get_order(PAGE_SIZE * 2);
 
+	if (domain->mediated_device)
+		return;
+
 	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
 	if (!pages)
 		return;
@@ -734,11 +1001,25 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	__free_pages(pages, order);
 }
 
+static struct vfio_group *is_iommu_group_present(struct vfio_domain *domain,
+				   struct iommu_group *iommu_group)
+{
+	struct vfio_group *g;
+
+	list_for_each_entry(g, &domain->group_list, next) {
+		if (g->iommu_group != iommu_group)
+			continue;
+		return g;
+	}
+
+	return NULL;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
-	struct vfio_group *group, *g;
+	struct vfio_group *group;
 	struct vfio_domain *domain, *d;
 	struct bus_type *bus = NULL;
 	int ret;
@@ -746,10 +1027,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	mutex_lock(&iommu->lock);
 
 	list_for_each_entry(d, &iommu->domain_list, next) {
-		list_for_each_entry(g, &d->group_list, next) {
-			if (g->iommu_group != iommu_group)
-				continue;
+		if (is_iommu_group_present(d, iommu_group)) {
+			mutex_unlock(&iommu->lock);
+			return -EINVAL;
+		}
+	}
 
+	if (iommu->mediated_domain) {
+		if (is_iommu_group_present(iommu->mediated_domain,
+					   iommu_group)) {
 			mutex_unlock(&iommu->lock);
 			return -EINVAL;
 		}
@@ -769,6 +1055,32 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_free;
 
+	if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
+		struct mdev_device *mdevice = NULL;
+
+		mdevice = mdev_get_device_by_group(iommu_group);
+		if (!mdevice)
+			goto out_free;
+
+		mdevice->iommu_data = iommu;
+		group->mdevice = mdevice;
+
+		if (iommu->mediated_domain) {
+			list_add(&group->next,
+				 &iommu->mediated_domain->group_list);
+			kfree(domain);
+			goto out_success;
+		}
+		domain->mediated_device = true;
+		domain->mm = current->mm;
+		INIT_LIST_HEAD(&domain->group_list);
+		list_add(&group->next, &domain->group_list);
+		domain->pfn_list = RB_ROOT;
+		mutex_init(&domain->pfn_list_lock);
+		iommu->mediated_domain = domain;
+		goto out_success;
+	}
+
 	domain->domain = iommu_domain_alloc(bus);
 	if (!domain->domain) {
 		ret = -EIO;
@@ -836,6 +1148,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 
 	list_add(&domain->next, &iommu->domain_list);
 
+out_success:
 	mutex_unlock(&iommu->lock);
 
 	return 0;
@@ -859,6 +1172,18 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
 		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
 
+static void vfio_iommu_unpin_api_domain(struct vfio_domain *domain)
+{
+	struct rb_node *node;
+
+	mutex_lock(&domain->pfn_list_lock);
+	while ((node = rb_first(&domain->pfn_list))) {
+		vfio_unpin_pfn(domain,
+				rb_entry(node, struct vfio_pfn, node), false);
+	}
+	mutex_unlock(&domain->pfn_list_lock);
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -868,31 +1193,54 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(domain, &iommu->domain_list, next) {
-		list_for_each_entry(group, &domain->group_list, next) {
-			if (group->iommu_group != iommu_group)
-				continue;
+	if (iommu->mediated_domain) {
+		domain = iommu->mediated_domain;
+		group = is_iommu_group_present(domain, iommu_group);
+		if (group) {
+			if (group->mdevice) {
+				group->mdevice->iommu_data = NULL;
+				mdev_put_device(group->mdevice);
+			}
+
+			list_del(&group->next);
+			kfree(group);
 
+			if (list_empty(&domain->group_list)) {
+				vfio_iommu_unpin_api_domain(domain);
+
+				if (list_empty(&iommu->domain_list))
+					vfio_iommu_unmap_unpin_all(iommu);
+
+				kfree(domain);
+				iommu->mediated_domain = NULL;
+			}
+		}
+	}
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		group = is_iommu_group_present(domain, iommu_group);
+		if (group) {
 			iommu_detach_group(domain->domain, iommu_group);
 			list_del(&group->next);
 			kfree(group);
 			/*
 			 * Group ownership provides privilege, if the group
 			 * list is empty, the domain goes away.  If it's the
-			 * last domain, then all the mappings go away too.
+			 * last domain with iommu and API-only domain doesn't
+			 * exist, then all the mappings go away too.
 			 */
 			if (list_empty(&domain->group_list)) {
-				if (list_is_singular(&iommu->domain_list))
+				if (list_is_singular(&iommu->domain_list) &&
+				    !iommu->mediated_domain)
 					vfio_iommu_unmap_unpin_all(iommu);
 				iommu_domain_free(domain->domain);
 				list_del(&domain->next);
 				kfree(domain);
 			}
-			goto done;
+			break;
 		}
 	}
 
-done:
 	mutex_unlock(&iommu->lock);
 }
 
@@ -930,8 +1278,28 @@ static void vfio_iommu_type1_release(void *iommu_data)
 	struct vfio_domain *domain, *domain_tmp;
 	struct vfio_group *group, *group_tmp;
 
+	if (iommu->mediated_domain) {
+		domain = iommu->mediated_domain;
+		list_for_each_entry_safe(group, group_tmp,
+					 &domain->group_list, next) {
+			if (group->mdevice) {
+				group->mdevice->iommu_data = NULL;
+				mdev_put_device(group->mdevice);
+			}
+
+			list_del(&group->next);
+			kfree(group);
+		}
+		vfio_iommu_unpin_api_domain(domain);
+		kfree(domain);
+		iommu->mediated_domain = NULL;
+	}
+
 	vfio_iommu_unmap_unpin_all(iommu);
 
+	if (list_empty(&iommu->domain_list))
+		goto release_exit;
+
 	list_for_each_entry_safe(domain, domain_tmp,
 				 &iommu->domain_list, next) {
 		list_for_each_entry_safe(group, group_tmp,
@@ -945,6 +1313,7 @@ static void vfio_iommu_type1_release(void *iommu_data)
 		kfree(domain);
 	}
 
+release_exit:
 	kfree(iommu);
 }
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 431b824b0d3e..0a907bb33426 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -134,6 +134,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
 }
 #endif /* CONFIG_EEH */
 
+extern long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
+			   int prot, dma_addr_t *pfn_base);
+
+extern long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
+			     int prot);
+
 /*
  * IRQfd - generic
  */
-- 
2.7.0



* [Qemu-devel] [RFC PATCH v4 3/3] VFIO Type1 IOMMU: Add support for mediated devices
@ 2016-05-24 19:58   ` Kirti Wankhede
  0 siblings, 0 replies; 92+ messages in thread
From: Kirti Wankhede @ 2016-05-24 19:58 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	bjsdjshi, Kirti Wankhede

The VFIO Type1 IOMMU driver is designed for devices that are IOMMU
capable. A mediated device only uses the IOMMU Type1 API; the underlying
hardware can be managed by an IOMMU domain.

This change exports functions to pin and unpin pages for mediated devices.
It also tracks the pages pinned for the mediated domain. This data is used
to verify unpin requests and to unpin any remaining pages from
detach_group().
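
As an illustration of the tracking described above, here is a minimal
userspace sketch. It is not the patch code: a sorted singly linked list
stands in for the kernel rb-tree (domain->pfn_list), and the names
pin_entry, pin_track() and pin_untrack() are hypothetical. It only shows
the intended behaviour: repeated pins of the same host pfn share one
entry with a reference count, and an unpin request for an untracked pfn
is rejected.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vfio_pfn; keyed by host pfn. */
struct pin_entry {
	unsigned long pfn;		/* host pfn, the lookup key */
	int ref_count;			/* outstanding pins of this pfn */
	struct pin_entry *next;		/* list is kept sorted by pfn */
};

/* Return the link where pfn is stored, or where it would be inserted. */
static struct pin_entry **pin_find_link(struct pin_entry **head,
					unsigned long pfn)
{
	assert(head != NULL);
	while (*head && (*head)->pfn < pfn)
		head = &(*head)->next;
	return head;
}

/* Record one pin. Returns the new reference count, or -1 if out of memory. */
static int pin_track(struct pin_entry **head, unsigned long pfn)
{
	struct pin_entry **link = pin_find_link(head, pfn);
	struct pin_entry *e = *link;

	if (!e || e->pfn != pfn) {
		e = calloc(1, sizeof(*e));
		if (!e)
			return -1;
		e->pfn = pfn;
		e->next = *link;	/* insert before the first larger pfn */
		*link = e;
	}
	return ++e->ref_count;
}

/*
 * Drop one pin. Returns the remaining reference count, or -1 if the pfn
 * was never tracked (the request fails verification).
 */
static int pin_untrack(struct pin_entry **head, unsigned long pfn)
{
	struct pin_entry **link = pin_find_link(head, pfn);
	struct pin_entry *e = *link;
	int remaining;

	if (!e || e->pfn != pfn)
		return -1;

	remaining = --e->ref_count;
	if (remaining == 0) {
		*link = e->next;	/* last reference: unlink and free */
		free(e);
	}
	return remaining;
}
```

Walking the list and calling pin_untrack() until it is empty corresponds
to the cleanup done at detach time for pages a caller never unpinned.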

The aims of this change are:
- To reuse most of the IOMMU driver code for mediated devices
- To support directly assigned devices and mediated devices in a single
  module

This version keeps the mediated domain structure out of domain_list.

Tested by assigning the following combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- Two GPU pass through

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I9c262abc9c68fd6abf52d91a636bf0cc631593a0
---
 drivers/vfio/vfio_iommu_type1.c | 433 +++++++++++++++++++++++++++++++++++++---
 include/linux/vfio.h            |   6 +
 2 files changed, 407 insertions(+), 32 deletions(-)
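
Note for readers: the exported vfio_pin_pages() takes guest PFNs, turns
each one into an IOVA, looks up the DMA mapping that covers it, and
derives the process virtual address to pin. The userspace sketch below
shows only that address arithmetic. The struct and function names are
hypothetical, a linear search stands in for the rb-tree lookup done by
vfio_find_dma(), and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the fields of struct vfio_dma used here. */
struct dma_range {
	uint64_t iova;		/* start of the range as seen by the device */
	uint64_t vaddr;		/* start of the range in the process */
	uint64_t size;		/* length of the range in bytes */
};

/* Find the range covering iova; linear search instead of an rb-tree. */
static const struct dma_range *find_range(const struct dma_range *r,
					  size_t n, uint64_t iova)
{
	for (size_t i = 0; i < n; i++)
		if (iova >= r[i].iova && iova - r[i].iova < r[i].size)
			return &r[i];
	return NULL;
}

/*
 * Translate a guest PFN into a process virtual address.
 * Returns 0 on success, -1 if no mapping covers the page.
 */
static int gfn_to_vaddr(const struct dma_range *r, size_t n,
			uint64_t gfn, uint64_t *vaddr)
{
	uint64_t iova = gfn << SKETCH_PAGE_SHIFT;
	const struct dma_range *dma = find_range(r, n, iova);

	assert(vaddr != NULL);
	if (!dma)
		return -1;

	/* Same offset into the mapping on both sides. */
	*vaddr = dma->vaddr + (iova - dma->iova);
	return 0;
}
```

In the patch, the resulting address is then pinned against the mm saved
in the mediated domain rather than against current->mm, since the caller
may not be running in the context of the process that owns the memory.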

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e93cedb..5cc7dc0288a3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/workqueue.h>
+#include <linux/mdev.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -55,6 +56,7 @@ MODULE_PARM_DESC(disable_hugepages,
 
 struct vfio_iommu {
 	struct list_head	domain_list;
+	struct vfio_domain	*mediated_domain;
 	struct mutex		lock;
 	struct rb_root		dma_list;
 	bool			v2;
@@ -67,6 +69,13 @@ struct vfio_domain {
 	struct list_head	group_list;
 	int			prot;		/* IOMMU_CACHE */
 	bool			fgsp;		/* Fine-grained super pages */
+
+	/* Domain for mediated device which is without physical IOMMU */
+	bool			mediated_device;
+
+	struct mm_struct	*mm;
+	struct rb_root		pfn_list;	/* pinned Host pfn list */
+	struct mutex		pfn_list_lock;	/* mutex for pfn_list */
 };
 
 struct vfio_dma {
@@ -79,10 +88,23 @@ struct vfio_dma {
 
 struct vfio_group {
 	struct iommu_group	*iommu_group;
+	struct mdev_device	*mdevice;
 	struct list_head	next;
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_pfn {
+	struct rb_node		node;
+	unsigned long		vaddr;		/* virtual addr */
+	dma_addr_t		iova;		/* IOVA */
+	unsigned long		npage;		/* number of pages */
+	unsigned long		pfn;		/* Host pfn */
+	size_t			prot;
+	atomic_t		ref_count;
+};
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +152,64 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
 	rb_erase(&old->node, &iommu->dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
+				      unsigned long pfn)
+{
+	struct rb_node *node;
+	struct vfio_pfn *vpfn, *ret = NULL;
+
+	mutex_lock(&domain->pfn_list_lock);
+	node = domain->pfn_list.rb_node;
+
+	while (node) {
+		vpfn = rb_entry(node, struct vfio_pfn, node);
+
+		if (pfn < vpfn->pfn)
+			node = node->rb_left;
+		else if (pfn > vpfn->pfn)
+			node = node->rb_right;
+		else {
+			ret = vpfn;
+			break;
+		}
+	}
+
+	mutex_unlock(&domain->pfn_list_lock);
+	return ret;
+}
+
+static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
+{
+	struct rb_node **link, *parent = NULL;
+	struct vfio_pfn *vpfn;
+
+	mutex_lock(&domain->pfn_list_lock);
+	link = &domain->pfn_list.rb_node;
+	while (*link) {
+		parent = *link;
+		vpfn = rb_entry(parent, struct vfio_pfn, node);
+
+		if (new->pfn < vpfn->pfn)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &domain->pfn_list);
+	mutex_unlock(&domain->pfn_list_lock);
+}
+
+/* call by holding domain->pfn_list_lock */
+static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
+{
+	rb_erase(&old->node, &domain->pfn_list);
+}
+
 struct vwork {
 	struct mm_struct	*mm;
 	long			npage;
@@ -228,20 +308,29 @@ static int put_pfn(unsigned long pfn, int prot)
 	return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+			 int prot, unsigned long *pfn)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
+	struct mm_struct *local_mm = mm;
 	int ret = -EFAULT;
 
-	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+	if (!local_mm && !current->mm)
+		return -ENODEV;
+
+	if (!local_mm)
+		local_mm = current->mm;
+
+	down_read(&local_mm->mmap_sem);
+	if (get_user_pages_remote(NULL, local_mm, vaddr, 1,
+				!!(prot & IOMMU_WRITE), 0, page, NULL) == 1) {
 		*pfn = page_to_pfn(page[0]);
-		return 0;
+		ret = 0;
+		goto done_pfn;
 	}
 
-	down_read(&current->mm->mmap_sem);
-
-	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+	vma = find_vma_intersection(local_mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
 		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
@@ -249,7 +338,8 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
 			ret = 0;
 	}
 
-	up_read(&current->mm->mmap_sem);
+done_pfn:
+	up_read(&local_mm->mmap_sem);
 
 	return ret;
 }
@@ -259,18 +349,19 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages(unsigned long vaddr, long npage,
-			   int prot, unsigned long *pfn_base)
+static long vfio_pin_pages_internal(struct vfio_domain *domain,
+				    unsigned long vaddr, long npage,
+				    int prot, unsigned long *pfn_base)
 {
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	bool lock_cap = capable(CAP_IPC_LOCK);
 	long ret, i;
 	bool rsvd;
 
-	if (!current->mm)
+	if (!domain)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
+	ret = vaddr_get_pfn(domain->mm, vaddr, prot, pfn_base);
 	if (ret)
 		return ret;
 
@@ -293,7 +384,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
 		unsigned long pfn = 0;
 
-		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		ret = vaddr_get_pfn(domain->mm, vaddr, prot, &pfn);
 		if (ret)
 			break;
 
@@ -318,20 +409,165 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	return i;
 }
 
-static long vfio_unpin_pages(unsigned long pfn, long npage,
-			     int prot, bool do_accounting)
+static long vfio_unpin_pages_internal(struct vfio_domain *domain,
+				      unsigned long pfn, long npage, int prot,
+				      bool do_accounting)
 {
 	unsigned long unlocked = 0;
 	long i;
 
+	if (!domain)
+		return -ENODEV;
+
 	for (i = 0; i < npage; i++)
 		unlocked += put_pfn(pfn++, prot);
 
 	if (do_accounting)
 		vfio_lock_acct(-unlocked);
+	return unlocked;
+}
+
+/*
+ * Pin a set of guest PFNs and return their associated host PFNs for API
+ * supported domain only.
+ * @vaddr [in]: array of guest PFNs
+ * @npage [in]: count of array elements
+ * @prot [in] : protection flags
+ * @pfn_base[out] : array of host PFNs
+ */
+long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
+		   int prot, dma_addr_t *pfn_base)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	int i = 0, ret = 0;
+	long retpage;
+	unsigned long remote_vaddr = 0;
+	dma_addr_t *pfn = pfn_base;
+	struct vfio_dma *dma;
+
+	if (!iommu || !vaddr || !pfn_base)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (!iommu->mediated_domain) {
+		ret = -EINVAL;
+		goto pin_done;
+	}
+
+	domain = iommu->mediated_domain;
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p, *lpfn;
+		unsigned long tpfn;
+		dma_addr_t iova;
+		long pg_cnt = 1;
+
+		iova = vaddr[i] << PAGE_SHIFT;
+
+		dma = vfio_find_dma(iommu, iova, 0 /*  size */);
+		if (!dma) {
+			ret = -EINVAL;
+			goto pin_done;
+		}
+
+		remote_vaddr = dma->vaddr + iova - dma->iova;
+
+		retpage = vfio_pin_pages_internal(domain, remote_vaddr,
+						  pg_cnt, prot, &tpfn);
+		if (retpage <= 0) {
+			WARN_ON(!retpage);
+			ret = (int)retpage;
+			goto pin_done;
+		}
+
+		pfn[i] = tpfn;
+
+		/* search if pfn exist */
+		p = vfio_find_pfn(domain, tpfn);
+		if (p) {
+			atomic_inc(&p->ref_count);
+			continue;
+		}
+
+		/* add to pfn_list */
+		lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
+		if (!lpfn) {
+			ret = -ENOMEM;
+			goto pin_done;
+		}
+		lpfn->vaddr = remote_vaddr;
+		lpfn->iova = iova;
+		lpfn->pfn = pfn[i];
+		lpfn->npage = 1;
+		lpfn->prot = prot;
+		atomic_inc(&lpfn->ref_count);
+		vfio_link_pfn(domain, lpfn);
+	}
+
+	ret = i;
+
+pin_done:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+EXPORT_SYMBOL(vfio_pin_pages);
+
+static int vfio_unpin_pfn(struct vfio_domain *domain,
+			  struct vfio_pfn *vpfn, bool do_accounting)
+{
+	int ret;
+
+	ret = vfio_unpin_pages_internal(domain, vpfn->pfn, vpfn->npage,
+					vpfn->prot, do_accounting);
+
+	if (ret > 0 && atomic_dec_and_test(&vpfn->ref_count)) {
+		vfio_unlink_pfn(domain, vpfn);
+		kfree(vpfn);
+	}
+
+	return ret;
+}
+
+/*
+ * Unpin set of host PFNs for API supported domain only.
+ * @pfn	[in] : array of host PFNs to be unpinned.
+ * @npage [in] :count of elements in array, that is number of pages.
+ * @prot [in] : protection flags
+ */
+long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
+		     int prot)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL;
+	long unlocked = 0;
+	int i;
+
+	if (!iommu || !pfn)
+		return -EINVAL;
+
+	if (!iommu->mediated_domain)
+		return -EINVAL;
+
+	domain = iommu->mediated_domain;
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_pfn *p;
+
+		/* verify if pfn exist in pfn_list */
+		p = vfio_find_pfn(domain, *(pfn + i));
+		if (!p)
+			continue;
+
+		mutex_lock(&domain->pfn_list_lock);
+		unlocked += vfio_unpin_pfn(domain, p, true);
+		mutex_unlock(&domain->pfn_list_lock);
+	}
 
 	return unlocked;
 }
+EXPORT_SYMBOL(vfio_unpin_pages);
 
 static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
@@ -341,6 +577,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 
 	if (!dma->size)
 		return;
+
+	if (list_empty(&iommu->domain_list))
+		return;
+
 	/*
 	 * We use the IOMMU to track the physical addresses, otherwise we'd
 	 * need a much more complicated tracking system.  Unfortunately that
@@ -382,9 +622,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		if (WARN_ON(!unmapped))
 			break;
 
-		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
-					     unmapped >> PAGE_SHIFT,
-					     dma->prot, false);
+		unlocked += vfio_unpin_pages_internal(domain,
+						phys >> PAGE_SHIFT,
+						unmapped >> PAGE_SHIFT,
+						dma->prot, false);
 		iova += unmapped;
 
 		cond_resched();
@@ -517,6 +758,9 @@ static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
 	long i;
 	int ret;
 
+	if (domain->mediated_device)
+		return -EINVAL;
+
 	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
 		ret = iommu_map(domain->domain, iova,
 				(phys_addr_t)pfn << PAGE_SHIFT,
@@ -537,6 +781,9 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
 	struct vfio_domain *d;
 	int ret;
 
+	if (list_empty(&iommu->domain_list))
+		return 0;
+
 	list_for_each_entry(d, &iommu->domain_list, next) {
 		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
 				npage << PAGE_SHIFT, prot | d->prot);
@@ -569,6 +816,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	uint64_t mask;
 	struct vfio_dma *dma;
 	unsigned long pfn;
+	struct vfio_domain *domain = NULL;
 
 	/* Verify that none of our __u64 fields overflow */
 	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
@@ -611,10 +859,21 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	/* Insert zero-sized and grow as we map chunks of it */
 	vfio_link_dma(iommu, dma);
 
+	/*
+	 * Skip pin and map if the domain list is empty
+	 */
+	if (list_empty(&iommu->domain_list)) {
+		dma->size = size;
+		goto map_done;
+	}
+
+	domain = list_first_entry(&iommu->domain_list,
+				  struct vfio_domain, next);
+
 	while (size) {
 		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages(vaddr + dma->size,
-				       size >> PAGE_SHIFT, prot, &pfn);
+		npage = vfio_pin_pages_internal(domain, vaddr + dma->size,
+						size >> PAGE_SHIFT, prot, &pfn);
 		if (npage <= 0) {
 			WARN_ON(!npage);
 			ret = (int)npage;
@@ -624,7 +883,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		/* Map it! */
 		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
 		if (ret) {
-			vfio_unpin_pages(pfn, npage, prot, true);
+			vfio_unpin_pages_internal(domain, pfn, npage,
+						  prot, true);
 			break;
 		}
 
@@ -635,6 +895,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (ret)
 		vfio_remove_dma(iommu, dma);
 
+map_done:
 	mutex_unlock(&iommu->lock);
 	return ret;
 }
@@ -658,6 +919,9 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 	struct rb_node *n;
 	int ret;
 
+	if (domain->mediated_device)
+		return 0;
+
 	/* Arbitrarily pick the first domain in the list for lookups */
 	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
 	n = rb_first(&iommu->dma_list);
@@ -716,6 +980,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	struct page *pages;
 	int ret, order = get_order(PAGE_SIZE * 2);
 
+	if (domain->mediated_device)
+		return;
+
 	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
 	if (!pages)
 		return;
@@ -734,11 +1001,25 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	__free_pages(pages, order);
 }
 
+static struct vfio_group *is_iommu_group_present(struct vfio_domain *domain,
+				   struct iommu_group *iommu_group)
+{
+	struct vfio_group *g;
+
+	list_for_each_entry(g, &domain->group_list, next) {
+		if (g->iommu_group != iommu_group)
+			continue;
+		return g;
+	}
+
+	return NULL;
+}
+
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
-	struct vfio_group *group, *g;
+	struct vfio_group *group;
 	struct vfio_domain *domain, *d;
 	struct bus_type *bus = NULL;
 	int ret;
@@ -746,10 +1027,15 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	mutex_lock(&iommu->lock);
 
 	list_for_each_entry(d, &iommu->domain_list, next) {
-		list_for_each_entry(g, &d->group_list, next) {
-			if (g->iommu_group != iommu_group)
-				continue;
+		if (is_iommu_group_present(d, iommu_group)) {
+			mutex_unlock(&iommu->lock);
+			return -EINVAL;
+		}
+	}
 
+	if (iommu->mediated_domain) {
+		if (is_iommu_group_present(iommu->mediated_domain,
+					   iommu_group)) {
 			mutex_unlock(&iommu->lock);
 			return -EINVAL;
 		}
@@ -769,6 +1055,32 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_free;
 
+	if (!iommu_present(bus) && (bus == &mdev_bus_type)) {
+		struct mdev_device *mdevice = NULL;
+
+		mdevice = mdev_get_device_by_group(iommu_group);
+		if (!mdevice)
+			goto out_free;
+
+		mdevice->iommu_data = iommu;
+		group->mdevice = mdevice;
+
+		if (iommu->mediated_domain) {
+			list_add(&group->next,
+				 &iommu->mediated_domain->group_list);
+			kfree(domain);
+			goto out_success;
+		}
+		domain->mediated_device = true;
+		domain->mm = current->mm;
+		INIT_LIST_HEAD(&domain->group_list);
+		list_add(&group->next, &domain->group_list);
+		domain->pfn_list = RB_ROOT;
+		mutex_init(&domain->pfn_list_lock);
+		iommu->mediated_domain = domain;
+		goto out_success;
+	}
+
 	domain->domain = iommu_domain_alloc(bus);
 	if (!domain->domain) {
 		ret = -EIO;
@@ -836,6 +1148,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 
 	list_add(&domain->next, &iommu->domain_list);
 
+out_success:
 	mutex_unlock(&iommu->lock);
 
 	return 0;
@@ -859,6 +1172,18 @@ static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
 		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
 
+static void vfio_iommu_unpin_api_domain(struct vfio_domain *domain)
+{
+	struct rb_node *node;
+
+	mutex_lock(&domain->pfn_list_lock);
+	while ((node = rb_first(&domain->pfn_list))) {
+		vfio_unpin_pfn(domain,
+				rb_entry(node, struct vfio_pfn, node), false);
+	}
+	mutex_unlock(&domain->pfn_list_lock);
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -868,31 +1193,54 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(domain, &iommu->domain_list, next) {
-		list_for_each_entry(group, &domain->group_list, next) {
-			if (group->iommu_group != iommu_group)
-				continue;
+	if (iommu->mediated_domain) {
+		domain = iommu->mediated_domain;
+		group = is_iommu_group_present(domain, iommu_group);
+		if (group) {
+			if (group->mdevice) {
+				group->mdevice->iommu_data = NULL;
+				mdev_put_device(group->mdevice);
+			}
+
+			list_del(&group->next);
+			kfree(group);
 
+			if (list_empty(&domain->group_list)) {
+				vfio_iommu_unpin_api_domain(domain);
+
+				if (list_empty(&iommu->domain_list))
+					vfio_iommu_unmap_unpin_all(iommu);
+
+				kfree(domain);
+				iommu->mediated_domain = NULL;
+			}
+		}
+	}
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		group = is_iommu_group_present(domain, iommu_group);
+		if (group) {
 			iommu_detach_group(domain->domain, iommu_group);
 			list_del(&group->next);
 			kfree(group);
 			/*
 			 * Group ownership provides privilege, if the group
 			 * list is empty, the domain goes away.  If it's the
-			 * last domain, then all the mappings go away too.
+			 * last domain with iommu and API-only domain doesn't
+			 * exist, then all the mappings go away too.
 			 */
 			if (list_empty(&domain->group_list)) {
-				if (list_is_singular(&iommu->domain_list))
+				if (list_is_singular(&iommu->domain_list) &&
+				    !iommu->mediated_domain)
 					vfio_iommu_unmap_unpin_all(iommu);
 				iommu_domain_free(domain->domain);
 				list_del(&domain->next);
 				kfree(domain);
 			}
-			goto done;
+			break;
 		}
 	}
 
-done:
 	mutex_unlock(&iommu->lock);
 }
 
@@ -930,8 +1278,28 @@ static void vfio_iommu_type1_release(void *iommu_data)
 	struct vfio_domain *domain, *domain_tmp;
 	struct vfio_group *group, *group_tmp;
 
+	if (iommu->mediated_domain) {
+		domain = iommu->mediated_domain;
+		list_for_each_entry_safe(group, group_tmp,
+					 &domain->group_list, next) {
+			if (group->mdevice) {
+				group->mdevice->iommu_data = NULL;
+				mdev_put_device(group->mdevice);
+			}
+
+			list_del(&group->next);
+			kfree(group);
+		}
+		vfio_iommu_unpin_api_domain(domain);
+		kfree(domain);
+		iommu->mediated_domain = NULL;
+	}
+
 	vfio_iommu_unmap_unpin_all(iommu);
 
+	if (list_empty(&iommu->domain_list))
+		goto release_exit;
+
 	list_for_each_entry_safe(domain, domain_tmp,
 				 &iommu->domain_list, next) {
 		list_for_each_entry_safe(group, group_tmp,
@@ -945,6 +1313,7 @@ static void vfio_iommu_type1_release(void *iommu_data)
 		kfree(domain);
 	}
 
+release_exit:
 	kfree(iommu);
 }
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 431b824b0d3e..0a907bb33426 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -134,6 +134,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
 }
 #endif /* CONFIG_EEH */
 
+extern long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
+			   int prot, dma_addr_t *pfn_base);
+
+extern long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
+			     int prot);
+
 /*
  * IRQfd - generic
  */
-- 
2.7.0


* RE: [RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support]
  2016-05-24 19:58 ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-25  7:13   ` Tian, Kevin
  -1 siblings, 0 replies; 92+ messages in thread
From: Tian, Kevin @ 2016-05-25  7:13 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Wednesday, May 25, 2016 3:58 AM
> 
> This series adds Mediated device support to v4.6 Linux host kernel. Purpose
> of this series is to provide a common interface for mediated device
> management that can be used by different devices. This series introduces
> Mdev core module that create and manage mediated devices, VFIO based driver
> for mediated PCI devices that are created by Mdev core module and update
> VFIO type1 IOMMU module to support mediated devices.

Thanks. "Mediated device" is more generic than the previous one. :-)

> 
> What's new in v4?
> - Renamed 'vgpu' module to 'mdev' module that represent generic term
>   'Mediated device'.
> - Moved mdev directory to drivers/vfio directory as this is the extension
>   of VFIO APIs for mediated devices.
> - Updated mdev driver to be flexible to register multiple types of drivers
>   to mdev_bus_type bus.
> - Updated mdev core driver with mdev_put_device() and mdev_get_device() for
>   mediated devices.
> 
> 

Just curious: in this version you have moved the whole mdev core under
VFIO. Sorry if I missed any agreement on this change. IIRC Alex
doesn't want VFIO to manage the mdev life-cycle directly. Instead, VFIO is
just an mdev driver for the created mediated devices...

Thanks
Kevin


* RE: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-05-24 19:58   ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-25  7:55     ` Tian, Kevin
  -1 siblings, 0 replies; 92+ messages in thread
From: Tian, Kevin @ 2016-05-25  7:55 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Wednesday, May 25, 2016 3:58 AM
> 
> Design for Mediated Device Driver:
> Main purpose of this driver is to provide a common interface for mediated
> device management that can be used by different drivers of different
> devices.
> 
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
> 
> Below is the high Level block diagram, with Nvidia, Intel and IBM devices
> as example, since these are the devices which are going to actively use
> this module as of now.
> 
>  +---------------+
>  |               |
>  | +-----------+ |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |  mdev     | +------------------------>+              |<-> VFIO user
>  | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
>  | |  driver   | |                         |              |
>  | |           | |                         +--------------+
>  | |           | |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |           | +------------------------>+              |<-> VFIO user
>  | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
>  |               |                         |              |
>  |  MDEV CORE    |                         +--------------+
>  |   MODULE      |
>  |   mdev.ko     |
>  | +-----------+ |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         |  nvidia.ko   |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | | Physical  | |
>  | |  device   | |  mdev_register_device() +--------------+
>  | | interface | |<------------------------+              |
>  | |           | |                         |  i915.ko     |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | |           | |
>  | |           | |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         | ccw_device.ko|<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | +-----------+ |
>  +---------------+
> 
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
> 
> /**
>   * struct mdev_driver - Mediated device's driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove:called when device removed
>   * @match: called when new device or driver is added for this bus.
> 	    Return 1 if given device can be handled by given driver and
> 	    zero otherwise.
>   * @driver:device driver structure
>   *
>   **/
> struct mdev_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
> 	 int  (*match)(struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
> 
> A mediated device's driver for mdev should use this interface to register
> with the Core driver. With this, the mediated device driver for such
> devices is responsible for adding the mediated device to the VFIO group.
> 
> 2. Physical device driver interface
> This interface provides the vendor driver with a set of APIs to manage
> physical device related work in its own driver. The APIs are:
> - supported_config: provide supported configuration list by the vendor
> 		    driver
> - create: to allocate basic resources in vendor driver for a mediated
> 	  device.
> - destroy: to free resources in vendor driver when mediated device is
> 	   destroyed.
> - start: to initiate mediated device initialization process from vendor
> 	 driver when VM boots and before QEMU starts.
> - shutdown: to teardown mediated device resources during VM teardown.
> - read : read emulation callback.
> - write: write emulation callback.
> - set_irqs: send interrupt configuration information that QEMU sets.
> - get_region_info: to provide region size and its flags for the mediated
> 		   device.
> - validate_map_request: to validate remap pfn request.
> 
> This registration interface should be used by vendor drivers to register
> each physical device to mdev core driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I88f4482f7608f40550a152c5f882b64271287c62
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/Makefile            |   1 +
>  drivers/vfio/mdev/Kconfig        |  11 +
>  drivers/vfio/mdev/Makefile       |   5 +
>  drivers/vfio/mdev/mdev-core.c    | 462 +++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev-driver.c  | 139 ++++++++++++
>  drivers/vfio/mdev/mdev-sysfs.c   | 312 ++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev_private.h |  33 +++
>  include/linux/mdev.h             | 224 +++++++++++++++++++
>  9 files changed, 1188 insertions(+)
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev-core.c
>  create mode 100644 drivers/vfio/mdev/mdev-driver.c
>  create mode 100644 drivers/vfio/mdev/mdev-sysfs.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>  create mode 100644 include/linux/mdev.h
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
> 
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>  source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..7c70753e54ab 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..951e2bb06a3f
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,11 @@
> +
> +config MDEV
> +    tristate "Mediated device driver framework"

Sorry, I'm not a native speaker. Would it be cleaner to say "Driver framework
for Mediated Devices" or "Mediated Device Framework"? Should we focus on the
driver or the device here?

> +    depends on VFIO
> +    default n
> +    help
> +        MDEV provides a framework to virtualize devices without SR-IOV capability.
> +        See Documentation/mdev.txt for more details.

It looks like Documentation/mdev.txt is not included in this version.

> +
> +        If you don't know what to do here, say N.
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..4adb069febce
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,5 @@
> +
> +mdev-y := mdev-core.o mdev-sysfs.o mdev-driver.o
> +
> +obj-$(CONFIG_MDEV) += mdev.o
> +
> diff --git a/drivers/vfio/mdev/mdev-core.c b/drivers/vfio/mdev/mdev-core.c
> new file mode 100644
> index 000000000000..af070d73735f
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev-core.c
> @@ -0,0 +1,462 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/cdev.h>
> +#include <linux/sched.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +#define MDEV_CLASS_NAME		"mdev"
> +
> +/*
> + * Global Structures
> + */
> +
> +static struct devices_list {
> +	struct list_head    dev_list;
> +	struct mutex        list_lock;
> +} mdevices, phy_devices;

phy_devices -> pdevices? Similarly, we could use the pdev/mdev
pair in other places...

> +
> +/*
> + * Functions
> + */
> +
> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}
> +
> +static struct mdev_device *find_mdev_device(uuid_le uuid, int instance)

Can we just call it "struct mdev" or "mdevice"? "dev_device" looks redundant.

Sorry, I have to ask the same question again since I haven't got an answer yet:
what exactly does 'instance' mean here? Since the uuid is unique, why do
we need to match the instance too?

> +{
> +	struct mdev_device *vdev = NULL, *v;

Better to unify the notation here. What's the difference between mdev
and vdev?

> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_for_each_entry(v, &mdevices.dev_list, next) {
> +		if ((uuid_le_cmp(v->uuid, uuid) == 0) &&
> +		    (v->instance == instance)) {
> +			vdev = v;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&mdevices.list_lock);
> +	return vdev;
> +}
> +
> +static struct mdev_device *find_next_mdev_device(struct phy_device *phy_dev)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_for_each_entry(p, &mdevices.dev_list, next) {
> +		if (p->phy_dev == phy_dev) {
> +			mdev = p;
> +			break;
> +		}
> +	}

It looks like the above finds the first mdev for a given physical device,
rather than the next mdev.

> +	mutex_unlock(&mdevices.list_lock);
> +	return mdev;
> +}
> +
> +static struct phy_device *find_physical_device(struct device *dev)
> +{
> +	struct phy_device *pdev = NULL, *p;
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	list_for_each_entry(p, &phy_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			pdev = p;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&phy_devices.list_lock);
> +	return pdev;
> +}
> +
> +static void mdev_destroy_device(struct mdev_device *mdevice)
> +{
> +	struct phy_device *phy_dev = mdevice->phy_dev;
> +
> +	if (phy_dev) {
> +		mutex_lock(&phy_devices.list_lock);
> +
> +		/*
> +		* If vendor driver doesn't return success that means vendor
> +		* driver doesn't support hot-unplug
> +		*/
> +		if (phy_dev->ops->destroy) {
> +			if (phy_dev->ops->destroy(phy_dev->dev, mdevice->uuid,
> +						  mdevice->instance)) {
> +				mutex_unlock(&phy_devices.list_lock);

A warning message would be preferable here. It would also be better to
return -EBUSY.

> +				return;
> +			}
> +		}
> +
> +		mdev_remove_attribute_group(&mdevice->dev,
> +					    phy_dev->ops->mdev_attr_groups);
> +		mdevice->phy_dev = NULL;

Am I missing something here? You didn't remove this mdev node from
the list, and below...

> +		mutex_unlock(&phy_devices.list_lock);

you should use the mutex of the mdevices list.
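
To make the locking point concrete, here is a minimal userspace sketch; it
is not code from the patch. All names (toy_lock, toy_mdev, toy_mdev_unlink)
are hypothetical, and the "lock" is only a flag standing in for a kernel
mutex. The idea it shows: the node is unlinked while holding the lock that
protects the list being modified, not some other list's lock.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a kernel mutex: it only records whether it is held. */
struct toy_lock {
	int held;
};

void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
}

void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Hypothetical mediated device kept on a singly linked global list. */
struct toy_mdev {
	int id;
	struct toy_mdev *next;
};

struct toy_lock toy_mdev_list_lock;	/* protects toy_mdev_list only */
struct toy_mdev *toy_mdev_list;

void toy_mdev_add(struct toy_mdev *m)
{
	toy_lock_acquire(&toy_mdev_list_lock);
	m->next = toy_mdev_list;
	toy_mdev_list = m;
	toy_lock_release(&toy_mdev_list_lock);
}

/*
 * Unlink @m while holding the lock that protects the list being changed.
 * Returns 0 if @m was on the list, -1 otherwise.
 */
int toy_mdev_unlink(struct toy_mdev *m)
{
	struct toy_mdev **pp;
	int ret = -1;

	toy_lock_acquire(&toy_mdev_list_lock);
	for (pp = &toy_mdev_list; *pp; pp = &(*pp)->next) {
		if (*pp == m) {
			*pp = m->next;
			m->next = NULL;
			ret = 0;
			break;
		}
	}
	toy_lock_release(&toy_mdev_list_lock);
	return ret;
}

/* Count list entries; takes the same lock as the writers. */
int toy_mdev_count(void)
{
	struct toy_mdev *m;
	int n = 0;

	toy_lock_acquire(&toy_mdev_list_lock);
	for (m = toy_mdev_list; m; m = m->next)
		n++;
	toy_lock_release(&toy_mdev_list_lock);
	return n;
}
```

In the real driver the equivalent would presumably be a list_del() under
mdevices.list_lock, but where exactly that belongs is for the author to decide.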

> +	}
> +
> +	mdev_put_device(mdevice);
> +	device_unregister(&mdevice->dev);
> +}
> +
> +/*
> + * Find mediated device from given iommu_group and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_for_each_entry(p, &mdevices.dev_list, next) {
> +		if (!p->group)
> +			continue;
> +
> +		if (iommu_group_id(p->group) == iommu_group_id(group)) {
> +			mdev = mdev_get_device(p);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&mdevices.list_lock);
> +	return mdev;
> +}
> +EXPORT_SYMBOL_GPL(mdev_get_device_by_group);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing physical device.
> + * @phy_device_ops: Physical device operation structure to be registered.
> + *
> + * Add device to list of registered physical devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct phy_device_ops *ops)
> +{
> +	int ret = 0;
> +	struct phy_device *phy_dev, *pdev;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	/* Check for duplicate */
> +	pdev = find_physical_device(dev);
> +	if (pdev)
> +		return -EEXIST;
> +
> +	phy_dev = kzalloc(sizeof(*phy_dev), GFP_KERNEL);
> +	if (!phy_dev)
> +		return -ENOMEM;
> +
> +	phy_dev->dev = dev;
> +	phy_dev->ops = ops;
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	ret = mdev_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;

Any reason to include the sysfs operations inside the mutex, which is
purely about the phy_devices list?
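
One possible arrangement, as a userspace sketch under stated assumptions:
toy_setup() and toy_teardown() are hypothetical stand-ins for the sysfs
calls, and lock_held is a flag standing in for the list mutex. The slow,
fallible work runs outside the lock, and the lock covers only the duplicate
check and the list insertion.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_PDEVS 8

/* Hypothetical registry of physical devices, keyed by an integer id. */
struct toy_registry {
	int lock_held;			/* stand-in for the list mutex */
	int ids[TOY_MAX_PDEVS];
	size_t count;
	int setup_calls;		/* stand-in for sysfs file creation */
	int teardown_calls;		/* stand-in for sysfs file removal */
	int fail_setup;			/* test knob: make setup fail */
};

/* Slow, possibly failing work that does not touch the list. */
int toy_setup(struct toy_registry *r)
{
	assert(!r->lock_held);		/* must run outside the list lock */
	r->setup_calls++;
	return r->fail_setup ? -EIO : 0;
}

void toy_teardown(struct toy_registry *r)
{
	assert(!r->lock_held);
	r->teardown_calls++;
}

int toy_register(struct toy_registry *r, int id)
{
	size_t i;
	int ret;

	ret = toy_setup(r);
	if (ret)
		return ret;

	/* Lock only around the duplicate check and the insertion. */
	r->lock_held = 1;
	for (i = 0; i < r->count; i++) {
		if (r->ids[i] == id) {
			ret = -EEXIST;
			break;
		}
	}
	if (!ret && r->count == TOY_MAX_PDEVS)
		ret = -ENOSPC;
	if (!ret)
		r->ids[r->count++] = id;
	r->lock_held = 0;

	if (ret)
		toy_teardown(r);	/* undo setup, again outside the lock */
	return ret;
}
```

Doing the duplicate check and the insertion in one critical section also
closes the window between the two. Whether the check should come before the
sysfs work in the real driver is a separate question this sketch does not
settle.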

> +
> +	list_add(&phy_dev->next, &phy_devices.dev_list);
> +	dev_info(dev, "MDEV: Registered\n");
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_unlock(&phy_devices.list_lock);
> +	kfree(phy_dev);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a physical device
> + * @dev: device structure representing physical device.
> + *
> + * Remove device from list of registered physical devices. Gives a chance to
> + * free existing mediated devices for the given physical device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct phy_device *phy_dev;
> +	struct mdev_device *vdev = NULL;
> +
> +	phy_dev = find_physical_device(dev);
> +
> +	if (!phy_dev)
> +		return;
> +
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	while ((vdev = find_next_mdev_device(phy_dev)))
> +		mdev_destroy_device(vdev);

Need to check the return value here, since ops->destroy may fail.
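
As far as I can read the quoted code, a failing ops->destroy leaves the mdev
attached to its physical device, so the lookup would return the same mdev
again and the while loop would not terminate. A minimal userspace sketch of
the alternative follows; all names (toy_parent, toy_child, toy_unregister)
are hypothetical and nothing here is taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_NCHILD 3

/* Hypothetical child (mediated) device hanging off one parent. */
struct toy_child {
	int attached;	/* still linked to the parent */
	int busy;	/* vendor driver refuses to destroy it */
};

struct toy_parent {
	int registered;
	struct toy_child child[TOY_NCHILD];
};

/* Return the first child still attached, or NULL if none is left. */
struct toy_child *toy_first_child(struct toy_parent *p)
{
	size_t i;

	for (i = 0; i < TOY_NCHILD; i++)
		if (p->child[i].attached)
			return &p->child[i];
	return NULL;
}

/* Destroy one child; report failure instead of returning silently. */
int toy_destroy_child(struct toy_child *c)
{
	assert(c->attached);
	if (c->busy)
		return -EBUSY;	/* child stays attached */
	c->attached = 0;
	return 0;
}

/*
 * Tear down all children, then the parent. Stopping on the first
 * error keeps the loop from retrying the same busy child forever.
 */
int toy_unregister(struct toy_parent *p)
{
	struct toy_child *c;
	int ret;

	while ((c = toy_first_child(p))) {
		ret = toy_destroy_child(c);
		if (ret)
			return ret;
	}
	p->registered = 0;
	return 0;
}
```

The caller then sees -EBUSY and can decide whether to retry later or refuse
the unregistration.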

> +
> +	mutex_lock(&phy_devices.list_lock);
> +	list_del(&phy_dev->next);
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	mdev_remove_attribute_group(dev,
> +				    phy_dev->ops->dev_attr_groups);
> +
> +	mdev_remove_sysfs_files(dev);
> +	kfree(phy_dev);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev-sysfs
> + */
> +
> +static struct mdev_device *mdev_device_alloc(uuid_le uuid, int instance)
> +{
> +	struct mdev_device *mdevice = NULL;
> +
> +	mdevice = kzalloc(sizeof(*mdevice), GFP_KERNEL);
> +	if (!mdevice)
> +		return ERR_PTR(-ENOMEM);
> +
> +	kref_init(&mdevice->kref);
> +	memcpy(&mdevice->uuid, &uuid, sizeof(uuid_le));
> +	mdevice->instance = instance;
> +	mutex_init(&mdevice->ops_lock);
> +
> +	return mdevice;
> +}
> +
> +static void mdev_device_release(struct device *dev)

What's the difference between this release and the earlier destroy version?

> +{
> +	struct mdev_device *mdevice = to_mdev_device(dev);
> +
> +	if (!mdevice)
> +		return;
> +
> +	dev_info(&mdevice->dev, "MDEV: destroying\n");
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_del(&mdevice->next);
> +	mutex_unlock(&mdevices.list_lock);
> +
> +	kfree(mdevice);
> +}
> +
> +int create_mdev_device(struct device *dev, uuid_le uuid, uint32_t instance,
> +		       char *mdev_params)
> +{
> +	int retval = 0;
> +	struct mdev_device *mdevice = NULL;
> +	struct phy_device *phy_dev;
> +
> +	phy_dev = find_physical_device(dev);
> +	if (!phy_dev)
> +		return -EINVAL;
> +
> +	mdevice = mdev_device_alloc(uuid, instance);
> +	if (IS_ERR(mdevice)) {
> +		retval = PTR_ERR(mdevice);
> +		return retval;
> +	}
> +
> +	mdevice->dev.parent  = dev;
> +	mdevice->dev.bus     = &mdev_bus_type;
> +	mdevice->dev.release = mdev_device_release;
> +	dev_set_name(&mdevice->dev, "%pUb-%d", uuid.b, instance);
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_add(&mdevice->next, &mdevices.dev_list);
> +	mutex_unlock(&mdevices.list_lock);

Update the list at the end; at this point even ops->create hasn't been
invoked yet.
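
A small userspace sketch of that ordering, with hypothetical names
(toy_dev, toy_create_fn, toy_create_device) that do not come from the patch:
every step that can fail runs first, and the device is linked into the list
only once it is fully set up.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical device record; 'next' links it into the global list. */
struct toy_dev {
	int id;
	struct toy_dev *next;
};

/* Vendor-style create callback: returns 0 or a negative errno. */
typedef int (*toy_create_fn)(int id);

struct toy_dev *toy_dev_list;	/* devices visible to lookups */

int toy_create_ok(int id)
{
	(void)id;
	return 0;
}

int toy_create_fail(int id)
{
	(void)id;
	return -ENOMEM;
}

/* Look a device up by id; NULL if it was never published. */
struct toy_dev *toy_find(int id)
{
	struct toy_dev *d;

	for (d = toy_dev_list; d; d = d->next)
		if (d->id == id)
			return d;
	return NULL;
}

/*
 * Run every step that can fail first and link the device into the
 * list last, so a failed create never leaves a stale entry behind.
 */
int toy_create_device(struct toy_dev *d, toy_create_fn create)
{
	int ret;

	assert(d != NULL);
	ret = create ? create(d->id) : 0;
	if (ret)
		return ret;

	d->next = toy_dev_list;
	toy_dev_list = d;
	return 0;
}
```

With this order the error path needs no list cleanup, and a concurrent
lookup can never find a half-created device.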

> +
> +	retval = device_register(&mdevice->dev);
> +	if (retval) {
> +		mdev_put_device(mdevice);
> +		return retval;
> +	}
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	if (phy_dev->ops->create) {
> +		retval = phy_dev->ops->create(dev, mdevice->uuid,
> +					      instance, mdev_params);
> +		if (retval)
> +			goto create_failed;
> +	}
> +
> +	retval = mdev_add_attribute_group(&mdevice->dev,
> +					  phy_dev->ops->mdev_attr_groups);
> +	if (retval)
> +		goto create_failed;
> +
> +	mdevice->phy_dev = phy_dev;
> +	mutex_unlock(&phy_devices.list_lock);
> +	mdev_get_device(mdevice);
> +	dev_info(&mdevice->dev, "MDEV: created\n");
> +
> +	return retval;
> +
> +create_failed:
> +	mutex_unlock(&phy_devices.list_lock);
> +	device_unregister(&mdevice->dev);
> +	return retval;
> +}
> +
> +int destroy_mdev_device(uuid_le uuid, uint32_t instance)
> +{
> +	struct mdev_device *vdev;
> +
> +	vdev = find_mdev_device(uuid, instance);
> +
> +	if (!vdev)
> +		return -EINVAL;
> +
> +	mdev_destroy_device(vdev);
> +	return 0;
> +}
> +
> +void get_mdev_supported_types(struct device *dev, char *str)
> +{
> +	struct phy_device *phy_dev;
> +
> +	phy_dev = find_physical_device(dev);
> +
> +	if (phy_dev) {
> +		mutex_lock(&phy_devices.list_lock);
> +		if (phy_dev->ops->supported_config)
> +			phy_dev->ops->supported_config(phy_dev->dev, str);
> +		mutex_unlock(&phy_devices.list_lock);
> +	}
> +}
> +
> +int mdev_start_callback(uuid_le uuid, uint32_t instance)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdevice;
> +	struct phy_device *phy_dev;
> +
> +	mdevice = find_mdev_device(uuid, instance);
> +
> +	if (!mdevice)
> +		return -EINVAL;
> +
> +	phy_dev = mdevice->phy_dev;
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	if (phy_dev->ops->start)
> +		ret = phy_dev->ops->start(mdevice->uuid);
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	if (ret < 0)
> +		pr_err("mdev_start failed  %d\n", ret);
> +	else
> +		kobject_uevent(&mdevice->dev.kobj, KOBJ_ONLINE);
> +
> +	return ret;
> +}
> +
> +int mdev_shutdown_callback(uuid_le uuid, uint32_t instance)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdevice;
> +	struct phy_device *phy_dev;
> +
> +	mdevice = find_mdev_device(uuid, instance);
> +
> +	if (!mdevice)
> +		return -EINVAL;
> +
> +	phy_dev = mdevice->phy_dev;
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	if (phy_dev->ops->shutdown)
> +		ret = phy_dev->ops->shutdown(mdevice->uuid);
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	if (ret < 0)
> +		pr_err("mdev_shutdown failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdevice->dev.kobj, KOBJ_OFFLINE);
> +
> +	return ret;
> +}
> +
> +static struct class mdev_class = {
> +	.name		= MDEV_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= mdev_class_attrs,
> +};
> +
> +static int __init mdev_init(void)
> +{
> +	int rc = 0;
> +
> +	mutex_init(&mdevices.list_lock);
> +	INIT_LIST_HEAD(&mdevices.dev_list);
> +	mutex_init(&phy_devices.list_lock);
> +	INIT_LIST_HEAD(&phy_devices.dev_list);
> +
> +	rc = class_register(&mdev_class);
> +	if (rc < 0) {
> +		pr_err("Failed to register mdev class\n");
> +		return rc;
> +	}
> +
> +	rc = mdev_bus_register();
> +	if (rc < 0) {
> +		pr_err("Failed to register mdev bus\n");
> +		class_unregister(&mdev_class);
> +		return rc;
> +	}
> +
> +	return rc;
> +}
> +
> +static void __exit mdev_exit(void)
> +{

Should we check here for any remaining mdev/pdev that were not cleaned
up correctly?
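
Such a check could be as simple as counting what is left on both lists and
warning about it. A userspace sketch, with hypothetical names (toy_node,
toy_exit_check) and fprintf standing in for a kernel warning:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical list node; a NULL head means an empty list. */
struct toy_node {
	struct toy_node *next;
};

size_t toy_list_len(const struct toy_node *head)
{
	size_t n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}

/*
 * Stand-in for a module-exit check: warn about leftover entries on
 * either list and return how many were found in total.
 */
size_t toy_exit_check(const struct toy_node *mdevs,
		      const struct toy_node *pdevs)
{
	size_t m = toy_list_len(mdevs);
	size_t p = toy_list_len(pdevs);

	if (m)
		fprintf(stderr, "toy: %zu mediated device(s) left at exit\n", m);
	if (p)
		fprintf(stderr, "toy: %zu physical device(s) left at exit\n", p);
	return m + p;
}
```

A non-zero result would point at a missing unregister or destroy call in
some vendor driver.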

> +	mdev_bus_unregister();
> +	class_unregister(&mdev_class);
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev-driver.c b/drivers/vfio/mdev/mdev-driver.c
> new file mode 100644
> index 000000000000..bc8a169782bc
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev-driver.c
> @@ -0,0 +1,139 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdevice_attach_iommu(struct mdev_device *mdevice)
> +{
> +	int retval = 0;
> +	struct iommu_group *group = NULL;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdevice->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	retval = iommu_group_add_device(group, &mdevice->dev);
> +	if (retval) {
> +		dev_err(&mdevice->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdevice->group = group;
> +
> +	dev_info(&mdevice->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return retval;
> +}
> +
> +static void mdevice_detach_iommu(struct mdev_device *mdevice)
> +{
> +	iommu_group_remove_device(&mdevice->dev);
> +	dev_info(&mdevice->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdevice_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdevice = to_mdev_device(dev);
> +	int status = 0;
> +
> +	status = mdevice_attach_iommu(mdevice);
> +	if (status) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return status;
> +	}
> +
> +	if (drv && drv->probe)
> +		status = drv->probe(dev);
> +
> +	return status;
> +}
> +
> +static int mdevice_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdevice = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdevice_detach_iommu(mdevice);
> +
> +	return 0;
> +}
> +
> +static int mdevice_match(struct device *dev, struct device_driver *drv)
> +{
> +	int ret = 0;
> +	struct mdev_driver *mdrv = to_mdev_driver(drv);
> +
> +	if (mdrv && mdrv->match)
> +		ret = mdrv->match(dev);
> +
> +	return ret;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.match		= mdevice_match,
> +	.probe		= mdevice_probe,
> +	.remove		= mdevice_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/**
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: owner module of driver to register
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/**
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev-sysfs.c b/drivers/vfio/mdev/mdev-sysfs.c
> new file mode 100644
> index 000000000000..79d351a7a502
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev-sysfs.c
> @@ -0,0 +1,312 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +/* Static functions */
> +
> +#define UUID_CHAR_LENGTH	36
> +#define UUID_BYTE_LENGTH	16
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
> +
> +static inline bool is_uuid_sep(char sep)
> +{
> +	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
> +		return true;
> +	return false;
> +}
> +
> +static int uuid_parse(const char *str, uuid_le *uuid)
> +{
> +	int i;
> +
> +	if (strlen(str) < UUID_CHAR_LENGTH)
> +		return -1;
> +
> +	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
> +		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
> +			pr_err("%s err", __func__);
> +			return -EINVAL;
> +		}
> +
> +		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
> +		str += 2;
> +		if (is_uuid_sep(*str))
> +			str++;
> +	}
> +
> +	return 0;
> +}
> +
> +/* Functions */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str;
> +	ssize_t n;
> +
> +	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	get_mdev_supported_types(dev, str);
> +
> +	n = sprintf(buf, "%s\n", str);
> +	kfree(str);
> +
> +	return n;
> +}
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *instance_str, *mdev_params = NULL;
> +	uuid_le uuid;
> +	uint32_t instance;
> +	int ret = 0;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_create: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev instance not present %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	instance_str = strsep(&str, ":");
> +	if (!instance_str) {
> +		pr_err("mdev_create: Empty instance string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	ret = kstrtouint(instance_str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev params not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	mdev_params = kstrdup(str, GFP_KERNEL);
> +
> +	if (!mdev_params) {
> +		ret = -ENOMEM;
> +		goto create_error;
> +	}
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_create: UUID parse error %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (create_mdev_device(dev, uuid, instance, mdev_params) < 0) {
> +		pr_err("mdev_create: Failed to create mdev device\n");
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +	ret = count;
> +
> +create_error:
> +	kfree(mdev_params);
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	char *uuid_str, *str, *pstr;
> +	uuid_le uuid;
> +	unsigned int instance;
> +	int ret;
> +
> +	str = pstr = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	if (str == NULL) {
> +		pr_err("mdev_destroy: instance not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = kstrtouint(str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_destroy: instance parsing error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_destroy: UUID parse error  %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = destroy_mdev_device(uuid, instance);
> +	if (ret < 0)
> +		goto destroy_error;
> +
> +	ret = count;
> +
> +destroy_error:
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
> +			 const char *buf, size_t count)
> +{
> +	char *uuid_str;
> +	uuid_le uuid;
> +	int ret = 0;
> +
> +	uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_start: UUID parse error  %s\n", buf);
> +		ret = -EINVAL;
> +		goto start_error;
> +	}
> +
> +	ret = mdev_start_callback(uuid, 0);
> +	if (ret < 0)
> +		goto start_error;
> +
> +	ret = count;
> +
> +start_error:
> +	kfree(uuid_str);
> +	return ret;
> +}
> +
> +ssize_t mdev_shutdown_store(struct class *class, struct class_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *uuid_str;
> +	uuid_le uuid;
> +	int ret = 0;
> +
> +	uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_shutdown: UUID parse error %s\n", buf);
> +		ret = -EINVAL;
> +		goto shutdown_error;
> +	}
> +
> +	ret = mdev_shutdown_callback(uuid, 0);
> +	if (ret < 0)
> +		goto shutdown_error;
> +
> +	ret = count;
> +
> +shutdown_error:
> +	kfree(uuid_str);
> +	return ret;
> +}
> +
> +struct class_attribute mdev_class_attrs[] = {
> +	__ATTR_WO(mdev_start),
> +	__ATTR_WO(mdev_shutdown),
> +	__ATTR_NULL
> +};
> +
> +int mdev_create_sysfs_files(struct device *dev)
> +{
> +	int retval;
> +
> +	retval = sysfs_create_file(&dev->kobj,
> +				   &dev_attr_mdev_supported_types.attr);
> +	if (retval) {
> +		pr_err("Failed to create mdev_supported_types sysfs entry\n");
> +		return retval;
> +	}
> +
> +	retval = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	if (retval) {
> +		pr_err("Failed to create mdev_create sysfs entry\n");
> +		return retval;
> +	}
> +
> +	retval = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +	if (retval) {
> +		pr_err("Failed to create mdev_destroy sysfs entry\n");
> +		return retval;
> +	}
> +
> +	return 0;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..a472310c7749
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,33 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  create_mdev_device(struct device *dev, uuid_le uuid, uint32_t instance,
> +			char *mdev_params);
> +int  destroy_mdev_device(uuid_le uuid, uint32_t instance);
> +void get_mdev_supported_types(struct device *dev, char *str);
> +int  mdev_start_callback(uuid_le uuid, uint32_t instance);
> +int  mdev_shutdown_callback(uuid_le uuid, uint32_t instance);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..d9633acd85f2
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,224 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +/* Common Data structures */
> +
> +struct pci_region_info {
> +	uint64_t start;
> +	uint64_t size;
> +	uint32_t flags;		/*!< VFIO region info flags */
> +};
> +
> +enum mdev_emul_space {
> +	EMUL_CONFIG_SPACE,	/*!< PCI configuration space */
> +	EMUL_IO,		/*!< I/O register space */
> +	EMUL_MMIO		/*!< Memory-mapped I/O space */
> +};
> +
> +struct phy_device;
> +
> +/*
> + * Mediated device
> + */
> +
> +struct mdev_device {
> +	struct kref		kref;
> +	struct device		dev;
> +	struct phy_device	*phy_dev;
> +	struct iommu_group	*group;
> +	void			*iommu_data;
> +	uuid_le			uuid;
> +	uint32_t		instance;
> +	void			*driver_data;
> +	struct mutex		ops_lock;
> +	struct list_head	next;
> +};
> +
> +
> +/**
> + * struct phy_device_ops - Structure to be registered for each physical device
> + * to register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Default attributes of the physical device.
> + * @mdev_attr_groups:	Default attributes of the mediated device.
> + * @supported_config:	Called to get information about supported types.
> + *			@dev : device structure of physical device.
> + *			@config: should return string listing supported config
> + *			Returns integer: success (0) or error (< 0)
> + * @create:		Called to allocate basic resources in physical device's
> + *			driver for a particular mediated device
> + *			@dev: physical pci device structure on which mediated
> + *			      device should be created
> + *			@uuid: VM's uuid for which VM it is intended to
> + *			@instance: mediated instance in that VM
> + *			@mdev_params: extra parameters required by physical
> + *			device's driver.
> + *			Returns integer: success (0) or error (< 0)
> + * @destroy:		Called to free resources in physical device's driver for
> + *			a mediated device instance of that VM.
> + *			@dev: physical device structure to which this mediated
> + *			      device points to.
> + *			@uuid: VM's uuid for which the mediated device belongs
> + *			@instance: mdev instance in that VM
> + *			Returns integer: success (0) or error (< 0)
> + *			If VM is running and destroy() is called that means the
> + *			mdev is being hot-unplugged. Return error if VM is running
> + *			and driver doesn't support mediated device hotplug.
> + * @start:		Called to initiate mediated device initialization
> + *			process in physical device's driver when VM boots before
> + *			qemu starts.
> + *			@uuid: VM's UUID which is booting.
> + *			Returns integer: success (0) or error (< 0)
> + * @shutdown:		Called to teardown mediated device related resources for
> + *			the VM
> + *			@uuid: VM's UUID which is shutting down .
> + *			Returns integer: success (0) or error (< 0)
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number bytes to read
> + *			@address_space: specifies for which address
> + *			space the request is: pci_config_space, IO
> + *			register space or MMIO space.
> + *			@pos: offset from base address.
> + *			Returns number of bytes read on success or error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number bytes to be written
> + *			@address_space: specifies for which address space the
> + *			request is: pci_config_space, IO register space or MMIO
> + *			space.
> + *			@pos: offset from base address.
> + *			Returns number of bytes written on success or error.
> + * @set_irqs:		Called to send about interrupts configuration
> + *			information that VMM sets.
> + *			@mdev: mediated device structure
> + *			@flags, index, start, count and *data : same as that of
> + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> + * @get_region_info:	Called to get BAR size and flags of mediated device.
> + *			@mdev: mediated device structure
> + *			@region_index: VFIO region index
> + *			@region_info: output, returns size and flags of
> + *				      requested region.
> + *			Returns integer: success (0) or error (< 0)
> + * @validate_map_request: Validate remap pfn request
> + *			@mdev: mediated device structure
> + *			@virtaddr: target user address to start at
> + *			@pfn: physical address of kernel memory, vendor driver
> + *			      can change if required.
> + *			@size: size of map area, vendor driver can change the
> + *			       size of map area if desired.
> + *			@prot: page protection flags for this mapping, vendor
> + *			       driver can change, if required.
> + *			Returns integer: success (0) or error (< 0)
> + *
> + * Physical device that support mediated device should be registered with mdev
> + * module with phy_device_ops structure.
> + */
> +
> +struct phy_device_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +
> +	int	(*supported_config)(struct device *dev, char *config);
> +	int     (*create)(struct device *dev, uuid_le uuid,
> +			  uint32_t instance, char *mdev_params);
> +	int     (*destroy)(struct device *dev, uuid_le uuid,
> +			   uint32_t instance);
> +	int     (*start)(uuid_le uuid);
> +	int     (*shutdown)(uuid_le uuid);
> +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> +			enum mdev_emul_space address_space, loff_t pos);
> +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> +			 enum mdev_emul_space address_space, loff_t pos);
> +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> +			    unsigned int index, unsigned int start,
> +			    unsigned int count, void *data);
> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> +				 struct pci_region_info *region_info);
> +	int	(*validate_map_request)(struct mdev_device *vdev,
> +					unsigned long virtaddr,
> +					unsigned long *pfn, unsigned long *size,
> +					pgprot_t *prot);
> +};
> +
> +/*
> + * Physical Device
> + */
> +struct phy_device {
> +	struct device                   *dev;
> +	const struct phy_device_ops     *ops;
> +	struct list_head                next;
> +};
> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @match: called when new device or driver is added for this bus. Return 1 if
> + *	   given device can be handled by given driver and zero otherwise.
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	int  (*match)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}
> +
> +static inline struct mdev_device *mdev_get_device(struct mdev_device *vdev)
> +{
> +	return (vdev && get_device(&vdev->dev)) ? vdev : NULL;
> +}
> +
> +static inline  void mdev_put_device(struct mdev_device *vdev)
> +{
> +	if (vdev)
> +		put_device(&vdev->dev);
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct phy_device_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +extern int mdev_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
> +				uint32_t len, uint32_t flags);
> +
> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
> +
> +#endif /* MDEV_H */
> --
> 2.7.0
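
For anyone who wants to exercise the UUID handling from mdev-sysfs.c outside
the kernel: uuid_parse() in the patch skips any of '\n', '-', ':' or '\0'
after every byte, so it accepts separators at arbitrary positions. Below is a
minimal userspace sketch of a stricter variant that only accepts dashes at the
canonical offsets. The names uuid_parse_strict() and hex_nibble() are made up
for this example (hex_nibble() stands in for the kernel's hex_to_bin()); this
is not what the patch does, only one possible tightening.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define UUID_CHAR_LENGTH	36
#define UUID_BYTE_LENGTH	16

/* Userspace stand-in for the kernel's hex_to_bin(); -1 on a non-hex char. */
static int hex_nibble(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	c = (char)tolower((unsigned char)c);
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/*
 * Hypothetical strict variant: a '-' is required at offsets 8, 13, 18 and
 * 23 and no other separator is skipped. Returns 0 on success, -1 otherwise.
 */
static int uuid_parse_strict(const char *str,
			     unsigned char out[UUID_BYTE_LENGTH])
{
	size_t pos = 0;
	int i;

	assert(str != NULL && out != NULL);

	if (strlen(str) < UUID_CHAR_LENGTH)
		return -1;

	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
		int hi, lo;

		if (pos == 8 || pos == 13 || pos == 18 || pos == 23) {
			if (str[pos] != '-')
				return -1;
			pos++;
		}

		hi = hex_nibble(str[pos]);
		lo = hex_nibble(str[pos + 1]);
		if (hi < 0 || lo < 0)
			return -1;

		out[i] = (unsigned char)((hi << 4) | lo);
		pos += 2;
	}

	return 0;
}
```

Because the length is checked up front and at most 36 characters are consumed,
the loop never reads past the terminating NUL.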



* Re: [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver
@ 2016-05-25  7:55     ` Tian, Kevin
  0 siblings, 0 replies; 92+ messages in thread
From: Tian, Kevin @ 2016-05-25  7:55 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Wednesday, May 25, 2016 3:58 AM
> 
> Design for Mediated Device Driver:
> Main purpose of this driver is to provide a common interface for mediated
> device management that can be used by different drivers of different
> devices.
> 
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
> 
> Below is the high Level block diagram, with Nvidia, Intel and IBM devices
> as example, since these are the devices which are going to actively use
> this module as of now.
> 
>  +---------------+
>  |               |
>  | +-----------+ |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |  mdev     | +------------------------>+              |<-> VFIO user
>  | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
>  | |  driver   | |                         |              |
>  | |           | |                         +--------------+
>  | |           | |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |           | +------------------------>+              |<-> VFIO user
>  | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
>  |               |                         |              |
>  |  MDEV CORE    |                         +--------------+
>  |   MODULE      |
>  |   mdev.ko     |
>  | +-----------+ |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         |  nvidia.ko   |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | | Physical  | |
>  | |  device   | |  mdev_register_device() +--------------+
>  | | interface | |<------------------------+              |
>  | |           | |                         |  i915.ko     |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | |           | |
>  | |           | |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         | ccw_device.ko|<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | +-----------+ |
>  +---------------+
> 
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
> 
> /**
>   * struct mdev_driver - Mediated device's driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove:called when device removed
>   * @match: called when new device or driver is added for this bus.
> 	    Return 1 if given device can be handled by given driver and
> 	    zero otherwise.
>   * @driver:device driver structure
>   *
>   **/
> struct mdev_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
> 	 int  (*match)(struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
> 
> Mediated device's driver for mdev should use this interface to register
> with Core driver. With this, mediated devices driver for such devices is
> responsible to add mediated device to VFIO group.
> 
> 2. Physical device driver interface
> This interface provides the vendor driver a set of APIs to manage physical
> device related work in their own driver. APIs are :
> - supported_config: provide supported configuration list by the vendor
> 		    driver
> - create: to allocate basic resources in vendor driver for a mediated
> 	  device.
> - destroy: to free resources in vendor driver when mediated device is
> 	   destroyed.
> - start: to initiate mediated device initialization process from vendor
> 	 driver when VM boots and before QEMU starts.
> - shutdown: to teardown mediated device resources during VM teardown.
> - read : read emulation callback.
> - write: write emulation callback.
> - set_irqs: send interrupt configuration information that QEMU sets.
> - get_region_info: to provide region size and its flags for the mediated
> 		   device.
> - validate_map_request: to validate remap pfn request.
> 
> This registration interface should be used by vendor drivers to register
> each physical device to mdev core driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I88f4482f7608f40550a152c5f882b64271287c62
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/Makefile            |   1 +
>  drivers/vfio/mdev/Kconfig        |  11 +
>  drivers/vfio/mdev/Makefile       |   5 +
>  drivers/vfio/mdev/mdev-core.c    | 462 +++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev-driver.c  | 139 ++++++++++++
>  drivers/vfio/mdev/mdev-sysfs.c   | 312 ++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev_private.h |  33 +++
>  include/linux/mdev.h             | 224 +++++++++++++++++++
>  9 files changed, 1188 insertions(+)
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev-core.c
>  create mode 100644 drivers/vfio/mdev/mdev-driver.c
>  create mode 100644 drivers/vfio/mdev/mdev-sysfs.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h
>  create mode 100644 include/linux/mdev.h
> 
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
> 
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>  source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..7c70753e54ab 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..951e2bb06a3f
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,11 @@
> +
> +config MDEV
> +    tristate "Mediated device driver framework"

Sorry, I'm not a native speaker. Would it be cleaner to say "Driver framework
for Mediated Devices" or "Mediated Device Framework"? Should the focus here be
on the driver or on the device?

> +    depends on VFIO
> +    default n
> +    help
> +        MDEV provides a framework to virtualize device without SR-IOV cap
> +        See Documentation/mdev.txt for more details.

It looks like Documentation/mdev.txt is not included in this version.

> +
> +        If you don't know what to do here, say N.
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..4adb069febce
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,5 @@
> +
> +mdev-y := mdev-core.o mdev-sysfs.o mdev-driver.o
> +
> +obj-$(CONFIG_MDEV) += mdev.o
> +
> diff --git a/drivers/vfio/mdev/mdev-core.c b/drivers/vfio/mdev/mdev-core.c
> new file mode 100644
> index 000000000000..af070d73735f
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev-core.c
> @@ -0,0 +1,462 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/cdev.h>
> +#include <linux/sched.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +#define MDEV_CLASS_NAME		"mdev"
> +
> +/*
> + * Global Structures
> + */
> +
> +static struct devices_list {
> +	struct list_head    dev_list;
> +	struct mutex        list_lock;
> +} mdevices, phy_devices;

phy_devices -> pdevices? Similarly, we could use the pdev/mdev
pair in other places...

> +
> +/*
> + * Functions
> + */
> +
> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}
> +
> +static struct mdev_device *find_mdev_device(uuid_le uuid, int instance)

Can we just call it "struct mdev" or "mdevice"? "dev_device" looks redundant.

Sorry, I have to ask the same question again since I haven't got an answer
yet: what exactly does 'instance' mean here? Since the uuid is unique, why do
we need to match the instance too?
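
Going only by the kernel-doc in include/linux/mdev.h ("@uuid: VM's uuid",
"@instance: mediated instance in that VM"), my reading is that the uuid
identifies the VM rather than the device, in which case the lookup key would
be the (uuid, instance) pair. A minimal userspace sketch of that reading
follows; struct toy_mdev and toy_find() are hypothetical names for this
example, not anything from the patch, and the interpretation itself is an
assumption that still needs confirming.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define UUID_LEN 16

/* Hypothetical userspace stand-in for struct mdev_device. */
struct toy_mdev {
	unsigned char uuid[UUID_LEN];	/* VM UUID, per the kernel-doc */
	unsigned int instance;		/* index of the mdev within that VM */
};

/* Return the entry matching both uuid and instance, or NULL if none. */
static struct toy_mdev *toy_find(struct toy_mdev *tbl, size_t n,
				 const unsigned char *uuid,
				 unsigned int instance)
{
	size_t i;

	assert(tbl != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (memcmp(tbl[i].uuid, uuid, UUID_LEN) == 0 &&
		    tbl[i].instance == instance)
			return &tbl[i];
	}

	return NULL;
}
```

With two entries that share a uuid, only the instance tells them apart, which
would explain the double match if the reading above is right.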

> +{
> +	struct mdev_device *vdev = NULL, *v;

Better to unify the naming here. What's the difference between mdev
and vdev?

> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_for_each_entry(v, &mdevices.dev_list, next) {
> +		if ((uuid_le_cmp(v->uuid, uuid) == 0) &&
> +		    (v->instance == instance)) {
> +			vdev = v;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&mdevices.list_lock);
> +	return vdev;
> +}
> +
> +static struct mdev_device *find_next_mdev_device(struct phy_device *phy_dev)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_for_each_entry(p, &mdevices.dev_list, next) {
> +		if (p->phy_dev == phy_dev) {
> +			mdev = p;
> +			break;
> +		}
> +	}

It looks like the above finds the first mdev for a given physical device
rather than the next one.

> +	mutex_unlock(&mdevices.list_lock);
> +	return mdev;
> +}
> +
> +static struct phy_device *find_physical_device(struct device *dev)
> +{
> +	struct phy_device *pdev = NULL, *p;
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	list_for_each_entry(p, &phy_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			pdev = p;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&phy_devices.list_lock);
> +	return pdev;
> +}
> +
> +static void mdev_destroy_device(struct mdev_device *mdevice)
> +{
> +	struct phy_device *phy_dev = mdevice->phy_dev;
> +
> +	if (phy_dev) {
> +		mutex_lock(&phy_devices.list_lock);
> +
> +		/*
> +		* If vendor driver doesn't return success that means vendor
> +		* driver doesn't support hot-unplug
> +		*/
> +		if (phy_dev->ops->destroy) {
> +			if (phy_dev->ops->destroy(phy_dev->dev, mdevice->uuid,
> +						  mdevice->instance)) {
> +				mutex_unlock(&phy_devices.list_lock);

A warning message would be preferable here. It would also be better to return
-EBUSY (the function would then have to return int instead of void).
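
To make the suggestion concrete, here is a rough userspace sketch of a destroy
path that logs and propagates the vendor driver's refusal. Everything named
toy_* is hypothetical and only the control flow is meant to carry over;
fprintf() stands in for a kernel warning.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the destroy hook in phy_device_ops. */
struct toy_ops {
	int (*destroy)(int id);		/* 0 on success, < 0 on failure */
};

struct toy_mdev {
	int id;
	const struct toy_ops *ops;
	int destroyed;			/* set once teardown has completed */
};

/* Warn and return -EBUSY when the vendor hook refuses the destroy. */
static int toy_destroy_device(struct toy_mdev *m)
{
	assert(m != NULL);

	if (m->ops && m->ops->destroy) {
		int ret = m->ops->destroy(m->id);

		if (ret) {
			fprintf(stderr, "mdev %d: vendor destroy failed (%d)\n",
				m->id, ret);
			return -EBUSY;
		}
	}

	m->destroyed = 1;
	return 0;
}

/* Two example hooks: one that refuses and one that accepts. */
static int toy_refuse(int id) { (void)id; return -1; }
static int toy_accept(int id) { (void)id; return 0; }
```

The caller can then decide whether to retry or give up instead of silently
continuing with a half-destroyed device.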

> +				return;
> +			}
> +		}
> +
> +		mdev_remove_attribute_group(&mdevice->dev,
> +					    phy_dev->ops->mdev_attr_groups);
> +		mdevice->phy_dev = NULL;

Am I missing something here? You didn't remove this mdev node from
the list, and below...

> +		mutex_unlock(&phy_devices.list_lock);

you should use the mutex of the mdevices list.

> +	}
> +
> +	mdev_put_device(mdevice);
> +	device_unregister(&mdevice->dev);
> +}
> +
> +/*
> + * Find mediated device from given iommu_group and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_for_each_entry(p, &mdevices.dev_list, next) {
> +		if (!p->group)
> +			continue;
> +
> +		if (iommu_group_id(p->group) == iommu_group_id(group)) {
> +			mdev = mdev_get_device(p);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&mdevices.list_lock);
> +	return mdev;
> +}
> +EXPORT_SYMBOL_GPL(mdev_get_device_by_group);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing physical device.
> + * @phy_device_ops: Physical device operation structure to be registered.
> + *
> + * Add device to list of registered physical devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct phy_device_ops *ops)
> +{
> +	int ret = 0;
> +	struct phy_device *phy_dev, *pdev;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	/* Check for duplicate */
> +	pdev = find_physical_device(dev);
> +	if (pdev)
> +		return -EEXIST;
> +
> +	phy_dev = kzalloc(sizeof(*phy_dev), GFP_KERNEL);
> +	if (!phy_dev)
> +		return -ENOMEM;
> +
> +	phy_dev->dev = dev;
> +	phy_dev->ops = ops;
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	ret = mdev_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;

Is there any reason to do the sysfs operations under the mutex, which is
purely about the phy_devices list?

> +
> +	list_add(&phy_dev->next, &phy_devices.dev_list);
> +	dev_info(dev, "MDEV: Registered\n");
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_unlock(&phy_devices.list_lock);
> +	kfree(phy_dev);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a physical device
> + * @dev: device structure representing physical device.
> + *
> + * Remove device from list of registered physical devices. Gives a chance to
> + * free existing mediated devices for the given physical device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct phy_device *phy_dev;
> +	struct mdev_device *vdev = NULL;
> +
> +	phy_dev = find_physical_device(dev);
> +
> +	if (!phy_dev)
> +		return;
> +
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	while ((vdev = find_next_mdev_device(phy_dev)))
> +		mdev_destroy_device(vdev);

The return value needs to be checked here, since ops->destroy may fail.
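
As far as I can tell, when ops->destroy fails mdev_destroy_device() returns
before clearing mdevice->phy_dev, so find_next_mdev_device() would hand back
the same device again and the while loop would never terminate. A small
userspace model of a loop that stops on the first failure (all toy_* names are
hypothetical, and the "busy" flag merely simulates a vendor driver that
refuses hot-unplug):

```c
#include <assert.h>
#include <stddef.h>

struct toy_mdev {
	int parent;	/* id of the owning physical device, -1 once detached */
	int busy;	/* non-zero: simulated vendor driver refuses destroy */
};

/* Same shape as find_next_mdev_device(): first entry for this parent. */
static struct toy_mdev *toy_find_first(struct toy_mdev *tbl, size_t n,
				       int parent)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (tbl[i].parent == parent)
			return &tbl[i];
	return NULL;
}

/* Detach the device from its parent unless it is busy. */
static int toy_destroy(struct toy_mdev *m)
{
	if (m->busy)
		return -1;
	m->parent = -1;
	return 0;
}

/* Returns 0 when every child is gone, or the first destroy error. */
static int toy_unregister(struct toy_mdev *tbl, size_t n, int parent)
{
	struct toy_mdev *m;

	assert(tbl != NULL);

	while ((m = toy_find_first(tbl, n, parent)) != NULL) {
		int ret = toy_destroy(m);

		if (ret)
			return ret;	/* without this the loop would spin */
	}

	return 0;
}
```

Whether unregister should then fail, or force the teardown some other way, is
a separate design question.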

> +
> +	mutex_lock(&phy_devices.list_lock);
> +	list_del(&phy_dev->next);
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	mdev_remove_attribute_group(dev,
> +				    phy_dev->ops->dev_attr_groups);
> +
> +	mdev_remove_sysfs_files(dev);
> +	kfree(phy_dev);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev-sysfs
> + */
> +
> +static struct mdev_device *mdev_device_alloc(uuid_le uuid, int instance)
> +{
> +	struct mdev_device *mdevice = NULL;
> +
> +	mdevice = kzalloc(sizeof(*mdevice), GFP_KERNEL);
> +	if (!mdevice)
> +		return ERR_PTR(-ENOMEM);
> +
> +	kref_init(&mdevice->kref);
> +	memcpy(&mdevice->uuid, &uuid, sizeof(uuid_le));
> +	mdevice->instance = instance;
> +	mutex_init(&mdevice->ops_lock);
> +
> +	return mdevice;
> +}
> +
> +static void mdev_device_release(struct device *dev)

What's the difference between this release function and the earlier destroy
version?

> +{
> +	struct mdev_device *mdevice = to_mdev_device(dev);
> +
> +	if (!mdevice)
> +		return;
> +
> +	dev_info(&mdevice->dev, "MDEV: destroying\n");
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_del(&mdevice->next);
> +	mutex_unlock(&mdevices.list_lock);
> +
> +	kfree(mdevice);
> +}
> +
> +int create_mdev_device(struct device *dev, uuid_le uuid, uint32_t instance,
> +		       char *mdev_params)
> +{
> +	int retval = 0;
> +	struct mdev_device *mdevice = NULL;
> +	struct phy_device *phy_dev;
> +
> +	phy_dev = find_physical_device(dev);
> +	if (!phy_dev)
> +		return -EINVAL;
> +
> +	mdevice = mdev_device_alloc(uuid, instance);
> +	if (IS_ERR(mdevice)) {
> +		retval = PTR_ERR(mdevice);
> +		return retval;
> +	}
> +
> +	mdevice->dev.parent  = dev;
> +	mdevice->dev.bus     = &mdev_bus_type;
> +	mdevice->dev.release = mdev_device_release;
> +	dev_set_name(&mdevice->dev, "%pUb-%d", uuid.b, instance);
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_add(&mdevice->next, &mdevices.dev_list);
> +	mutex_unlock(&mdevices.list_lock);

Update the list at the end; at this point even ops->create hasn't been invoked yet.
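
A small user-space sketch of the ordering I have in mind (not kernel code;
registry, fake_create and create_and_publish are hypothetical names used only
for illustration). The entry is added to the list only after creation has
succeeded, so a failure leaves nothing to unwind from the list:

```c
#include <stddef.h>

#define MAX_DEVS 4

/* Hypothetical stand-in for the global device list. */
struct registry {
	int ids[MAX_DEVS];
	size_t count;
};

/* Models ops->create(): fails for negative ids in this sketch. */
static int fake_create(int id)
{
	return id < 0 ? -22 : 0;
}

/* Create first, publish last. */
static int create_and_publish(struct registry *r, int id)
{
	int rc;

	if (r->count >= MAX_DEVS)
		return -12;

	rc = fake_create(id);
	if (rc < 0)
		return rc;	/* never published, so nothing to remove */

	r->ids[r->count++] = id;
	return 0;
}
```

With this ordering, lookups through the list can never observe a device
whose create callback has not completed.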

> +
> +	retval = device_register(&mdevice->dev);
> +	if (retval) {
> +		mdev_put_device(mdevice);
> +		return retval;
> +	}
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	if (phy_dev->ops->create) {
> +		retval = phy_dev->ops->create(dev, mdevice->uuid,
> +					      instance, mdev_params);
> +		if (retval)
> +			goto create_failed;
> +	}
> +
> +	retval = mdev_add_attribute_group(&mdevice->dev,
> +					  phy_dev->ops->mdev_attr_groups);
> +	if (retval)
> +		goto create_failed;
> +
> +	mdevice->phy_dev = phy_dev;
> +	mutex_unlock(&phy_devices.list_lock);
> +	mdev_get_device(mdevice);
> +	dev_info(&mdevice->dev, "MDEV: created\n");
> +
> +	return retval;
> +
> +create_failed:
> +	mutex_unlock(&phy_devices.list_lock);
> +	device_unregister(&mdevice->dev);
> +	return retval;
> +}
> +
> +int destroy_mdev_device(uuid_le uuid, uint32_t instance)
> +{
> +	struct mdev_device *vdev;
> +
> +	vdev = find_mdev_device(uuid, instance);
> +
> +	if (!vdev)
> +		return -EINVAL;
> +
> +	mdev_destroy_device(vdev);
> +	return 0;
> +}
> +
> +void get_mdev_supported_types(struct device *dev, char *str)
> +{
> +	struct phy_device *phy_dev;
> +
> +	phy_dev = find_physical_device(dev);
> +
> +	if (phy_dev) {
> +		mutex_lock(&phy_devices.list_lock);
> +		if (phy_dev->ops->supported_config)
> +			phy_dev->ops->supported_config(phy_dev->dev, str);
> +		mutex_unlock(&phy_devices.list_lock);
> +	}
> +}
> +
> +int mdev_start_callback(uuid_le uuid, uint32_t instance)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdevice;
> +	struct phy_device *phy_dev;
> +
> +	mdevice = find_mdev_device(uuid, instance);
> +
> +	if (!mdevice)
> +		return -EINVAL;
> +
> +	phy_dev = mdevice->phy_dev;
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	if (phy_dev->ops->start)
> +		ret = phy_dev->ops->start(mdevice->uuid);
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	if (ret < 0)
> +		pr_err("mdev_start failed  %d\n", ret);
> +	else
> +		kobject_uevent(&mdevice->dev.kobj, KOBJ_ONLINE);
> +
> +	return ret;
> +}
> +
> +int mdev_shutdown_callback(uuid_le uuid, uint32_t instance)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdevice;
> +	struct phy_device *phy_dev;
> +
> +	mdevice = find_mdev_device(uuid, instance);
> +
> +	if (!mdevice)
> +		return -EINVAL;
> +
> +	phy_dev = mdevice->phy_dev;
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	if (phy_dev->ops->shutdown)
> +		ret = phy_dev->ops->shutdown(mdevice->uuid);
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	if (ret < 0)
> +		pr_err("mdev_shutdown failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdevice->dev.kobj, KOBJ_OFFLINE);
> +
> +	return ret;
> +}
> +
> +static struct class mdev_class = {
> +	.name		= MDEV_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= mdev_class_attrs,
> +};
> +
> +static int __init mdev_init(void)
> +{
> +	int rc = 0;
> +
> +	mutex_init(&mdevices.list_lock);
> +	INIT_LIST_HEAD(&mdevices.dev_list);
> +	mutex_init(&phy_devices.list_lock);
> +	INIT_LIST_HEAD(&phy_devices.dev_list);
> +
> +	rc = class_register(&mdev_class);
> +	if (rc < 0) {
> +		pr_err("Failed to register mdev class\n");
> +		return rc;
> +	}
> +
> +	rc = mdev_bus_register();
> +	if (rc < 0) {
> +		pr_err("Failed to register mdev bus\n");
> +		class_unregister(&mdev_class);
> +		return rc;
> +	}
> +
> +	return rc;
> +}
> +
> +static void __exit mdev_exit(void)
> +{

Should we check here for any remaining mdev/pdev that were not cleaned
up correctly?
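
As a rough user-space sketch of such a check (not kernel code; node,
count_remaining and warn_if_leaked are hypothetical names), the exit path
could count what is still on each list and complain if anything is left:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical singly linked list node standing in for a list entry. */
struct node {
	struct node *next;
};

/* Count the entries still reachable from head. */
static size_t count_remaining(const struct node *head)
{
	size_t n = 0;

	for (; head; head = head->next)
		n++;
	return n;
}

/* Report leftovers on stderr; returns how many entries were found. */
static size_t warn_if_leaked(const char *what, const struct node *head)
{
	size_t n = count_remaining(head);

	if (n)
		fprintf(stderr, "%s: %zu entries not cleaned up\n", what, n);
	return n;
}
```

In the module itself the equivalent would presumably be a warning on a
non-empty list, but that is an assumption on my side, not something the
patch already does.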

> +	mdev_bus_unregister();
> +	class_unregister(&mdev_class);
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev-driver.c b/drivers/vfio/mdev/mdev-driver.c
> new file mode 100644
> index 000000000000..bc8a169782bc
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev-driver.c
> @@ -0,0 +1,139 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdevice_attach_iommu(struct mdev_device *mdevice)
> +{
> +	int retval = 0;
> +	struct iommu_group *group = NULL;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdevice->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	retval = iommu_group_add_device(group, &mdevice->dev);
> +	if (retval) {
> +		dev_err(&mdevice->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdevice->group = group;
> +
> +	dev_info(&mdevice->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));
> +attach_fail:
> +	iommu_group_put(group);
> +	return retval;
> +}
> +
> +static void mdevice_detach_iommu(struct mdev_device *mdevice)
> +{
> +	iommu_group_remove_device(&mdevice->dev);
> +	dev_info(&mdevice->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdevice_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdevice = to_mdev_device(dev);
> +	int status = 0;
> +
> +	status = mdevice_attach_iommu(mdevice);
> +	if (status) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return status;
> +	}
> +
> +	if (drv && drv->probe)
> +		status = drv->probe(dev);
> +
> +	return status;
> +}
> +
> +static int mdevice_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdevice = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdevice_detach_iommu(mdevice);
> +
> +	return 0;
> +}
> +
> +static int mdevice_match(struct device *dev, struct device_driver *drv)
> +{
> +	int ret = 0;
> +	struct mdev_driver *mdrv = to_mdev_driver(drv);
> +
> +	if (mdrv && mdrv->match)
> +		ret = mdrv->match(dev);
> +
> +	return ret;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.match		= mdevice_match,
> +	.probe		= mdevice_probe,
> +	.remove		= mdevice_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/**
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: owner module of driver to register
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/**
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev-sysfs.c b/drivers/vfio/mdev/mdev-sysfs.c
> new file mode 100644
> index 000000000000..79d351a7a502
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev-sysfs.c
> @@ -0,0 +1,312 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +/* Static functions */
> +
> +#define UUID_CHAR_LENGTH	36
> +#define UUID_BYTE_LENGTH	16
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
> +
> +static inline bool is_uuid_sep(char sep)
> +{
> +	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
> +		return true;
> +	return false;
> +}
> +
> +static int uuid_parse(const char *str, uuid_le *uuid)
> +{
> +	int i;
> +
> +	if (strlen(str) < UUID_CHAR_LENGTH)
> +		return -1;
> +
> +	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
> +		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
> +			pr_err("%s err\n", __func__);
> +			return -EINVAL;
> +		}
> +
> +		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
> +		str += 2;
> +		if (is_uuid_sep(*str))
> +			str++;
> +	}
> +
> +	return 0;
> +}
> +
> +/* Functions */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str;
> +	ssize_t n;
> +
> +	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	get_mdev_supported_types(dev, str);
> +
> +	n = sprintf(buf, "%s\n", str);
> +	kfree(str);
> +
> +	return n;
> +}
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *instance_str, *mdev_params = NULL;
> +	uuid_le uuid;
> +	uint32_t instance;
> +	int ret = 0;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_create: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev instance not present %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	instance_str = strsep(&str, ":");
> +	if (!instance_str) {
> +		pr_err("mdev_create: Empty instance string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	ret = kstrtouint(instance_str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev params not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	mdev_params = kstrdup(str, GFP_KERNEL);
> +
> +	if (!mdev_params) {
> +		ret = -ENOMEM;
> +		goto create_error;
> +	}
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_create: UUID parse error %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (create_mdev_device(dev, uuid, instance, mdev_params) < 0) {
> +		pr_err("mdev_create: Failed to create mdev device\n");
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +	ret = count;
> +
> +create_error:
> +	kfree(mdev_params);
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	char *uuid_str, *str, *pstr;
> +	uuid_le uuid;
> +	unsigned int instance;
> +	int ret;
> +
> +	str = pstr = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	if (str == NULL) {
> +		pr_err("mdev_destroy: instance not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = kstrtouint(str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_destroy: instance parsing error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_destroy: UUID parse error  %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = destroy_mdev_device(uuid, instance);
> +	if (ret < 0)
> +		goto destroy_error;
> +
> +	ret = count;
> +
> +destroy_error:
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
> +			 const char *buf, size_t count)
> +{
> +	char *uuid_str;
> +	uuid_le uuid;
> +	int ret = 0;
> +
> +	uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_start: UUID parse error  %s\n", buf);
> +		ret = -EINVAL;
> +		goto start_error;
> +	}
> +
> +	ret = mdev_start_callback(uuid, 0);
> +	if (ret < 0)
> +		goto start_error;
> +
> +	ret = count;
> +
> +start_error:
> +	kfree(uuid_str);
> +	return ret;
> +}
> +
> +ssize_t mdev_shutdown_store(struct class *class, struct class_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *uuid_str;
> +	uuid_le uuid;
> +	int ret = 0;
> +
> +	uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_shutdown: UUID parse error %s\n", buf);
> +		ret = -EINVAL;
> +		goto shutdown_error;
> +	}
> +
> +	ret = mdev_shutdown_callback(uuid, 0);
> +	if (ret < 0)
> +		goto shutdown_error;
> +
> +	ret = count;
> +
> +shutdown_error:
> +	kfree(uuid_str);
> +	return ret;
> +}
> +
> +struct class_attribute mdev_class_attrs[] = {
> +	__ATTR_WO(mdev_start),
> +	__ATTR_WO(mdev_shutdown),
> +	__ATTR_NULL
> +};
> +
> +int mdev_create_sysfs_files(struct device *dev)
> +{
> +	int retval;
> +
> +	retval = sysfs_create_file(&dev->kobj,
> +				   &dev_attr_mdev_supported_types.attr);
> +	if (retval) {
> +		pr_err("Failed to create mdev_supported_types sysfs entry\n");
> +		return retval;
> +	}
> +
> +	retval = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	if (retval) {
> +		pr_err("Failed to create mdev_create sysfs entry\n");
> +		return retval;
> +	}
> +
> +	retval = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +	if (retval) {
> +		pr_err("Failed to create mdev_destroy sysfs entry\n");
> +		return retval;
> +	}
> +
> +	return 0;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..a472310c7749
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,33 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  create_mdev_device(struct device *dev, uuid_le uuid, uint32_t instance,
> +			char *mdev_params);
> +int  destroy_mdev_device(uuid_le uuid, uint32_t instance);
> +void get_mdev_supported_types(struct device *dev, char *str);
> +int  mdev_start_callback(uuid_le uuid, uint32_t instance);
> +int  mdev_shutdown_callback(uuid_le uuid, uint32_t instance);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..d9633acd85f2
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,224 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +/* Common Data structures */
> +
> +struct pci_region_info {
> +	uint64_t start;
> +	uint64_t size;
> +	uint32_t flags;		/*!< VFIO region info flags */
> +};
> +
> +enum mdev_emul_space {
> +	EMUL_CONFIG_SPACE,	/*!< PCI configuration space */
> +	EMUL_IO,		/*!< I/O register space */
> +	EMUL_MMIO		/*!< Memory-mapped I/O space */
> +};
> +
> +struct phy_device;
> +
> +/*
> + * Mediated device
> + */
> +
> +struct mdev_device {
> +	struct kref		kref;
> +	struct device		dev;
> +	struct phy_device	*phy_dev;
> +	struct iommu_group	*group;
> +	void			*iommu_data;
> +	uuid_le			uuid;
> +	uint32_t		instance;
> +	void			*driver_data;
> +	struct mutex		ops_lock;
> +	struct list_head	next;
> +};
> +
> +
> +/**
> + * struct phy_device_ops - Structure to be registered for each physical device
> + * to register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Default attributes of the physical device.
> + * @mdev_attr_groups:	Default attributes of the mediated device.
> + * @supported_config:	Called to get information about supported types.
> + *			@dev : device structure of physical device.
> + *			@config: should return string listing supported config
> + *			Returns integer: success (0) or error (< 0)
> + * @create:		Called to allocate basic resources in physical device's
> + *			driver for a particular mediated device
> + *			@dev: physical pci device structure on which mediated
> + *			      device should be created
> + *			@uuid: uuid of the VM for which the device is intended
> + *			@instance: mediated instance in that VM
> + *			@mdev_params: extra parameters required by physical
> + *			device's driver.
> + *			Returns integer: success (0) or error (< 0)
> + * @destroy:		Called to free resources in physical device's driver for
> + *			a mediated device instance of that VM.
> + *			@dev: physical device structure to which this mediated
> + *			      device points to.
> + *			@uuid: VM's uuid for which the mediated device belongs
> + *			@instance: mdev instance in that VM
> + *			Returns integer: success (0) or error (< 0)
> + *			If the VM is running when destroy() is called, the mdev
> + *			is being hot-unplugged. Return an error if the VM is
> + *			running and the driver doesn't support mdev hotplug.
> + * @start:		Called to initiate mediated device initialization
> + *			process in physical device's driver when VM boots before
> + *			qemu starts.
> + *			@uuid: VM's UUID which is booting.
> + *			Returns integer: success (0) or error (< 0)
> + * @shutdown:		Called to teardown mediated device related resources for
> + *			the VM
> + *			@uuid: VM's UUID which is shutting down.
> + *			Returns integer: success (0) or error (< 0)
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number of bytes to read
> + *			@address_space: specifies for which address
> + *			space the request is: pci_config_space, IO
> + *			register space or MMIO space.
> + *			@pos: offset from base address.
> + *			Returns number of bytes read on success or error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number of bytes to be written
> + *			@address_space: specifies for which address space the
> + *			request is: pci_config_space, IO register space or MMIO
> + *			space.
> + *			@pos: offset from base address.
> + *			Returns number of bytes written on success or error.
> + * @set_irqs:		Called to send interrupt configuration
> + *			information that the VMM sets.
> + *			@mdev: mediated device structure
> + *			@flags, index, start, count and *data : same as that of
> + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> + * @get_region_info:	Called to get BAR size and flags of mediated device.
> + *			@mdev: mediated device structure
> + *			@region_index: VFIO region index
> + *			@region_info: output, returns size and flags of
> + *				      requested region.
> + *			Returns integer: success (0) or error (< 0)
> + * @validate_map_request: Validate remap pfn request
> + *			@mdev: mediated device structure
> + *			@virtaddr: target user address to start at
> + *			@pfn: physical address of kernel memory, vendor driver
> + *			      can change if required.
> + *			@size: size of map area, vendor driver can change the
> + *			       size of map area if desired.
> + *			@prot: page protection flags for this mapping, vendor
> + *			       driver can change, if required.
> + *			Returns integer: success (0) or error (< 0)
> + *
> + * A physical device that supports mediated devices should be registered with
> + * the mdev module using a phy_device_ops structure.
> + */
> +
> +struct phy_device_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +
> +	int	(*supported_config)(struct device *dev, char *config);
> +	int     (*create)(struct device *dev, uuid_le uuid,
> +			  uint32_t instance, char *mdev_params);
> +	int     (*destroy)(struct device *dev, uuid_le uuid,
> +			   uint32_t instance);
> +	int     (*start)(uuid_le uuid);
> +	int     (*shutdown)(uuid_le uuid);
> +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> +			enum mdev_emul_space address_space, loff_t pos);
> +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> +			 enum mdev_emul_space address_space, loff_t pos);
> +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> +			    unsigned int index, unsigned int start,
> +			    unsigned int count, void *data);
> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> +				 struct pci_region_info *region_info);
> +	int	(*validate_map_request)(struct mdev_device *vdev,
> +					unsigned long virtaddr,
> +					unsigned long *pfn, unsigned long *size,
> +					pgprot_t *prot);
> +};
> +
> +/*
> + * Physical Device
> + */
> +struct phy_device {
> +	struct device                   *dev;
> +	const struct phy_device_ops     *ops;
> +	struct list_head                next;
> +};
> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @match: called when new device or driver is added for this bus. Return 1 if
> + *	   given device can be handled by given driver and zero otherwise.
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	int  (*match)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}
> +
> +static inline struct mdev_device *mdev_get_device(struct mdev_device *vdev)
> +{
> +	return (vdev && get_device(&vdev->dev)) ? vdev : NULL;
> +}
> +
> +static inline  void mdev_put_device(struct mdev_device *vdev)
> +{
> +	if (vdev)
> +		put_device(&vdev->dev);
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct phy_device_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +extern int mdev_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
> +				uint32_t len, uint32_t flags);
> +
> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
> +
> +#endif /* MDEV_H */
> --
> 2.7.0


* RE: [RFC PATCH v4 2/3] VFIO driver for mediated PCI device
  2016-05-24 19:58   ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-25  8:15     ` Tian, Kevin
  -1 siblings, 0 replies; 92+ messages in thread
From: Tian, Kevin @ 2016-05-25  8:15 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Wednesday, May 25, 2016 3:58 AM
> 
> VFIO driver registers with MDEV core driver. MDEV core driver creates
> mediated device and calls probe routine of MPCI VFIO driver. This MPCI
> VFIO driver adds mediated device to VFIO core module.
> Main aim of this module is to manage all VFIO APIs for each mediated PCI
> device.
> Those are:
> - get region information from vendor driver.
> - trap and emulate PCI config space and BAR region.
> - Send interrupt configuration information to vendor driver.
> - mmap mappable region with invalidate mapping and fault on access to
>   remap pfn.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I48a34af88a9a905ec1f0f7528383c5db76c2e14d
> ---
>  drivers/vfio/mdev/Kconfig           |   7 +
>  drivers/vfio/mdev/Makefile          |   1 +
>  drivers/vfio/mdev/vfio_mpci.c       | 648 ++++++++++++++++++++++++++++++++++++
>  drivers/vfio/pci/vfio_pci_private.h |   6 -
>  drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
>  include/linux/vfio.h                |   7 +
>  6 files changed, 664 insertions(+), 6 deletions(-)
>  create mode 100644 drivers/vfio/mdev/vfio_mpci.c
> 
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> index 951e2bb06a3f..8d9e78aaa80f 100644
> --- a/drivers/vfio/mdev/Kconfig
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -9,3 +9,10 @@ config MDEV
> 
>          If you don't know what do here, say N.
> 
> +config VFIO_MPCI
> +    tristate "VFIO support for Mediated PCI devices"
> +    depends on VFIO && PCI && MDEV
> +    default n
> +    help
> +        VFIO based driver for mediated PCI devices.
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> index 4adb069febce..8ab38c57df21 100644
> --- a/drivers/vfio/mdev/Makefile
> +++ b/drivers/vfio/mdev/Makefile
> @@ -2,4 +2,5 @@
>  mdev-y := mdev-core.o mdev-sysfs.o mdev-driver.o
> 
>  obj-$(CONFIG_MDEV) += mdev.o
> +obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
> 
> diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
> new file mode 100644
> index 000000000000..ef9d757ec511
> --- /dev/null
> +++ b/drivers/vfio/mdev/vfio_mpci.c
> @@ -0,0 +1,648 @@
> +/*
> + * VFIO based Mediated PCI device driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "NVIDIA Corporation"
> +#define DRIVER_DESC     "VFIO based Mediated PCI device driver"
> +
> +struct vfio_mdevice {
> +	struct iommu_group *group;
> +	struct mdev_device *mdevice;
> +	int		    refcnt;
> +	struct pci_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
> +	u8		    *vconfig;
> +	struct mutex	    vfio_mdev_lock;
> +};
> +
> +static int get_virtual_bar_info(struct mdev_device *mdevice,
> +				struct pci_region_info *vfio_region_info,
> +				int index)

'virtual' or 'physical'? My feeling is that this gets the physical region
resource allocated for an mdev.

> +{
> +	int ret = -EINVAL;
> +	struct phy_device *phy_dev = mdevice->phy_dev;
> +
> +	if (dev_is_pci(phy_dev->dev) && phy_dev->ops->get_region_info) {
> +		mutex_lock(&mdevice->ops_lock);
> +		ret = phy_dev->ops->get_region_info(mdevice, index,
> +						    vfio_region_info);
> +		mutex_unlock(&mdevice->ops_lock);
> +	}
> +	return ret;
> +}
> +
> +static int mdev_read_base(struct vfio_mdevice *vdev)

Similar to the earlier comment: vdev or mdev?

> +{
> +	int index, pos;
> +	u32 start_lo, start_hi;
> +	u32 mem_type;
> +
> +	pos = PCI_BASE_ADDRESS_0;
> +
> +	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
> +
> +		if (!vdev->vfio_region_info[index].size)
> +			continue;
> +
> +		start_lo = (*(u32 *)(vdev->vconfig + pos)) &
> +					PCI_BASE_ADDRESS_MEM_MASK;
> +		mem_type = (*(u32 *)(vdev->vconfig + pos)) &
> +					PCI_BASE_ADDRESS_MEM_TYPE_MASK;
> +
> +		switch (mem_type) {
> +		case PCI_BASE_ADDRESS_MEM_TYPE_64:
> +			start_hi = (*(u32 *)(vdev->vconfig + pos + 4));
> +			pos += 4;
> +			break;
> +		case PCI_BASE_ADDRESS_MEM_TYPE_32:
> +		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
> +			/* 1M mem BAR treated as 32-bit BAR */
> +		default:
> +			/* mem unknown type treated as 32-bit BAR */
> +			start_hi = 0;
> +			break;
> +		}
> +		pos += 4;
> +		vdev->vfio_region_info[index].start = ((u64)start_hi << 32) |
> +							start_lo;
> +	}
> +	return 0;
> +}
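
To make the BAR decoding above easier to follow, here is a small user-space
sketch of reading one memory BAR from a config space buffer. It is not the
patch's code: cfg_read32 and decode_bar are hypothetical helpers, and the
macros are local copies of the values behind PCI_BASE_ADDRESS_0,
PCI_BASE_ADDRESS_MEM_MASK, PCI_BASE_ADDRESS_MEM_TYPE_MASK and
PCI_BASE_ADDRESS_MEM_TYPE_64. The sketch derives the config offset from the
BAR index and uses memcpy instead of a pointer cast:

```c
#include <stdint.h>
#include <string.h>

#define BAR0_OFFSET		0x10u	/* PCI_BASE_ADDRESS_0 */
#define BAR_MEM_MASK		(~0x0fu)	/* PCI_BASE_ADDRESS_MEM_MASK */
#define BAR_MEM_TYPE_MASK	0x06u	/* PCI_BASE_ADDRESS_MEM_TYPE_MASK */
#define BAR_MEM_TYPE_64		0x04u	/* PCI_BASE_ADDRESS_MEM_TYPE_64 */

/* Read a 32-bit value in host byte order without an unaligned cast. */
static uint32_t cfg_read32(const uint8_t *cfg, unsigned int off)
{
	uint32_t v;

	memcpy(&v, cfg + off, sizeof(v));
	return v;
}

/*
 * Decode memory BAR `index` (0..5) from a config space buffer.
 * Returns the base address and sets *is64 when the BAR is 64-bit,
 * in which case the next dword holds the upper 32 bits.
 */
static uint64_t decode_bar(const uint8_t *cfg, unsigned int index, int *is64)
{
	unsigned int pos = BAR0_OFFSET + 4 * index;
	uint32_t lo = cfg_read32(cfg, pos);
	uint64_t base = lo & BAR_MEM_MASK;

	*is64 = (lo & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64;
	if (*is64 && index < 5)
		base |= (uint64_t)cfg_read32(cfg, pos + 4) << 32;
	return base;
}
```

Deriving the offset from the index keeps the two from drifting apart when a
slot is skipped; whether that matters for the patch depends on which regions
can have size zero.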

Thanks
Kevin


* Re: [RFC PATCH v4 2/3] VFIO driver for mediated PCI device
  2016-05-25  8:15     ` [Qemu-devel] " Tian, Kevin
@ 2016-05-25 13:04       ` Kirti Wankhede
  -1 siblings, 0 replies; 92+ messages in thread
From: Kirti Wankhede @ 2016-05-25 13:04 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi



On 5/25/2016 1:45 PM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>> Sent: Wednesday, May 25, 2016 3:58 AM
>>
>> VFIO driver registers with MDEV core driver. MDEV core driver creates
>> mediated device and calls probe routine of MPCI VFIO driver. This MPCI
>> VFIO driver adds mediated device to VFIO core module.
>> Main aim of this module is to manage all VFIO APIs for each mediated PCI
>> device.
>> Those are:
>> - get region information from vendor driver.
>> - trap and emulate PCI config space and BAR region.
>> - Send interrupt configuration information to vendor driver.
>> - mmap mappable region with invalidate mapping and fault on access to
>>   remap pfn.
>>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> Signed-off-by: Neo Jia <cjia@nvidia.com>
>> Change-Id: I48a34af88a9a905ec1f0f7528383c5db76c2e14d
>> ---
>>  drivers/vfio/mdev/Kconfig           |   7 +
>>  drivers/vfio/mdev/Makefile          |   1 +
>>  drivers/vfio/mdev/vfio_mpci.c       | 648
>> ++++++++++++++++++++++++++++++++++++
>>  drivers/vfio/pci/vfio_pci_private.h |   6 -
>>  drivers/vfio/pci/vfio_pci_rdwr.c    |   1 +
>>  include/linux/vfio.h                |   7 +
>>  6 files changed, 664 insertions(+), 6 deletions(-)
>>  create mode 100644 drivers/vfio/mdev/vfio_mpci.c
>>
>> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
>> index 951e2bb06a3f..8d9e78aaa80f 100644
>> --- a/drivers/vfio/mdev/Kconfig
>> +++ b/drivers/vfio/mdev/Kconfig
>> @@ -9,3 +9,10 @@ config MDEV
>>
>>          If you don't know what do here, say N.
>>
>> +config VFIO_MPCI
>> +    tristate "VFIO support for Mediated PCI devices"
>> +    depends on VFIO && PCI && MDEV
>> +    default n
>> +    help
>> +        VFIO based driver for mediated PCI devices.
>> +
>> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
>> index 4adb069febce..8ab38c57df21 100644
>> --- a/drivers/vfio/mdev/Makefile
>> +++ b/drivers/vfio/mdev/Makefile
>> @@ -2,4 +2,5 @@
>>  mdev-y := mdev-core.o mdev-sysfs.o mdev-driver.o
>>
>>  obj-$(CONFIG_MDEV) += mdev.o
>> +obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
>>
>> diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
>> new file mode 100644
>> index 000000000000..ef9d757ec511
>> --- /dev/null
>> +++ b/drivers/vfio/mdev/vfio_mpci.c
>> @@ -0,0 +1,648 @@
>> +/*
>> + * VFIO based Mediated PCI device driver
>> + *
>> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
>> + *     Author: Neo Jia <cjia@nvidia.com>
>> + *	       Kirti Wankhede <kwankhede@nvidia.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#include <linux/init.h>
>> +#include <linux/module.h>
>> +#include <linux/device.h>
>> +#include <linux/kernel.h>
>> +#include <linux/slab.h>
>> +#include <linux/uuid.h>
>> +#include <linux/vfio.h>
>> +#include <linux/iommu.h>
>> +#include <linux/mdev.h>
>> +
>> +#include "mdev_private.h"
>> +
>> +#define DRIVER_VERSION  "0.1"
>> +#define DRIVER_AUTHOR   "NVIDIA Corporation"
>> +#define DRIVER_DESC     "VFIO based Mediated PCI device driver"
>> +
>> +struct vfio_mdevice {
>> +	struct iommu_group *group;
>> +	struct mdev_device *mdevice;
>> +	int		    refcnt;
>> +	struct pci_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
>> +	u8		    *vconfig;
>> +	struct mutex	    vfio_mdev_lock;
>> +};
>> +
>> +static int get_virtual_bar_info(struct mdev_device *mdevice,
>> +				struct pci_region_info *vfio_region_info,
>> +				int index)
> 
> 'virtual' or 'physical'? My feeling is to get physical region resource allocated
> for a mdev.
> 

It's the mediated device's region information; I'm changing the name to
get_mdev_region_info.


>> +{
>> +	int ret = -EINVAL;
>> +	struct phy_device *phy_dev = mdevice->phy_dev;
>> +
>> +	if (dev_is_pci(phy_dev->dev) && phy_dev->ops->get_region_info) {
>> +		mutex_lock(&mdevice->ops_lock);
>> +		ret = phy_dev->ops->get_region_info(mdevice, index,
>> +						    vfio_region_info);
>> +		mutex_unlock(&mdevice->ops_lock);
>> +	}
>> +	return ret;
>> +}
>> +
>> +static int mdev_read_base(struct vfio_mdevice *vdev)
> 
> similar as earlier comment - vdev or mdev?
>

Here vdev is of type 'vfio_mdevice'; that's why it is named vdev, and
'mdev' doesn't suit here. I'll change it to 'vmdev' in the next patch set.

Thanks,
Kirti


>> +{
>> +	int index, pos;
>> +	u32 start_lo, start_hi;
>> +	u32 mem_type;
>> +
>> +	pos = PCI_BASE_ADDRESS_0;
>> +
>> +	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
>> +
>> +		if (!vdev->vfio_region_info[index].size)
>> +			continue;
>> +
>> +		start_lo = (*(u32 *)(vdev->vconfig + pos)) &
>> +					PCI_BASE_ADDRESS_MEM_MASK;
>> +		mem_type = (*(u32 *)(vdev->vconfig + pos)) &
>> +					PCI_BASE_ADDRESS_MEM_TYPE_MASK;
>> +
>> +		switch (mem_type) {
>> +		case PCI_BASE_ADDRESS_MEM_TYPE_64:
>> +			start_hi = (*(u32 *)(vdev->vconfig + pos + 4));
>> +			pos += 4;
>> +			break;
>> +		case PCI_BASE_ADDRESS_MEM_TYPE_32:
>> +		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
>> +			/* 1M mem BAR treated as 32-bit BAR */
>> +		default:
>> +			/* mem unknown type treated as 32-bit BAR */
>> +			start_hi = 0;
>> +			break;
>> +		}
>> +		pos += 4;
>> +		vdev->vfio_region_info[index].start = ((u64)start_hi << 32) |
>> +							start_lo;
>> +	}
>> +	return 0;
>> +}
> 
> Thanks
> Kevin
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support]
  2016-05-25  7:13   ` [Qemu-devel] " Tian, Kevin
@ 2016-05-25 13:43     ` Alex Williamson
  -1 siblings, 0 replies; 92+ messages in thread
From: Alex Williamson @ 2016-05-25 13:43 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi

On Wed, 25 May 2016 07:13:58 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > Sent: Wednesday, May 25, 2016 3:58 AM
> > 
> > This series adds Mediated device support to v4.6 Linux host kernel. Purpose
> > of this series is to provide a common interface for mediated device
> > management that can be used by different devices. This series introduces
> > Mdev core module that create and manage mediated devices, VFIO based driver
> > for mediated PCI devices that are created by Mdev core module and update
> > VFIO type1 IOMMU module to support mediated devices.  
> 
> Thanks. "Mediated device" is more generic than previous one. :-)
> 
> > 
> > What's new in v4?
> > - Renamed 'vgpu' module to 'mdev' module that represent generic term
> >   'Mediated device'.
> > - Moved mdev directory to drivers/vfio directory as this is the extension
> >   of VFIO APIs for mediated devices.
> > - Updated mdev driver to be flexible to register multiple types of drivers
> >   to mdev_bus_type bus.
> > - Updated mdev core driver with mdev_put_device() and mdev_get_device() for
> >   mediated devices.
> > 
> >   
> 
> Just curious. In this version you move the whole mdev core under
> VFIO now. Sorry if I missed any agreement on this change. IIRC Alex 
> doesn't want VFIO to manage mdev life-cycle directly. Instead VFIO is 
> just a mdev driver on created mediated devices....

I did originally suggest keeping them separate, but as we've progressed
through the implementation, it's become more clear that the mediated
device interface is very much tied to the vfio interface, acting mostly
as a passthrough.  So I thought it made sense to pull them together.
Still open to discussion of course.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-05-25  7:55     ` [Qemu-devel] " Tian, Kevin
@ 2016-05-25 14:47       ` Kirti Wankhede
  -1 siblings, 0 replies; 92+ messages in thread
From: Kirti Wankhede @ 2016-05-25 14:47 UTC (permalink / raw)
  To: Tian, Kevin, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi


On 5/25/2016 1:25 PM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>> Sent: Wednesday, May 25, 2016 3:58 AM
>>

...

>> +
>> +config MDEV
>> +    tristate "Mediated device driver framework"
>
> Sorry not a native speaker. Is it cleaner to say "Driver framework for
> Mediated Devices" or "Mediated Device Framework"? Should we focus on
> driver or device here?
>

Both, device and driver. This framework provides a way to register
physical *devices* and also to register a *driver* for mediated devices.


>> +    depends on VFIO
>> +    default n
>> +    help
>> +        MDEV provides a framework to virtualize device without SR-IOV cap
>> +        See Documentation/mdev.txt for more details.
>
> Looks Documentation/mdev.txt is not included in this version.
>

Yes, I will include Documentation/mdev.txt in the next version of the patch.


>> +static struct devices_list {
>> +	struct list_head    dev_list;
>> +	struct mutex        list_lock;
>> +} mdevices, phy_devices;
>
> phy_devices -> pdevices? and similarly we can use pdev/mdev
> pair in other places...
>

'pdevices' is sometimes also read as 'pointer to devices'; that's the
reason I prefer to use phy_devices to represent 'physical devices'.


>> +static struct mdev_device *find_mdev_device(uuid_le uuid, int instance)
>
> can we just call it "struct mdev" or "mdevice"? "dev_device" looks
> redundant.
>

'struct mdev_device' represents 'device structure for device created by
mdev module'. If that still doesn't satisfy most folks, I'm open to
changing it.


> Sorry I may have to ask same question since I didn't get an answer yet.
> what exactly does 'instance' mean here? since uuid is unique, why do
> we need match instance too?
>

'uuid' could be the UUID of the VM for which the device is created. To
support multiple mediated devices for the same VM, the name should be
unique. Hence we need an instance number to identify each mediated device
uniquely within one VM.
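
The (uuid, instance) matching described above can be sketched as a plain list walk. This is a minimal userspace sketch under stated assumptions: struct uuid16, struct mdev_entry and find_mdev_entry() are hypothetical stand-ins for uuid_le, struct mdev_device and find_mdev_device(), and locking is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's 16-byte uuid_le. */
struct uuid16 {
	unsigned char b[16];
};

/* Hypothetical stand-in for struct mdev_device, reduced to its identity. */
struct mdev_entry {
	struct uuid16 uuid;		/* e.g. the UUID of the owning VM */
	int instance;			/* distinguishes devices sharing a UUID */
	struct mdev_entry *next;
};

/* Return the entry matching both uuid and instance, or NULL if none does. */
static struct mdev_entry *find_mdev_entry(struct mdev_entry *head,
					  const struct uuid16 *uuid,
					  int instance)
{
	struct mdev_entry *e;

	assert(uuid != NULL);
	for (e = head; e; e = e->next) {
		if (e->instance == instance &&
		    memcmp(e->uuid.b, uuid->b, sizeof(uuid->b)) == 0)
			return e;
	}
	return NULL;
}
```

With this shape, two devices created for the same VM share a uuid and differ only in instance, so a lookup on uuid alone would be ambiguous.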



>> +		if (phy_dev->ops->destroy) {
>> +			if (phy_dev->ops->destroy(phy_dev->dev, mdevice->uuid,
>> +						  mdevice->instance)) {
>> +				mutex_unlock(&phy_devices.list_lock);
>
> a warning message is preferred. Also better to return -EBUSY here.
>

mdev_destroy_device() is called from two paths: sysfs mdev_destroy and
mdev_unregister_device(). In the latter case, the return value from here
will be ignored anyway. mdev_unregister_device() is called from the remove
function of the physical device, which doesn't care about a returned
error; it just removes the device from the subsystem.

>> +				return;
>> +			}
>> +		}
>> +
>> +		mdev_remove_attribute_group(&mdevice->dev,
>> +					    phy_dev->ops->mdev_attr_groups);
>> +		mdevice->phy_dev = NULL;
>
> Am I missing something here? You didn't remove this mdev node from
> the list, and below...
>

device_unregister() calls put_device(dev), and when the refcount drops to
zero the device's release function is called. That function is
mdev_device_release(), which is hooked up during device_register(). The
node is removed from the list in mdev_device_release().
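
The pattern described above, where list removal happens in the release callback rather than in the destroy path, can be sketched in userspace C. This is a simplified model, not the driver core: struct node, node_get(), node_put() and the global device_list are hypothetical names, the counter is a plain int rather than a kref, and there is no locking.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int refcount;			/* models the device refcount */
	void (*release)(struct node *);	/* models dev->release */
	struct node *next;
};

static struct node *device_list;	/* hypothetical global list head */

static void list_del_node(struct node *n)
{
	struct node **pp;

	for (pp = &device_list; *pp; pp = &(*pp)->next) {
		if (*pp == n) {
			*pp = n->next;
			n->next = NULL;
			return;
		}
	}
}

/* Called only when the last reference is dropped. */
static void node_release(struct node *n)
{
	list_del_node(n);
}

static void node_get(struct node *n)
{
	n->refcount++;
}

static void node_put(struct node *n)
{
	assert(n->refcount > 0);
	if (--n->refcount == 0 && n->release)
		n->release(n);
}

/* Models device_register(): take the initial reference and hook release. */
static void node_register(struct node *n)
{
	n->refcount = 1;
	n->release = node_release;
	n->next = device_list;
	device_list = n;
}

/* Models device_unregister(): drop the initial reference. */
static void node_unregister(struct node *n)
{
	node_put(n);
}
```

In this model a node stays on the list after node_unregister() for as long as another holder still has a reference, which matches the explanation that the mdev node only leaves the list once the release function runs.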


>> +		mutex_unlock(&phy_devices.list_lock);
>
> you should use mutex of mdevices list
>

No, this lock is for phy_dev.


>> +	phy_dev->dev = dev;
>> +	phy_dev->ops = ops;
>> +
>> +	mutex_lock(&phy_devices.list_lock);
>> +	ret = mdev_create_sysfs_files(dev);
>> +	if (ret)
>> +		goto add_sysfs_error;
>> +
>> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
>> +	if (ret)
>> +		goto add_group_error;
>
> any reason to include sysfs operations inside the mutex which is
> purely about phy_devices list?
>

The dev_attr_groups attribute belongs to the physical device, hence it is
added inside phy_devices.list_lock.

* @dev_attr_groups:    Default attributes of the physical device.


>> +void mdev_unregister_device(struct device *dev)
>> +{
>> +	struct phy_device *phy_dev;
>> +	struct mdev_device *vdev = NULL;
>> +
>> +	phy_dev = find_physical_device(dev);
>> +
>> +	if (!phy_dev)
>> +		return;
>> +
>> +	dev_info(dev, "MDEV: Unregistering\n");
>> +
>> +	while ((vdev = find_next_mdev_device(phy_dev)))
>> +		mdev_destroy_device(vdev);
>
> Need check return value here since ops->destroy may fail.
>

See my comment above.


>> +static void mdev_device_release(struct device *dev)
>
> what's the difference between this release and earlier destroy version?
>

See my comment above


>> +static void __exit mdev_exit(void)
>> +{
>
> should we check any remaining mdev/pdev which are not cleaned
> up correctly here?
>

If any physical device is still registered with this module, the usage
count is not zero, so rmmod would fail anyway.

All mdev_devices created for a physical device are destroyed from
mdev_unregister_device(physical_device).

Hence there is no need to add code here that would never get used.

>> +	mdev_bus_unregister();
>> +	class_unregister(&mdev_class);
>> +}


Thanks,
Kirti




^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-05-24 19:58   ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-25 22:39     ` Alex Williamson
  -1 siblings, 0 replies; 92+ messages in thread
From: Alex Williamson @ 2016-05-25 22:39 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

On Wed, 25 May 2016 01:28:15 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Design for Mediated Device Driver:
> Main purpose of this driver is to provide a common interface for mediated
> device management that can be used by different drivers of different
> devices.
> 
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
> 
> Below is the high Level block diagram, with Nvidia, Intel and IBM devices
> as example, since these are the devices which are going to actively use
> this module as of now.
> 
>  +---------------+
>  |               |
>  | +-----------+ |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |  mdev     | +------------------------>+              |<-> VFIO user
>  | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
>  | |  driver   | |                         |              |
>  | |           | |                         +--------------+
>  | |           | |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |           | +------------------------>+              |<-> VFIO user
>  | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
>  |               |                         |              |
>  |  MDEV CORE    |                         +--------------+
>  |   MODULE      |
>  |   mdev.ko     |
>  | +-----------+ |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         |  nvidia.ko   |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | | Physical  | |
>  | |  device   | |  mdev_register_device() +--------------+
>  | | interface | |<------------------------+              |
>  | |           | |                         |  i915.ko     |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | |           | |
>  | |           | |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         | ccw_device.ko|<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | +-----------+ |
>  +---------------+
> 
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
> 
> /**
>   * struct mdev_driver - Mediated device's driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove:called when device removed
>   * @match: called when new device or driver is added for this bus.
> 	    Return 1 if given device can be handled by given driver and
> 	    zero otherwise.
>   * @driver:device driver structure
>   *
>   **/
> struct mdev_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
> 	 int  (*match)(struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
> 
> Mediated device's driver for mdev should use this interface to register
> with Core driver. With this, mediated devices driver for such devices is
> responsible to add mediated device to VFIO group.
> 
> 2. Physical device driver interface
> This interface provides the vendor driver with a set of APIs to manage
> physical device related work in its own driver. The APIs are:
> - supported_config: provide supported configuration list by the vendor
> 		    driver
> - create: to allocate basic resources in vendor driver for a mediated
> 	  device.
> - destroy: to free resources in vendor driver when mediated device is
> 	   destroyed.
> - start: to initiate mediated device initialization process from vendor
> 	 driver when VM boots and before QEMU starts.
> - shutdown: to teardown mediated device resources during VM teardown.
> - read : read emulation callback.
> - write: write emulation callback.
> - set_irqs: send interrupt configuration information that QEMU sets.
> - get_region_info: to provide region size and its flags for the mediated
> 		   device.
> - validate_map_request: to validate remap pfn request.

nit, vfio is a userspace driver interface where QEMU is simply a user
of that interface.  We should never assume QEMU is the only user.

> 
> This registration interface should be used by vendor drivers to register
> each physical device to mdev core driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I88f4482f7608f40550a152c5f882b64271287c62
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/Makefile            |   1 +
>  drivers/vfio/mdev/Kconfig        |  11 +
>  drivers/vfio/mdev/Makefile       |   5 +
>  drivers/vfio/mdev/mdev-core.c    | 462 +++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev-driver.c  | 139 ++++++++++++
>  drivers/vfio/mdev/mdev-sysfs.c   | 312 ++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev_private.h |  33 +++
>  include/linux/mdev.h             | 224 +++++++++++++++++++
>  9 files changed, 1188 insertions(+)
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev-core.c
>  create mode 100644 drivers/vfio/mdev/mdev-driver.c
>  create mode 100644 drivers/vfio/mdev/mdev-sysfs.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h

- or _, pick one.  Underscore is more prevalent.

>  create mode 100644 include/linux/mdev.h
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
>  
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>  source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..7c70753e54ab 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..951e2bb06a3f
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,11 @@
> +
> +config MDEV
> +    tristate "Mediated device driver framework"
> +    depends on VFIO
> +    default n
> +    help
> +        MDEV provides a framework to virtualize device without SR-IOV cap
> +        See Documentation/mdev.txt for more details.

I don't see that file anywhere in this series.  Also note that SR-IOV
is a PCI specific technology while as a framework this is specifically
not limited to PCI.  In fact, there's really no requirement here that
physical hardware even exists, this interface could be used to provide
in-kernel emulation of a device.

> +
> +        If you don't know what to do here, say N.
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..4adb069febce
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,5 @@
> +
> +mdev-y := mdev-core.o mdev-sysfs.o mdev-driver.o
> +
> +obj-$(CONFIG_MDEV) += mdev.o
> +
> diff --git a/drivers/vfio/mdev/mdev-core.c b/drivers/vfio/mdev/mdev-core.c
> new file mode 100644
> index 000000000000..af070d73735f
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev-core.c
> @@ -0,0 +1,462 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/cdev.h>
> +#include <linux/sched.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +#define MDEV_CLASS_NAME		"mdev"
> +
> +/*
> + * Global Structures
> + */
> +
> +static struct devices_list {
> +	struct list_head    dev_list;
> +	struct mutex        list_lock;
> +} mdevices, phy_devices;
> +
> +/*
> + * Functions
> + */
> +
> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}
> +
> +static struct mdev_device *find_mdev_device(uuid_le uuid, int instance)
> +{
> +	struct mdev_device *vdev = NULL, *v;
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_for_each_entry(v, &mdevices.dev_list, next) {
> +		if ((uuid_le_cmp(v->uuid, uuid) == 0) &&
> +		    (v->instance == instance)) {
> +			vdev = v;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&mdevices.list_lock);
> +	return vdev;
> +}
> +
> +static struct mdev_device *find_next_mdev_device(struct phy_device *phy_dev)
> +{

What's "next" about this function?  "next" generally means the caller
provides a starting point in the list and the search continues from
there.

> +	struct mdev_device *mdev = NULL, *p;
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_for_each_entry(p, &mdevices.dev_list, next) {
> +		if (p->phy_dev == phy_dev) {
> +			mdev = p;
> +			break;
> +		}
> +	}

Ah, I see from the unregister below that this is intended as a first
entry type function, so the naming is not consistent with Linux
terminology.  Suggest "first" in the name.
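
To illustrate the naming distinction, here is a minimal userspace sketch,
not kernel code. struct parent, struct mdev and both helper names are
hypothetical stand-ins, and a plain singly linked list replaces list_head.
A "first" helper always starts at the head; a "next" helper continues from
a position the caller passes in.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct phy_device / struct mdev_device. */
struct parent { int id; };

struct mdev {
	struct parent *parent;
	struct mdev *next;	/* simple singly linked list, not list_head */
};

/* "first": always searches from the head of the list. */
static struct mdev *mdev_find_first(struct mdev *head, struct parent *p)
{
	struct mdev *m;

	for (m = head; m; m = m->next)
		if (m->parent == p)
			return m;
	return NULL;
}

/* "next": resumes the search after a caller-supplied starting point. */
static struct mdev *mdev_find_next(struct mdev *pos, struct parent *p)
{
	return pos ? mdev_find_first(pos->next, p) : NULL;
}
```

The unregister loop in this patch repeatedly takes whatever entry matches
from the head, which corresponds to the "first" variant above.
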

> +	mutex_unlock(&mdevices.list_lock);
> +	return mdev;
> +}
> +
> +static struct phy_device *find_physical_device(struct device *dev)
> +{
> +	struct phy_device *pdev = NULL, *p;
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	list_for_each_entry(p, &phy_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			pdev = p;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&phy_devices.list_lock);
> +	return pdev;
> +}
> +
> +static void mdev_destroy_device(struct mdev_device *mdevice)
> +{
> +	struct phy_device *phy_dev = mdevice->phy_dev;
> +
> +	if (phy_dev) {
> +		mutex_lock(&phy_devices.list_lock);
> +
> +		/*
> +		* If vendor driver doesn't return success that means vendor
> +		* driver doesn't support hot-unplug
> +		*/
> +		if (phy_dev->ops->destroy) {
> +			if (phy_dev->ops->destroy(phy_dev->dev, mdevice->uuid,
> +						  mdevice->instance)) {
> +				mutex_unlock(&phy_devices.list_lock);
> +				return;
> +			}
> +		}
> +
> +		mdev_remove_attribute_group(&mdevice->dev,
> +					    phy_dev->ops->mdev_attr_groups);
> +		mdevice->phy_dev = NULL;
> +		mutex_unlock(&phy_devices.list_lock);

Locking here appears arbitrary, how does the above code interact with
phy_devices.dev_list?

> +	}
> +
> +	mdev_put_device(mdevice);
> +	device_unregister(&mdevice->dev);
> +}
> +
> +/*
> + * Find mediated device from given iommu_group and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_for_each_entry(p, &mdevices.dev_list, next) {
> +		if (!p->group)
> +			continue;
> +
> +		if (iommu_group_id(p->group) == iommu_group_id(group)) {
> +			mdev = mdev_get_device(p);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&mdevices.list_lock);
> +	return mdev;
> +}
> +EXPORT_SYMBOL_GPL(mdev_get_device_by_group);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing physical device.
> + * @phy_device_ops: Physical device operation structure to be registered.

Why are we insistent that there's a physical device?

> + *
> + * Add device to list of registered physical devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct phy_device_ops *ops)
> +{
> +	int ret = 0;
> +	struct phy_device *phy_dev, *pdev;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	/* Check for duplicate */
> +	pdev = find_physical_device(dev);
> +	if (pdev)
> +		return -EEXIST;
> +

Why do we need a separate variable for this test vs just using phy_dev?

> +	phy_dev = kzalloc(sizeof(*phy_dev), GFP_KERNEL);
> +	if (!phy_dev)
> +		return -ENOMEM;
> +
> +	phy_dev->dev = dev;
> +	phy_dev->ops = ops;
> +

There's a race between where we searched for the physical device above
and when we grab this lock.
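
One way to close that window is to do the duplicate check and the list
insertion inside a single critical section. The following is only a
userspace sketch of that pattern: a pthread mutex stands in for the kernel
mutex, and parent_register() and parent_find_locked() are hypothetical
names, not functions from this series.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

struct parent {
	const void *dev;	/* key: the registering device */
	struct parent *next;
};

static struct parent *parent_list;
static pthread_mutex_t parent_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold parent_lock. */
static struct parent *parent_find_locked(const void *dev)
{
	struct parent *p;

	for (p = parent_list; p; p = p->next)
		if (p->dev == dev)
			return p;
	return NULL;
}

static int parent_register(const void *dev)
{
	struct parent *p;

	if (!dev)
		return -EINVAL;

	/* Allocate before taking the lock so the critical section is short. */
	p = calloc(1, sizeof(*p));
	if (!p)
		return -ENOMEM;
	p->dev = dev;

	pthread_mutex_lock(&parent_lock);
	if (parent_find_locked(dev)) {
		/* Duplicate check and insertion share one critical section. */
		pthread_mutex_unlock(&parent_lock);
		free(p);
		return -EEXIST;
	}
	p->next = parent_list;
	parent_list = p;
	pthread_mutex_unlock(&parent_lock);
	return 0;
}
```

With this shape, two concurrent registrations of the same device cannot
both pass the duplicate check.
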

> +	mutex_lock(&phy_devices.list_lock);
> +	ret = mdev_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;
> +
> +	list_add(&phy_dev->next, &phy_devices.dev_list);
> +	dev_info(dev, "MDEV: Registered\n");
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_unlock(&phy_devices.list_lock);
> +	kfree(phy_dev);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a physical device
> + * @dev: device structure representing physical device.
> + *
> + * Remove device from list of registered physical devices. Gives a chance to
> + * free existing mediated devices for the given physical device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct phy_device *phy_dev;
> +	struct mdev_device *vdev = NULL;

vdev?  Why not mdev?

> +
> +	phy_dev = find_physical_device(dev);
> +
> +	if (!phy_dev)
> +		return;
> +
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	while ((vdev = find_next_mdev_device(phy_dev)))
> +		mdev_destroy_device(vdev);
> +

I'm guessing there's a race here that a new mdev could be added before
the phy_dev gets removed.  Probably need to fix ordering to remove the
phy_dev from the list first to prevent new mdev's being created from it.

> +	mutex_lock(&phy_devices.list_lock);
> +	list_del(&phy_dev->next);
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	mdev_remove_attribute_group(dev,
> +				    phy_dev->ops->dev_attr_groups);
> +
> +	mdev_remove_sysfs_files(dev);
> +	kfree(phy_dev);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev-sysfs
> + */
> +
> +static struct mdev_device *mdev_device_alloc(uuid_le uuid, int instance)
> +{
> +	struct mdev_device *mdevice = NULL;

Used mdev above, then vdev, now mdevice, pick one.

> +
> +	mdevice = kzalloc(sizeof(*mdevice), GFP_KERNEL);
> +	if (!mdevice)
> +		return ERR_PTR(-ENOMEM);
> +
> +	kref_init(&mdevice->kref);
> +	memcpy(&mdevice->uuid, &uuid, sizeof(uuid_le));
> +	mdevice->instance = instance;
> +	mutex_init(&mdevice->ops_lock);
> +
> +	return mdevice;
> +}
> +
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdevice = to_mdev_device(dev);
> +
> +	if (!mdevice)
> +		return;
> +
> +	dev_info(&mdevice->dev, "MDEV: destroying\n");
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_del(&mdevice->next);
> +	mutex_unlock(&mdevices.list_lock);
> +
> +	kfree(mdevice);
> +}
> +
> +int create_mdev_device(struct device *dev, uuid_le uuid, uint32_t instance,
> +		       char *mdev_params)

Naming seems inconsistent, mdev_device_create()?

> +{
> +	int retval = 0;
> +	struct mdev_device *mdevice = NULL;
> +	struct phy_device *phy_dev;
> +
> +	phy_dev = find_physical_device(dev);
> +	if (!phy_dev)
> +		return -EINVAL;
> +
> +	mdevice = mdev_device_alloc(uuid, instance);
> +	if (IS_ERR(mdevice)) {
> +		retval = PTR_ERR(mdevice);
> +		return retval;
> +	}
> +
> +	mdevice->dev.parent  = dev;
> +	mdevice->dev.bus     = &mdev_bus_type;
> +	mdevice->dev.release = mdev_device_release;
> +	dev_set_name(&mdevice->dev, "%pUb-%d", uuid.b, instance);
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_add(&mdevice->next, &mdevices.dev_list);

We assume no conflicts?

> +	mutex_unlock(&mdevices.list_lock);
> +
> +	retval = device_register(&mdevice->dev);
> +	if (retval) {
> +		mdev_put_device(mdevice);
> +		return retval;
> +	}
> +
> +	mutex_lock(&phy_devices.list_lock);

What are we locking here?  We found phy_dev under lock but we have
absolutely no guarantee that it still exists and holding this mutex
doesn't change that.

> +	if (phy_dev->ops->create) {
> +		retval = phy_dev->ops->create(dev, mdevice->uuid,
> +					      instance, mdev_params);
> +		if (retval)
> +			goto create_failed;
> +	}
> +
> +	retval = mdev_add_attribute_group(&mdevice->dev,
> +					  phy_dev->ops->mdev_attr_groups);
> +	if (retval)
> +		goto create_failed;
> +
> +	mdevice->phy_dev = phy_dev;
> +	mutex_unlock(&phy_devices.list_lock);
> +	mdev_get_device(mdevice);
> +	dev_info(&mdevice->dev, "MDEV: created\n");
> +
> +	return retval;
> +
> +create_failed:
> +	mutex_unlock(&phy_devices.list_lock);
> +	device_unregister(&mdevice->dev);
> +	return retval;
> +}
> +
> +int destroy_mdev_device(uuid_le uuid, uint32_t instance)
> +{
> +	struct mdev_device *vdev;
> +
> +	vdev = find_mdev_device(uuid, instance);
> +
> +	if (!vdev)
> +		return -EINVAL;
> +
> +	mdev_destroy_device(vdev);
> +	return 0;
> +}
> +
> +void get_mdev_supported_types(struct device *dev, char *str)

Is there some defined max for the string?  How do we know how much the
caller has allocated?  Should we have a char** here so we can allocate
it?
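
For illustration only, a callee-allocates variant could look roughly like
the userspace sketch below. get_supported_types() and its signature are
hypothetical; in the kernel the allocation would presumably go through
something like kstrdup() rather than malloc().

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical replacement signature: the callee allocates *out and
 * the caller frees it, so no buffer size has to be agreed on in advance.
 */
static int get_supported_types(const char *types, char **out)
{
	size_t len;
	char *buf;

	if (!types || !out)
		return -EINVAL;

	len = strlen(types);
	buf = malloc(len + 1);
	if (!buf)
		return -ENOMEM;

	memcpy(buf, types, len + 1);	/* copies the terminating NUL too */
	*out = buf;
	return 0;
}
```

The caller then owns the string and frees it when done, and the length is
whatever the provider actually needs.
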

> +{
> +	struct phy_device *phy_dev;
> +
> +	phy_dev = find_physical_device(dev);
> +
> +	if (phy_dev) {
> +		mutex_lock(&phy_devices.list_lock);

Again, this lock doesn't protect anything.  We either need a reference
or we need end-to-end locking.
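
To make the "reference" option concrete, here is a minimal userspace
sketch. The assumptions: a plain integer protected by the list lock stands
in for a kref, the "freed" flag marks where the real code would kfree(),
and parent_get() and parent_put() are hypothetical names. The point is
that the lookup and the reference grab happen under the same lock, so the
object cannot disappear between the two.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct parent {
	const void *dev;
	int refcount;		/* protected by parent_lock in this sketch */
	bool freed;		/* set where the real code would kfree() */
	struct parent *next;
};

static struct parent *parent_list;
static pthread_mutex_t parent_lock = PTHREAD_MUTEX_INITIALIZER;

/* Look up and take a reference in the same critical section. */
static struct parent *parent_get(const void *dev)
{
	struct parent *p;

	pthread_mutex_lock(&parent_lock);
	for (p = parent_list; p; p = p->next) {
		if (p->dev == dev) {
			p->refcount++;
			break;
		}
	}
	pthread_mutex_unlock(&parent_lock);
	return p;	/* NULL if no entry matched */
}

/* Drop a reference; the last put is the only place the object goes away. */
static void parent_put(struct parent *p)
{
	pthread_mutex_lock(&parent_lock);
	if (--p->refcount == 0)
		p->freed = true;
	pthread_mutex_unlock(&parent_lock);
}
```

A callback such as supported_config would then run between parent_get()
and parent_put() without holding the list lock at all.
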

> +		if (phy_dev->ops->supported_config)
> +			phy_dev->ops->supported_config(phy_dev->dev, str);
> +		mutex_unlock(&phy_devices.list_lock);
> +	}
> +}
> +
> +int mdev_start_callback(uuid_le uuid, uint32_t instance)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdevice;
> +	struct phy_device *phy_dev;
> +
> +	mdevice = find_mdev_device(uuid, instance);
> +
> +	if (!mdevice)
> +		return -EINVAL;
> +
> +	phy_dev = mdevice->phy_dev;
> +
> +	mutex_lock(&phy_devices.list_lock);

Ineffective locking...

> +	if (phy_dev->ops->start)
> +		ret = phy_dev->ops->start(mdevice->uuid);
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	if (ret < 0)
> +		pr_err("mdev_start failed  %d\n", ret);
> +	else
> +		kobject_uevent(&mdevice->dev.kobj, KOBJ_ONLINE);
> +
> +	return ret;
> +}
> +
> +int mdev_shutdown_callback(uuid_le uuid, uint32_t instance)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdevice;
> +	struct phy_device *phy_dev;
> +
> +	mdevice = find_mdev_device(uuid, instance);
> +
> +	if (!mdevice)
> +		return -EINVAL;
> +
> +	phy_dev = mdevice->phy_dev;
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	if (phy_dev->ops->shutdown)
> +		ret = phy_dev->ops->shutdown(mdevice->uuid);
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	if (ret < 0)
> +		pr_err("mdev_shutdown failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdevice->dev.kobj, KOBJ_OFFLINE);
> +
> +	return ret;
> +}
> +
> +static struct class mdev_class = {
> +	.name		= MDEV_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= mdev_class_attrs,
> +};
> +
> +static int __init mdev_init(void)
> +{
> +	int rc = 0;
> +
> +	mutex_init(&mdevices.list_lock);
> +	INIT_LIST_HEAD(&mdevices.dev_list);
> +	mutex_init(&phy_devices.list_lock);
> +	INIT_LIST_HEAD(&phy_devices.dev_list);
> +
> +	rc = class_register(&mdev_class);
> +	if (rc < 0) {
> +		pr_err("Failed to register mdev class\n");
> +		return rc;
> +	}
> +
> +	rc = mdev_bus_register();
> +	if (rc < 0) {
> +		pr_err("Failed to register mdev bus\n");
> +		class_unregister(&mdev_class);
> +		return rc;
> +	}
> +
> +	return rc;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	mdev_bus_unregister();
> +	class_unregister(&mdev_class);
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev-driver.c b/drivers/vfio/mdev/mdev-driver.c
> new file mode 100644
> index 000000000000..bc8a169782bc
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev-driver.c
> @@ -0,0 +1,139 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdevice_attach_iommu(struct mdev_device *mdevice)
> +{
> +	int retval = 0;
> +	struct iommu_group *group = NULL;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdevice->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	retval = iommu_group_add_device(group, &mdevice->dev);
> +	if (retval) {
> +		dev_err(&mdevice->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdevice->group = group;
> +
> +	dev_info(&mdevice->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));

I assume a lot of these should probably be dev_dbg() or just removed
before we actually think about committing this code.

> +attach_fail:
> +	iommu_group_put(group);
> +	return retval;
> +}
> +
> +static void mdevice_detach_iommu(struct mdev_device *mdevice)
> +{
> +	iommu_group_remove_device(&mdevice->dev);
> +	dev_info(&mdevice->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdevice_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdevice = to_mdev_device(dev);
> +	int status = 0;

status here, retval above, ret in previous file, please use some
consistency.  mdevice vs mdev, same.

> +
> +	status = mdevice_attach_iommu(mdevice);
> +	if (status) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return status;
> +	}
> +
> +	if (drv && drv->probe)
> +		status = drv->probe(dev);
> +
> +	return status;
> +}
> +
> +static int mdevice_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdevice = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdevice_detach_iommu(mdevice);
> +
> +	return 0;
> +}
> +
> +static int mdevice_match(struct device *dev, struct device_driver *drv)
> +{
> +	int ret = 0;
> +	struct mdev_driver *mdrv = to_mdev_driver(drv);
> +
> +	if (mdrv && mdrv->match)
> +		ret = mdrv->match(dev);
> +
> +	return ret;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.match		= mdevice_match,
> +	.probe		= mdevice_probe,
> +	.remove		= mdevice_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/**
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: owner module of driver to register
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/**
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev-sysfs.c b/drivers/vfio/mdev/mdev-sysfs.c
> new file mode 100644
> index 000000000000..79d351a7a502
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev-sysfs.c
> @@ -0,0 +1,312 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +/* Static functions */
> +
> +#define UUID_CHAR_LENGTH	36
> +#define UUID_BYTE_LENGTH	16
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
> +
> +static inline bool is_uuid_sep(char sep)
> +{
> +	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
> +		return true;
> +	return false;
> +}
> +
> +static int uuid_parse(const char *str, uuid_le *uuid)
> +{
> +	int i;
> +
> +	if (strlen(str) < UUID_CHAR_LENGTH)
> +		return -1;
> +
> +	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
> +		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
> +			pr_err("%s err", __func__);
> +			return -EINVAL;
> +		}
> +
> +		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
> +		str += 2;
> +		if (is_uuid_sep(*str))
> +			str++;
> +	}
> +
> +	return 0;
> +}
> +
> +/* Functions */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str;
> +	ssize_t n;
> +
> +	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	get_mdev_supported_types(dev, str);
> +
> +	n = sprintf(buf, "%s\n", str);
> +	kfree(str);
> +
> +	return n;
> +}
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *instance_str, *mdev_params = NULL;
> +	uuid_le uuid;
> +	uint32_t instance;
> +	int ret = 0;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_create: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev instance not present %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	instance_str = strsep(&str, ":");
> +	if (!instance_str) {
> +		pr_err("mdev_create: Empty instance string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	ret = kstrtouint(instance_str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev params not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}

Are they necessarily required?  What if the driver doesn't have
multiple types?  The supported_config callback is optional per previous
code.
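
As a sketch of what accepting an optional params field might look like,
here is a userspace parser for "UUID:instance[:params]". parse_create() is
a hypothetical helper, and strtoul() stands in for kstrtouint(), so details
such as trailing-newline handling differ from the kernel helper.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split "UUID:instance[:params]" in place. *params is NULL when the
 * trailing field is absent, which is treated as valid here.
 */
static int parse_create(char *buf, char **uuid, unsigned long *instance,
			char **params)
{
	char *sep, *end;

	sep = strchr(buf, ':');
	if (!sep || sep == buf)
		return -EINVAL;		/* UUID or instance missing */
	*sep++ = '\0';
	*uuid = buf;

	*params = strchr(sep, ':');
	if (*params)
		*(*params)++ = '\0';	/* optional third field */

	errno = 0;
	*instance = strtoul(sep, &end, 0);
	if (errno || end == sep || *end != '\0')
		return -EINVAL;

	return 0;
}
```

The vendor create callback would then receive NULL params and decide for
itself whether that is acceptable.
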

> +
> +	mdev_params = kstrdup(str, GFP_KERNEL);
> +
> +	if (!mdev_params) {
> +		ret = -ENOMEM;
> +		goto create_error;
> +	}
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_create: UUID parse error %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (create_mdev_device(dev, uuid, instance, mdev_params) < 0) {
> +		pr_err("mdev_create: Failed to create mdev device\n");
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +	ret = count;
> +
> +create_error:
> +	kfree(mdev_params);
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	char *uuid_str, *str, *pstr;
> +	uuid_le uuid;
> +	unsigned int instance;
> +	int ret;
> +
> +	str = pstr = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	if (str == NULL) {
> +		pr_err("mdev_destroy: instance not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = kstrtouint(str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_destroy: instance parsing error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_destroy: UUID parse error  %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = destroy_mdev_device(uuid, instance);
> +	if (ret < 0)
> +		goto destroy_error;
> +
> +	ret = count;
> +
> +destroy_error:
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
> +			 const char *buf, size_t count)
> +{
> +	char *uuid_str;
> +	uuid_le uuid;
> +	int ret = 0;
> +
> +	uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_start: UUID parse error  %s\n", buf);
> +		ret = -EINVAL;
> +		goto start_error;
> +	}
> +
> +	ret = mdev_start_callback(uuid, 0);
> +	if (ret < 0)
> +		goto start_error;
> +
> +	ret = count;
> +
> +start_error:
> +	kfree(uuid_str);
> +	return ret;
> +}
> +
> +ssize_t mdev_shutdown_store(struct class *class, struct class_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *uuid_str;
> +	uuid_le uuid;
> +	int ret = 0;
> +
> +	uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_shutdown: UUID parse error %s\n", buf);
> +		ret = -EINVAL;
> +	}
> +
> +	ret = mdev_shutdown_callback(uuid, 0);
> +	if (ret < 0)
> +		goto shutdown_error;
> +
> +	ret = count;
> +
> +shutdown_error:
> +	kfree(uuid_str);
> +	return ret;
> +
> +}
> +
> +struct class_attribute mdev_class_attrs[] = {
> +	__ATTR_WO(mdev_start),
> +	__ATTR_WO(mdev_shutdown),
> +	__ATTR_NULL
> +};
> +
> +int mdev_create_sysfs_files(struct device *dev)
> +{
> +	int retval;
> +
> +	retval = sysfs_create_file(&dev->kobj,
> +				   &dev_attr_mdev_supported_types.attr);
> +	if (retval) {
> +		pr_err("Failed to create mdev_supported_types sysfs entry\n");
> +		return retval;
> +	}
> +
> +	retval = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	if (retval) {
> +		pr_err("Failed to create mdev_create sysfs entry\n");
> +		return retval;
> +	}
> +
> +	retval = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +	if (retval) {
> +		pr_err("Failed to create mdev_destroy sysfs entry\n");
> +		return retval;
> +	}
> +
> +	return 0;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..a472310c7749
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,33 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  create_mdev_device(struct device *dev, uuid_le uuid, uint32_t instance,
> +			char *mdev_params);
> +int  destroy_mdev_device(uuid_le uuid, uint32_t instance);
> +void get_mdev_supported_types(struct device *dev, char *str);
> +int  mdev_start_callback(uuid_le uuid, uint32_t instance);
> +int  mdev_shutdown_callback(uuid_le uuid, uint32_t instance);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..d9633acd85f2
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,224 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +/* Common Data structures */
> +
> +struct pci_region_info {
> +	uint64_t start;
> +	uint64_t size;
> +	uint32_t flags;		/*!< VFIO region info flags */

Perhaps more clear to say "Same as vfio_region_info.flags".  Also prefix
with mdev_.

What's with this comment style /*!< ?

> +};
> +
> +enum mdev_emul_space {
> +	EMUL_CONFIG_SPACE,	/*!< PCI configuration space */
> +	EMUL_IO,		/*!< I/O register space */
> +	EMUL_MMIO		/*!< Memory-mapped I/O space */
> +};
> +
> +struct phy_device;
> +
> +/*
> + * Mediated device
> + */
> +
> +struct mdev_device {
> +	struct kref		kref;
> +	struct device		dev;
> +	struct phy_device	*phy_dev;
> +	struct iommu_group	*group;
> +	void			*iommu_data;
> +	uuid_le			uuid;
> +	uint32_t		instance;
> +	void			*driver_data;
> +	struct mutex		ops_lock;
> +	struct list_head	next;
> +};

Could this be in the private header?  Seems like this should be opaque
outside of mdev core.

> +
> +
> +/**
> + * struct phy_device_ops - Structure to be registered for each physical device
> + * to register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Default attributes of the physical device.
> + * @mdev_attr_groups:	Default attributes of the mediated device.
> + * @supported_config:	Called to get information about supported types.
> + *			@dev : device structure of physical device.
> + *			@config: should return string listing supported config
> + *			Returns integer: success (0) or error (< 0)
> + * @create:		Called to allocate basic resources in physical device's
> + *			driver for a particular mediated device
> + *			@dev: physical pci device structure on which mediated
> + *			      device should be created

Not necessarily pci.

> + *			@uuid: VM's uuid for which VM it is intended to
> + *			@instance: mediated instance in that VM
> + *			@mdev_params: extra parameters required by physical
> + *			device's driver.
> + *			Returns integer: success (0) or error (< 0)
> + * @destroy:		Called to free resources in physical device's driver for
> + *			a mediated device instance of that VM.
> + *			@dev: physical device structure to which this mediated
> + *			      device points to.
> + *			@uuid: VM's uuid for which the mediated device belongs
> + *			@instance: mdev instance in that VM
> + *			Returns integer: success (0) or error (< 0)
> + *			If VM is running and destroy() is called that means the
> + *			mdev is being hot-unplugged. Return error if VM is running
> + *			and driver doesn't support mediated device hotplug.
> + * @start:		Called to initiate mediated device initialization
> + *			process in physical device's driver when VM boots before
> + *			qemu starts.
> + *			@uuid: VM's UUID which is booting.
> + *			Returns integer: success (0) or error (< 0)
> + * @shutdown:		Called to teardown mediated device related resources for
> + *			the VM
> + *			@uuid: VM's UUID which is shutting down.
> + *			Returns integer: success (0) or error (< 0)
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number bytes to read
> + *			@address_space: specifies for which address
> + *			space the request is: pci_config_space, IO
> + *			register space or MMIO space.

Seems like I asked before and it's no clearer in the code: how do we
handle multiple spaces for various types?  I.e., a device might have
multiple MMIO spaces.

> + *			@pos: offset from base address.

What's the base address, zero?

> + *			Returns number of bytes read on success or error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number bytes to be written
> + *			@address_space: specifies for which address space the
> + *			request is: pci_config_space, IO register space or MMIO
> + *			space.
> + *			@pos: offset from base address.
> + *			Returns number of bytes written on success or error.
> + * @set_irqs:		Called to send about interrupts configuration
> + *			information that VMM sets.
> + *			@mdev: mediated device structure
> + *			@flags, index, start, count and *data : same as that of
> + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> + * @get_region_info:	Called to get BAR size and flags of mediated device.
> + *			@mdev: mediated device structure
> + *			@region_index: VFIO region index
> + *			@region_info: output, returns size and flags of
> + *				      requested region.

I don't see how vfio regions indexes here map to read/write
address_space and position above.
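
For comparison, vfio-pci packs the region index into the high bits of
the file offset, so a single (offset) value identifies both the region
and the position inside it.  The following is only a sketch of that
idea; the macro and function names are made up for illustration, and
nothing in this series says mdev uses (or will use) this encoding:

```c
#include <stdint.h>

/* Hypothetical names; the shift mirrors the scheme vfio-pci uses. */
#define REGION_OFFSET_SHIFT	40
#define REGION_OFFSET_MASK	((UINT64_C(1) << REGION_OFFSET_SHIFT) - 1)

/* Base file offset of a region, given its VFIO region index. */
static inline uint64_t region_index_to_offset(unsigned int index)
{
	return (uint64_t)index << REGION_OFFSET_SHIFT;
}

/* Recover the region index from a file offset. */
static inline unsigned int region_offset_to_index(uint64_t off)
{
	return (unsigned int)(off >> REGION_OFFSET_SHIFT);
}

/* Recover the position inside the region from a file offset. */
static inline uint64_t region_offset_within(uint64_t off)
{
	return off & REGION_OFFSET_MASK;
}
```

With an encoding like this, read()/write() callbacks could take the
region index directly instead of an address-space enum, which would
also cover devices with several MMIO regions.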

> + *			Returns integer: success (0) or error (< 0)
> + * @validate_map_request: Validate remap pfn request
> + *			@mdev: mediated device structure
> + *			@virtaddr: target user address to start at
> + *			@pfn: physical address of kernel memory, vendor driver
> + *			      can change if required.
> + *			@size: size of map area, vendor driver can change the
> + *			       size of map area if desired.
> + *			@prot: page protection flags for this mapping, vendor
> + *			       driver can change, if required.
> + *			Returns integer: success (0) or error (< 0)

Still no invalidation interface?

> + *
> + * Physical device that support mediated device should be registered with mdev
> + * module with phy_device_ops structure.
> + */
> +
> +struct phy_device_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +
> +	int	(*supported_config)(struct device *dev, char *config);
> +	int     (*create)(struct device *dev, uuid_le uuid,
> +			  uint32_t instance, char *mdev_params);
> +	int     (*destroy)(struct device *dev, uuid_le uuid,
> +			   uint32_t instance);
> +	int     (*start)(uuid_le uuid);
> +	int     (*shutdown)(uuid_le uuid);
> +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> +			enum mdev_emul_space address_space, loff_t pos);
> +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> +			 enum mdev_emul_space address_space, loff_t pos);
> +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> +			    unsigned int index, unsigned int start,
> +			    unsigned int count, void *data);
> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> +				 struct pci_region_info *region_info);

*pci*_region_info??

> +	int	(*validate_map_request)(struct mdev_device *vdev,
> +					unsigned long virtaddr,
> +					unsigned long *pfn, unsigned long *size,
> +					pgprot_t *prot);
> +};
> +
> +/*
> + * Physical Device
> + */
> +struct phy_device {
> +	struct device                   *dev;
> +	const struct phy_device_ops     *ops;
> +	struct list_head                next;
> +};

I would really like to be able to use the mediated device interface to
create a purely virtual device, is the expectation that my physical
device interface would create a virtual struct device which would
become the parent and control point in sysfs for creating all the mdev
devices? Should we be calling this a host_device or mdev_parent_dev in
that case since there's really no requirement that it be a physical
device?  Thanks,

Alex

> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @match: called when new device or driver is added for this bus. Return 1 if
> + *	   given device can be handled by given driver and zero otherwise.
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	int  (*match)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}
> +
> +static inline struct mdev_device *mdev_get_device(struct mdev_device *vdev)
> +{
> +	return (vdev && get_device(&vdev->dev)) ? vdev : NULL;
> +}
> +
> +static inline  void mdev_put_device(struct mdev_device *vdev)
> +{
> +	if (vdev)
> +		put_device(&vdev->dev);
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct phy_device_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +extern int mdev_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
> +				uint32_t len, uint32_t flags);
> +
> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
> +
> +#endif /* MDEV_H */



* Re: [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver
@ 2016-05-25 22:39     ` Alex Williamson
  0 siblings, 0 replies; 92+ messages in thread
From: Alex Williamson @ 2016-05-25 22:39 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

On Wed, 25 May 2016 01:28:15 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Design for Mediated Device Driver:
> Main purpose of this driver is to provide a common interface for mediated
> device management that can be used by different drivers of different
> devices.
> 
> This module provides a generic interface to create the device, add it to
> mediated bus, add device to IOMMU group and then add it to vfio group.
> 
> Below is the high Level block diagram, with Nvidia, Intel and IBM devices
> as example, since these are the devices which are going to actively use
> this module as of now.
> 
>  +---------------+
>  |               |
>  | +-----------+ |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |  mdev     | +------------------------>+              |<-> VFIO user
>  | |  bus      | |     probe()/remove()    | vfio_mpci.ko |    APIs
>  | |  driver   | |                         |              |
>  | |           | |                         +--------------+
>  | |           | |  mdev_register_driver() +--------------+
>  | |           | +<------------------------+ __init()     |
>  | |           | |                         |              |
>  | |           | +------------------------>+              |<-> VFIO user
>  | +-----------+ |     probe()/remove()    | vfio_mccw.ko |    APIs
>  |               |                         |              |
>  |  MDEV CORE    |                         +--------------+
>  |   MODULE      |
>  |   mdev.ko     |
>  | +-----------+ |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         |  nvidia.ko   |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | | Physical  | |
>  | |  device   | |  mdev_register_device() +--------------+
>  | | interface | |<------------------------+              |
>  | |           | |                         |  i915.ko     |<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | |           | |
>  | |           | |  mdev_register_device() +--------------+
>  | |           | +<------------------------+              |
>  | |           | |                         | ccw_device.ko|<-> physical
>  | |           | +------------------------>+              |    device
>  | |           | |        callback         +--------------+
>  | +-----------+ |
>  +---------------+
> 
> Core driver provides two types of registration interfaces:
> 1. Registration interface for mediated bus driver:
> 
> /**
>   * struct mdev_driver - Mediated device's driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove:called when device removed
>   * @match: called when new device or driver is added for this bus.
> 	    Return 1 if given device can be handled by given driver and
> 	    zero otherwise.
>   * @driver:device driver structure
>   *
>   **/
> struct mdev_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
> 	 int  (*match)(struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> void mdev_unregister_driver(struct mdev_driver *drv);
> 
> Mediated device's driver for mdev should use this interface to register
> with Core driver. With this, mediated devices driver for such devices is
> responsible to add mediated device to VFIO group.
> 
> 2. Physical device driver interface
> This interface provides the vendor driver with a set of APIs to manage
> physical-device-related work in its own driver. The APIs are:
> - supported_config: provide supported configuration list by the vendor
> 		    driver
> - create: to allocate basic resources in vendor driver for a mediated
> 	  device.
> - destroy: to free resources in vendor driver when mediated device is
> 	   destroyed.
> - start: to initiate mediated device initialization process from vendor
> 	 driver when VM boots and before QEMU starts.
> - shutdown: to teardown mediated device resources during VM teardown.
> - read : read emulation callback.
> - write: write emulation callback.
> - set_irqs: send interrupt configuration information that QEMU sets.
> - get_region_info: to provide region size and its flags for the mediated
> 		   device.
> - validate_map_request: to validate remap pfn request.

nit, vfio is a userspace driver interface where QEMU is simply a user
of that interface.  We should never assume QEMU is the only user.

> 
> This registration interface should be used by vendor drivers to register
> each physical device to mdev core driver.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I88f4482f7608f40550a152c5f882b64271287c62
> ---
>  drivers/vfio/Kconfig             |   1 +
>  drivers/vfio/Makefile            |   1 +
>  drivers/vfio/mdev/Kconfig        |  11 +
>  drivers/vfio/mdev/Makefile       |   5 +
>  drivers/vfio/mdev/mdev-core.c    | 462 +++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev-driver.c  | 139 ++++++++++++
>  drivers/vfio/mdev/mdev-sysfs.c   | 312 ++++++++++++++++++++++++++
>  drivers/vfio/mdev/mdev_private.h |  33 +++
>  include/linux/mdev.h             | 224 +++++++++++++++++++
>  9 files changed, 1188 insertions(+)
>  create mode 100644 drivers/vfio/mdev/Kconfig
>  create mode 100644 drivers/vfio/mdev/Makefile
>  create mode 100644 drivers/vfio/mdev/mdev-core.c
>  create mode 100644 drivers/vfio/mdev/mdev-driver.c
>  create mode 100644 drivers/vfio/mdev/mdev-sysfs.c
>  create mode 100644 drivers/vfio/mdev/mdev_private.h

- or _, pick one.  Underscore is more prevalent.

>  create mode 100644 include/linux/mdev.h
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index da6e2ce77495..23eced02aaf6 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
>  
>  source "drivers/vfio/pci/Kconfig"
>  source "drivers/vfio/platform/Kconfig"
> +source "drivers/vfio/mdev/Kconfig"
>  source "virt/lib/Kconfig"
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 7b8a31f63fea..7c70753e54ab 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
>  obj-$(CONFIG_VFIO_PLATFORM) += platform/
> +obj-$(CONFIG_MDEV) += mdev/
> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
> new file mode 100644
> index 000000000000..951e2bb06a3f
> --- /dev/null
> +++ b/drivers/vfio/mdev/Kconfig
> @@ -0,0 +1,11 @@
> +
> +config MDEV
> +    tristate "Mediated device driver framework"
> +    depends on VFIO
> +    default n
> +    help
> +        MDEV provides a framework to virtualize device without SR-IOV cap
> +        See Documentation/mdev.txt for more details.

I don't see that file anywhere in this series.  Also note that SR-IOV
is a PCI specific technology while as a framework this is specifically
not limited to PCI.  In fact, there's really no requirement here that
physical hardware even exists, this interface could be used to provide
in-kernel emulation of a device.

> +
> +        If you don't know what to do here, say N.
> +
> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
> new file mode 100644
> index 000000000000..4adb069febce
> --- /dev/null
> +++ b/drivers/vfio/mdev/Makefile
> @@ -0,0 +1,5 @@
> +
> +mdev-y := mdev-core.o mdev-sysfs.o mdev-driver.o
> +
> +obj-$(CONFIG_MDEV) += mdev.o
> +
> diff --git a/drivers/vfio/mdev/mdev-core.c b/drivers/vfio/mdev/mdev-core.c
> new file mode 100644
> index 000000000000..af070d73735f
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev-core.c
> @@ -0,0 +1,462 @@
> +/*
> + * Mediated device Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/slab.h>
> +#include <linux/cdev.h>
> +#include <linux/sched.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +#define DRIVER_VERSION		"0.1"
> +#define DRIVER_AUTHOR		"NVIDIA Corporation"
> +#define DRIVER_DESC		"Mediated device Core Driver"
> +
> +#define MDEV_CLASS_NAME		"mdev"
> +
> +/*
> + * Global Structures
> + */
> +
> +static struct devices_list {
> +	struct list_head    dev_list;
> +	struct mutex        list_lock;
> +} mdevices, phy_devices;
> +
> +/*
> + * Functions
> + */
> +
> +static int mdev_add_attribute_group(struct device *dev,
> +				    const struct attribute_group **groups)
> +{
> +	return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void mdev_remove_attribute_group(struct device *dev,
> +					const struct attribute_group **groups)
> +{
> +	sysfs_remove_groups(&dev->kobj, groups);
> +}
> +
> +static struct mdev_device *find_mdev_device(uuid_le uuid, int instance)
> +{
> +	struct mdev_device *vdev = NULL, *v;
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_for_each_entry(v, &mdevices.dev_list, next) {
> +		if ((uuid_le_cmp(v->uuid, uuid) == 0) &&
> +		    (v->instance == instance)) {
> +			vdev = v;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&mdevices.list_lock);
> +	return vdev;
> +}
> +
> +static struct mdev_device *find_next_mdev_device(struct phy_device *phy_dev)
> +{

What's "next" about this function?  "next" generally means the caller
provides a starting point in the list and the search continues from
there.

> +	struct mdev_device *mdev = NULL, *p;
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_for_each_entry(p, &mdevices.dev_list, next) {
> +		if (p->phy_dev == phy_dev) {
> +			mdev = p;
> +			break;
> +		}
> +	}

Ah, I see from the unregister below that this is intended as a first
entry type function, so the naming is not consistent with Linux
terminology.  Suggest "first" in the name.

> +	mutex_unlock(&mdevices.list_lock);
> +	return mdev;
> +}
> +
> +static struct phy_device *find_physical_device(struct device *dev)
> +{
> +	struct phy_device *pdev = NULL, *p;
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	list_for_each_entry(p, &phy_devices.dev_list, next) {
> +		if (p->dev == dev) {
> +			pdev = p;
> +			break;
> +		}
> +	}
> +	mutex_unlock(&phy_devices.list_lock);
> +	return pdev;
> +}
> +
> +static void mdev_destroy_device(struct mdev_device *mdevice)
> +{
> +	struct phy_device *phy_dev = mdevice->phy_dev;
> +
> +	if (phy_dev) {
> +		mutex_lock(&phy_devices.list_lock);
> +
> +		/*
> +		* If vendor driver doesn't return success that means vendor
> +		* driver doesn't support hot-unplug
> +		*/
> +		if (phy_dev->ops->destroy) {
> +			if (phy_dev->ops->destroy(phy_dev->dev, mdevice->uuid,
> +						  mdevice->instance)) {
> +				mutex_unlock(&phy_devices.list_lock);
> +				return;
> +			}
> +		}
> +
> +		mdev_remove_attribute_group(&mdevice->dev,
> +					    phy_dev->ops->mdev_attr_groups);
> +		mdevice->phy_dev = NULL;
> +		mutex_unlock(&phy_devices.list_lock);

Locking here appears arbitrary, how does the above code interact with
phy_devices.dev_list?

> +	}
> +
> +	mdev_put_device(mdevice);
> +	device_unregister(&mdevice->dev);
> +}
> +
> +/*
> + * Find mediated device from given iommu_group and increment refcount of
> + * mediated device. Caller should call mdev_put_device() when the use of
> + * mdev_device is done.
> + */
> +struct mdev_device *mdev_get_device_by_group(struct iommu_group *group)
> +{
> +	struct mdev_device *mdev = NULL, *p;
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_for_each_entry(p, &mdevices.dev_list, next) {
> +		if (!p->group)
> +			continue;
> +
> +		if (iommu_group_id(p->group) == iommu_group_id(group)) {
> +			mdev = mdev_get_device(p);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&mdevices.list_lock);
> +	return mdev;
> +}
> +EXPORT_SYMBOL_GPL(mdev_get_device_by_group);
> +
> +/*
> + * mdev_register_device : Register a device
> + * @dev: device structure representing physical device.
> + * @phy_device_ops: Physical device operation structure to be registered.

Why are we insistent that there's a physical device?

> + *
> + * Add device to list of registered physical devices.
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_device(struct device *dev, const struct phy_device_ops *ops)
> +{
> +	int ret = 0;
> +	struct phy_device *phy_dev, *pdev;
> +
> +	if (!dev || !ops)
> +		return -EINVAL;
> +
> +	/* Check for duplicate */
> +	pdev = find_physical_device(dev);
> +	if (pdev)
> +		return -EEXIST;
> +

Why do we need a separate variable for this test vs just using phy_dev?

> +	phy_dev = kzalloc(sizeof(*phy_dev), GFP_KERNEL);
> +	if (!phy_dev)
> +		return -ENOMEM;
> +
> +	phy_dev->dev = dev;
> +	phy_dev->ops = ops;
> +

There's a race between where we searched for the physical device above
and when we grab this lock.
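
One way to close that window is to do the duplicate check and the list
insertion inside the same critical section.  The sketch below shows the
pattern in userspace C, with a pthread mutex standing in for the kernel
mutex and a hand-rolled singly linked list standing in for list_head;
struct parent, find_parent_locked() and register_parent() are
hypothetical names, not part of the patch:

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct phy_device. */
struct parent {
	const void *dev;	/* key: the registering device */
	struct parent *next;
};

static struct parent *parent_list;
static pthread_mutex_t parent_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold parent_lock. */
static struct parent *find_parent_locked(const void *dev)
{
	struct parent *p;

	for (p = parent_list; p; p = p->next)
		if (p->dev == dev)
			return p;
	return NULL;
}

/* Duplicate check and insertion share one critical section. */
int register_parent(const void *dev)
{
	struct parent *p;
	int ret = 0;

	if (!dev)
		return -EINVAL;

	/* Allocate before locking to keep the critical section short. */
	p = calloc(1, sizeof(*p));
	if (!p)
		return -ENOMEM;
	p->dev = dev;

	pthread_mutex_lock(&parent_lock);
	if (find_parent_locked(dev)) {
		ret = -EEXIST;
	} else {
		p->next = parent_list;
		parent_list = p;
	}
	pthread_mutex_unlock(&parent_lock);

	if (ret)
		free(p);
	return ret;
}
```

Because the lookup no longer takes the lock itself, two concurrent
registrations of the same device cannot both pass the check.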

> +	mutex_lock(&phy_devices.list_lock);
> +	ret = mdev_create_sysfs_files(dev);
> +	if (ret)
> +		goto add_sysfs_error;
> +
> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;
> +
> +	list_add(&phy_dev->next, &phy_devices.dev_list);
> +	dev_info(dev, "MDEV: Registered\n");
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	return 0;
> +
> +add_group_error:
> +	mdev_remove_sysfs_files(dev);
> +add_sysfs_error:
> +	mutex_unlock(&phy_devices.list_lock);
> +	kfree(phy_dev);
> +	return ret;
> +}
> +EXPORT_SYMBOL(mdev_register_device);
> +
> +/*
> + * mdev_unregister_device : Unregister a physical device
> + * @dev: device structure representing physical device.
> + *
> + * Remove device from list of registered physical devices. Gives a chance to
> + * free existing mediated devices for the given physical device.
> + */
> +
> +void mdev_unregister_device(struct device *dev)
> +{
> +	struct phy_device *phy_dev;
> +	struct mdev_device *vdev = NULL;

vdev?  Why not mdev?

> +
> +	phy_dev = find_physical_device(dev);
> +
> +	if (!phy_dev)
> +		return;
> +
> +	dev_info(dev, "MDEV: Unregistering\n");
> +
> +	while ((vdev = find_next_mdev_device(phy_dev)))
> +		mdev_destroy_device(vdev);
> +

I'm guessing there's a race here that a new mdev could be added before
the phy_dev gets removed.  Probably need to fix ordering to remove the
phy_dev from the list first to prevent new mdevs being created from it.
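
A rough userspace sketch of that ordering follows, assuming a simple
"registered" flag in place of list membership and a counter in place of
the list of child mdevs.  All names are hypothetical and the pthread
mutex stands in for the kernel one:

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical model of a parent device and its children. */
struct parent {
	int registered;		/* nonzero while new children are allowed */
	int nr_children;
};

static pthread_mutex_t parent_lock = PTHREAD_MUTEX_INITIALIZER;

/* Creation is refused once the parent is no longer registered. */
int create_child(struct parent *p)
{
	int ret = 0;

	pthread_mutex_lock(&parent_lock);
	if (!p->registered)
		ret = -ENODEV;
	else
		p->nr_children++;
	pthread_mutex_unlock(&parent_lock);
	return ret;
}

void unregister_parent(struct parent *p)
{
	/* Step 1: make the parent unreachable for new creations. */
	pthread_mutex_lock(&parent_lock);
	p->registered = 0;
	pthread_mutex_unlock(&parent_lock);

	/* Step 2: tear down the children; no new ones can appear now. */
	pthread_mutex_lock(&parent_lock);
	while (p->nr_children > 0)
		p->nr_children--;	/* stands in for a per-child destroy */
	pthread_mutex_unlock(&parent_lock);
}
```

The point is only the order of the two steps: once step 1 has run, the
teardown loop in step 2 is guaranteed to terminate with no children
left behind.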

> +	mutex_lock(&phy_devices.list_lock);
> +	list_del(&phy_dev->next);
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	mdev_remove_attribute_group(dev,
> +				    phy_dev->ops->dev_attr_groups);
> +
> +	mdev_remove_sysfs_files(dev);
> +	kfree(phy_dev);
> +}
> +EXPORT_SYMBOL(mdev_unregister_device);
> +
> +/*
> + * Functions required for mdev-sysfs
> + */
> +
> +static struct mdev_device *mdev_device_alloc(uuid_le uuid, int instance)
> +{
> +	struct mdev_device *mdevice = NULL;

Used mdev above, then vdev, now mdevice, pick one.

> +
> +	mdevice = kzalloc(sizeof(*mdevice), GFP_KERNEL);
> +	if (!mdevice)
> +		return ERR_PTR(-ENOMEM);
> +
> +	kref_init(&mdevice->kref);
> +	memcpy(&mdevice->uuid, &uuid, sizeof(uuid_le));
> +	mdevice->instance = instance;
> +	mutex_init(&mdevice->ops_lock);
> +
> +	return mdevice;
> +}
> +
> +static void mdev_device_release(struct device *dev)
> +{
> +	struct mdev_device *mdevice = to_mdev_device(dev);
> +
> +	if (!mdevice)
> +		return;
> +
> +	dev_info(&mdevice->dev, "MDEV: destroying\n");
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_del(&mdevice->next);
> +	mutex_unlock(&mdevices.list_lock);
> +
> +	kfree(mdevice);
> +}
> +
> +int create_mdev_device(struct device *dev, uuid_le uuid, uint32_t instance,
> +		       char *mdev_params)

Naming seems inconsistent, mdev_device_create()?

> +{
> +	int retval = 0;
> +	struct mdev_device *mdevice = NULL;
> +	struct phy_device *phy_dev;
> +
> +	phy_dev = find_physical_device(dev);
> +	if (!phy_dev)
> +		return -EINVAL;
> +
> +	mdevice = mdev_device_alloc(uuid, instance);
> +	if (IS_ERR(mdevice)) {
> +		retval = PTR_ERR(mdevice);
> +		return retval;
> +	}
> +
> +	mdevice->dev.parent  = dev;
> +	mdevice->dev.bus     = &mdev_bus_type;
> +	mdevice->dev.release = mdev_device_release;
> +	dev_set_name(&mdevice->dev, "%pUb-%d", uuid.b, instance);
> +
> +	mutex_lock(&mdevices.list_lock);
> +	list_add(&mdevice->next, &mdevices.dev_list);

We assume no conflicts?

> +	mutex_unlock(&mdevices.list_lock);
> +
> +	retval = device_register(&mdevice->dev);
> +	if (retval) {
> +		mdev_put_device(mdevice);
> +		return retval;
> +	}
> +
> +	mutex_lock(&phy_devices.list_lock);

What are we locking here?  We found phy_dev under lock but we have
absolutely no guarantee that it still exists and holding this mutex
doesn't change that.

> +	if (phy_dev->ops->create) {
> +		retval = phy_dev->ops->create(dev, mdevice->uuid,
> +					      instance, mdev_params);
> +		if (retval)
> +			goto create_failed;
> +	}
> +
> +	retval = mdev_add_attribute_group(&mdevice->dev,
> +					  phy_dev->ops->mdev_attr_groups);
> +	if (retval)
> +		goto create_failed;
> +
> +	mdevice->phy_dev = phy_dev;
> +	mutex_unlock(&phy_devices.list_lock);
> +	mdev_get_device(mdevice);
> +	dev_info(&mdevice->dev, "MDEV: created\n");
> +
> +	return retval;
> +
> +create_failed:
> +	mutex_unlock(&phy_devices.list_lock);
> +	device_unregister(&mdevice->dev);
> +	return retval;
> +}
> +
> +int destroy_mdev_device(uuid_le uuid, uint32_t instance)
> +{
> +	struct mdev_device *vdev;
> +
> +	vdev = find_mdev_device(uuid, instance);
> +
> +	if (!vdev)
> +		return -EINVAL;
> +
> +	mdev_destroy_device(vdev);
> +	return 0;
> +}
> +
> +void get_mdev_supported_types(struct device *dev, char *str)

Is there some defined max for the string?  How do we know how much the
caller has allocated?  Should we have a char** here so we can allocate
it?
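
To make the question concrete, here is a userspace sketch of the
char ** variant: the callback allocates the string, and the caller
copies it into a buffer whose size it knows, then frees it.  The
function names and the type strings are invented for illustration; in
the kernel the allocation would more likely use something like
kasprintf() and the consumer would be a sysfs show routine:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical vendor callback: allocates *config, caller frees it. */
int supported_config(char **config)
{
	static const char types[] = "type-a\ntype-b\n";
	char *s = malloc(sizeof(types));

	if (!s)
		return -ENOMEM;
	memcpy(s, types, sizeof(types));
	*config = s;
	return 0;
}

/*
 * Hypothetical consumer: copies at most size - 1 bytes into buf and
 * returns the number of bytes stored, or a negative errno.
 */
int show_supported_types(char *buf, size_t size)
{
	char *config = NULL;
	int ret, len;

	if (!buf || !size)
		return -EINVAL;

	ret = supported_config(&config);
	if (ret)
		return ret;

	len = snprintf(buf, size, "%s", config);
	free(config);
	if (len < 0)
		return -EIO;
	if ((size_t)len >= size)
		len = (int)size - 1;	/* output was truncated */
	return len;
}
```

Either way, the size of the destination is explicit at the point where
the copy happens, rather than being an unstated contract between the
core and the vendor driver.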

> +{
> +	struct phy_device *phy_dev;
> +
> +	phy_dev = find_physical_device(dev);
> +
> +	if (phy_dev) {
> +		mutex_lock(&phy_devices.list_lock);

Again, this lock doesn't protect anything.  We either need a reference
or we need end-to-end locking.
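
The reference-based option could look roughly like the userspace sketch
below: the lookup takes a reference while still holding the list lock,
so the object cannot be freed between "find" and "use".  A plain
integer under the lock stands in for a kref, and every name here is
hypothetical:

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct phy_device. */
struct parent {
	const void *dev;
	int refcount;		/* protected by parent_lock */
	struct parent *next;
};

static struct parent *parent_list;
static pthread_mutex_t parent_lock = PTHREAD_MUTEX_INITIALIZER;

/* Add a parent; the list itself holds the initial reference. */
struct parent *parent_add(const void *dev)
{
	struct parent *p = calloc(1, sizeof(*p));

	if (!p)
		return NULL;
	p->dev = dev;
	p->refcount = 1;
	pthread_mutex_lock(&parent_lock);
	p->next = parent_list;
	parent_list = p;
	pthread_mutex_unlock(&parent_lock);
	return p;
}

/* Find and take a reference in one critical section. */
struct parent *parent_get(const void *dev)
{
	struct parent *p;

	pthread_mutex_lock(&parent_lock);
	for (p = parent_list; p; p = p->next) {
		if (p->dev == dev) {
			p->refcount++;
			break;
		}
	}
	pthread_mutex_unlock(&parent_lock);
	return p;
}

/* Drop a reference; the last put frees the object. */
void parent_put(struct parent *p)
{
	int last;

	pthread_mutex_lock(&parent_lock);
	last = (--p->refcount == 0);
	pthread_mutex_unlock(&parent_lock);
	if (last)
		free(p);
}

/* Unlink from the list and drop the list's reference. */
void parent_remove(struct parent *p)
{
	struct parent **pp;

	pthread_mutex_lock(&parent_lock);
	for (pp = &parent_list; *pp; pp = &(*pp)->next) {
		if (*pp == p) {
			*pp = p->next;
			break;
		}
	}
	pthread_mutex_unlock(&parent_lock);
	parent_put(p);
}
```

A caller would then bracket each ops callback with parent_get() and
parent_put() instead of taking the list lock around the call.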

> +		if (phy_dev->ops->supported_config)
> +			phy_dev->ops->supported_config(phy_dev->dev, str);
> +		mutex_unlock(&phy_devices.list_lock);
> +	}
> +}
> +
> +int mdev_start_callback(uuid_le uuid, uint32_t instance)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdevice;
> +	struct phy_device *phy_dev;
> +
> +	mdevice = find_mdev_device(uuid, instance);
> +
> +	if (!mdevice)
> +		return -EINVAL;
> +
> +	phy_dev = mdevice->phy_dev;
> +
> +	mutex_lock(&phy_devices.list_lock);

Ineffective locking...

> +	if (phy_dev->ops->start)
> +		ret = phy_dev->ops->start(mdevice->uuid);
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	if (ret < 0)
> +		pr_err("mdev_start failed  %d\n", ret);
> +	else
> +		kobject_uevent(&mdevice->dev.kobj, KOBJ_ONLINE);
> +
> +	return ret;
> +}
> +
> +int mdev_shutdown_callback(uuid_le uuid, uint32_t instance)
> +{
> +	int ret = 0;
> +	struct mdev_device *mdevice;
> +	struct phy_device *phy_dev;
> +
> +	mdevice = find_mdev_device(uuid, instance);
> +
> +	if (!mdevice)
> +		return -EINVAL;
> +
> +	phy_dev = mdevice->phy_dev;
> +
> +	mutex_lock(&phy_devices.list_lock);
> +	if (phy_dev->ops->shutdown)
> +		ret = phy_dev->ops->shutdown(mdevice->uuid);
> +	mutex_unlock(&phy_devices.list_lock);
> +
> +	if (ret < 0)
> +		pr_err("mdev_shutdown failed %d\n", ret);
> +	else
> +		kobject_uevent(&mdevice->dev.kobj, KOBJ_OFFLINE);
> +
> +	return ret;
> +}
> +
> +static struct class mdev_class = {
> +	.name		= MDEV_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= mdev_class_attrs,
> +};
> +
> +static int __init mdev_init(void)
> +{
> +	int rc = 0;
> +
> +	mutex_init(&mdevices.list_lock);
> +	INIT_LIST_HEAD(&mdevices.dev_list);
> +	mutex_init(&phy_devices.list_lock);
> +	INIT_LIST_HEAD(&phy_devices.dev_list);
> +
> +	rc = class_register(&mdev_class);
> +	if (rc < 0) {
> +		pr_err("Failed to register mdev class\n");
> +		return rc;
> +	}
> +
> +	rc = mdev_bus_register();
> +	if (rc < 0) {
> +		pr_err("Failed to register mdev bus\n");
> +		class_unregister(&mdev_class);
> +		return rc;
> +	}
> +
> +	return rc;
> +}
> +
> +static void __exit mdev_exit(void)
> +{
> +	mdev_bus_unregister();
> +	class_unregister(&mdev_class);
> +}
> +
> +module_init(mdev_init)
> +module_exit(mdev_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/mdev/mdev-driver.c b/drivers/vfio/mdev/mdev-driver.c
> new file mode 100644
> index 000000000000..bc8a169782bc
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev-driver.c
> @@ -0,0 +1,139 @@
> +/*
> + * MDEV driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/device.h>
> +#include <linux/iommu.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +static int mdevice_attach_iommu(struct mdev_device *mdevice)
> +{
> +	int retval = 0;
> +	struct iommu_group *group = NULL;
> +
> +	group = iommu_group_alloc();
> +	if (IS_ERR(group)) {
> +		dev_err(&mdevice->dev, "MDEV: failed to allocate group!\n");
> +		return PTR_ERR(group);
> +	}
> +
> +	retval = iommu_group_add_device(group, &mdevice->dev);
> +	if (retval) {
> +		dev_err(&mdevice->dev, "MDEV: failed to add dev to group!\n");
> +		goto attach_fail;
> +	}
> +
> +	mdevice->group = group;
> +
> +	dev_info(&mdevice->dev, "MDEV: group_id = %d\n",
> +				 iommu_group_id(group));

I assume a lot of these should probably be dev_dbg() or just removed
before we actually think about committing this code.

> +attach_fail:
> +	iommu_group_put(group);
> +	return retval;
> +}
> +
> +static void mdevice_detach_iommu(struct mdev_device *mdevice)
> +{
> +	iommu_group_remove_device(&mdevice->dev);
> +	dev_info(&mdevice->dev, "MDEV: detaching iommu\n");
> +}
> +
> +static int mdevice_probe(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdevice = to_mdev_device(dev);
> +	int status = 0;

status here, retval above, ret in previous file, please use some
consistency.  mdevice vs mdev, same.

> +
> +	status = mdevice_attach_iommu(mdevice);
> +	if (status) {
> +		dev_err(dev, "Failed to attach IOMMU\n");
> +		return status;
> +	}
> +
> +	if (drv && drv->probe)
> +		status = drv->probe(dev);
> +
> +	return status;
> +}
> +
> +static int mdevice_remove(struct device *dev)
> +{
> +	struct mdev_driver *drv = to_mdev_driver(dev->driver);
> +	struct mdev_device *mdevice = to_mdev_device(dev);
> +
> +	if (drv && drv->remove)
> +		drv->remove(dev);
> +
> +	mdevice_detach_iommu(mdevice);
> +
> +	return 0;
> +}
> +
> +static int mdevice_match(struct device *dev, struct device_driver *drv)
> +{
> +	int ret = 0;
> +	struct mdev_driver *mdrv = to_mdev_driver(drv);
> +
> +	if (mdrv && mdrv->match)
> +		ret = mdrv->match(dev);
> +
> +	return ret;
> +}
> +
> +struct bus_type mdev_bus_type = {
> +	.name		= "mdev",
> +	.match		= mdevice_match,
> +	.probe		= mdevice_probe,
> +	.remove		= mdevice_remove,
> +};
> +EXPORT_SYMBOL_GPL(mdev_bus_type);
> +
> +/**
> + * mdev_register_driver - register a new MDEV driver
> + * @drv: the driver to register
> + * @owner: owner module of driver to register
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int mdev_register_driver(struct mdev_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &mdev_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_register_driver);
> +
> +/**
> + * mdev_unregister_driver - unregister MDEV driver
> + * @drv: the driver to unregister
> + *
> + */
> +void mdev_unregister_driver(struct mdev_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(mdev_unregister_driver);
> +
> +int mdev_bus_register(void)
> +{
> +	return bus_register(&mdev_bus_type);
> +}
> +
> +void mdev_bus_unregister(void)
> +{
> +	bus_unregister(&mdev_bus_type);
> +}
> diff --git a/drivers/vfio/mdev/mdev-sysfs.c b/drivers/vfio/mdev/mdev-sysfs.c
> new file mode 100644
> index 000000000000..79d351a7a502
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev-sysfs.c
> @@ -0,0 +1,312 @@
> +/*
> + * File attributes for Mediated devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/uuid.h>
> +#include <linux/mdev.h>
> +
> +#include "mdev_private.h"
> +
> +/* Prototypes */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(mdev_supported_types);
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_create);
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(mdev_destroy);
> +
> +/* Static functions */
> +
> +#define UUID_CHAR_LENGTH	36
> +#define UUID_BYTE_LENGTH	16
> +
> +#define SUPPORTED_TYPE_BUFFER_LENGTH	1024
> +
> +static inline bool is_uuid_sep(char sep)
> +{
> +	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
> +		return true;
> +	return false;
> +}
> +
> +static int uuid_parse(const char *str, uuid_le *uuid)
> +{
> +	int i;
> +
> +	if (strlen(str) < UUID_CHAR_LENGTH)
> +		return -1;
> +
> +	for (i = 0; i < UUID_BYTE_LENGTH; i++) {
> +		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
> +			pr_err("%s err", __func__);
> +			return -EINVAL;
> +		}
> +
> +		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
> +		str += 2;
> +		if (is_uuid_sep(*str))
> +			str++;
> +	}
> +
> +	return 0;
> +}
> +
> +/* Functions */
> +static ssize_t mdev_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str;
> +	ssize_t n;
> +
> +	str = kzalloc(sizeof(*str) * SUPPORTED_TYPE_BUFFER_LENGTH, GFP_KERNEL);
> +	if (!str)
> +		return -ENOMEM;
> +
> +	get_mdev_supported_types(dev, str);
> +
> +	n = sprintf(buf, "%s\n", str);
> +	kfree(str);
> +
> +	return n;
> +}
> +
> +static ssize_t mdev_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *instance_str, *mdev_params = NULL;
> +	uuid_le uuid;
> +	uint32_t instance;
> +	int ret = 0;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_create: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev instance not present %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	instance_str = strsep(&str, ":");
> +	if (!instance_str) {
> +		pr_err("mdev_create: Empty instance string %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	ret = kstrtouint(instance_str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_create: mdev instance parsing error %s\n", buf);
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		pr_err("mdev_create: mdev params not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}

Are they necessarily required?  What if the driver doesn't have
multiple types?  The supported_config callback is optional per previous
code.
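
If the third field is meant to be optional, the parser could treat a
missing ":params" suffix as "no parameters" rather than an error.  A
minimal userspace sketch, assuming the "uuid:instance[:params]" layout
that mdev_create accepts; split_create_args() is a hypothetical helper
and not a function from the patch:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Split "uuid:instance[:params]" in place.  On success *params is
 * NULL when the third field is absent.  Returns 0 or -1.
 */
static int split_create_args(char *s, char **uuid, unsigned int *instance,
			     char **params)
{
	char *inst, *end, *sep;
	unsigned long val;

	assert(s && uuid && instance && params);

	sep = strchr(s, ':');
	if (!sep || sep == s)
		return -1;		/* missing UUID or instance */
	*sep = '\0';
	*uuid = s;
	inst = sep + 1;

	/* The params field is optional: cut it off only if present. */
	sep = strchr(inst, ':');
	if (sep)
		*sep = '\0';

	if (!isdigit((unsigned char)*inst))
		return -1;		/* empty or non-numeric instance */
	errno = 0;
	val = strtoul(inst, &end, 0);
	if (*end == '\n')
		end++;			/* tolerate the newline from echo */
	if (errno || *end != '\0' || val > 0xffffffffUL)
		return -1;

	*instance = (unsigned int)val;
	*params = sep ? sep + 1 : NULL;
	return 0;
}
```

With this shape the caller can pass a NULL params pointer straight
through to a vendor driver that has only one type, and reject it only
when the driver actually needs parameters.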

> +
> +	mdev_params = kstrdup(str, GFP_KERNEL);
> +
> +	if (!mdev_params) {
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_create: UUID parse error %s\n", buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (create_mdev_device(dev, uuid, instance, mdev_params) < 0) {
> +		pr_err("mdev_create: Failed to create mdev device\n");
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +	ret = count;
> +
> +create_error:
> +	kfree(mdev_params);
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +static ssize_t mdev_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	char *uuid_str, *str, *pstr;
> +	uuid_le uuid;
> +	unsigned int instance;
> +	int ret;
> +
> +	str = pstr = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	uuid_str = strsep(&str, ":");
> +	if (!uuid_str) {
> +		pr_err("mdev_destroy: Empty UUID string %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	if (str == NULL) {
> +		pr_err("mdev_destroy: instance not specified %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = kstrtouint(str, 0, &instance);
> +	if (ret) {
> +		pr_err("mdev_destroy: instance parsing error %s\n", buf);
> +		goto destroy_error;
> +	}
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_destroy: UUID parse error  %s\n", buf);
> +		ret = -EINVAL;
> +		goto destroy_error;
> +	}
> +
> +	ret = destroy_mdev_device(uuid, instance);
> +	if (ret < 0)
> +		goto destroy_error;
> +
> +	ret = count;
> +
> +destroy_error:
> +	kfree(pstr);
> +	return ret;
> +}
> +
> +ssize_t mdev_start_store(struct class *class, struct class_attribute *attr,
> +			 const char *buf, size_t count)
> +{
> +	char *uuid_str;
> +	uuid_le uuid;
> +	int ret = 0;
> +
> +	uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_start: UUID parse error  %s\n", buf);
> +		ret = -EINVAL;
> +		goto start_error;
> +	}
> +
> +	ret = mdev_start_callback(uuid, 0);
> +	if (ret < 0)
> +		goto start_error;
> +
> +	ret = count;
> +
> +start_error:
> +	kfree(uuid_str);
> +	return ret;
> +}
> +
> +ssize_t mdev_shutdown_store(struct class *class, struct class_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *uuid_str;
> +	uuid_le uuid;
> +	int ret = 0;
> +
> +	uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		pr_err("mdev_shutdown: UUID parse error %s\n", buf);
> +		ret = -EINVAL;
> +	}
> +
> +	ret = mdev_shutdown_callback(uuid, 0);
> +	if (ret < 0)
> +		goto shutdown_error;
> +
> +	ret = count;
> +
> +shutdown_error:
> +	kfree(uuid_str);
> +	return ret;
> +
> +}
> +
> +struct class_attribute mdev_class_attrs[] = {
> +	__ATTR_WO(mdev_start),
> +	__ATTR_WO(mdev_shutdown),
> +	__ATTR_NULL
> +};
> +
> +int mdev_create_sysfs_files(struct device *dev)
> +{
> +	int retval;
> +
> +	retval = sysfs_create_file(&dev->kobj,
> +				   &dev_attr_mdev_supported_types.attr);
> +	if (retval) {
> +		pr_err("Failed to create mdev_supported_types sysfs entry\n");
> +		return retval;
> +	}
> +
> +	retval = sysfs_create_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	if (retval) {
> +		pr_err("Failed to create mdev_create sysfs entry\n");
> +		return retval;
> +	}
> +
> +	retval = sysfs_create_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +	if (retval) {
> +		pr_err("Failed to create mdev_destroy sysfs entry\n");
> +		return retval;
> +	}
> +
> +	return 0;
> +}
> +
> +void mdev_remove_sysfs_files(struct device *dev)
> +{
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_supported_types.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_create.attr);
> +	sysfs_remove_file(&dev->kobj, &dev_attr_mdev_destroy.attr);
> +}
> diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
> new file mode 100644
> index 000000000000..a472310c7749
> --- /dev/null
> +++ b/drivers/vfio/mdev/mdev_private.h
> @@ -0,0 +1,33 @@
> +/*
> + * Mediated device internal definitions
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_PRIVATE_H
> +#define MDEV_PRIVATE_H
> +
> +int  mdev_bus_register(void);
> +void mdev_bus_unregister(void);
> +
> +/* Function prototypes for mdev_sysfs */
> +
> +extern struct class_attribute mdev_class_attrs[];
> +
> +int  mdev_create_sysfs_files(struct device *dev);
> +void mdev_remove_sysfs_files(struct device *dev);
> +
> +int  create_mdev_device(struct device *dev, uuid_le uuid, uint32_t instance,
> +			char *mdev_params);
> +int  destroy_mdev_device(uuid_le uuid, uint32_t instance);
> +void get_mdev_supported_types(struct device *dev, char *str);
> +int  mdev_start_callback(uuid_le uuid, uint32_t instance);
> +int  mdev_shutdown_callback(uuid_le uuid, uint32_t instance);
> +
> +#endif /* MDEV_PRIVATE_H */
> diff --git a/include/linux/mdev.h b/include/linux/mdev.h
> new file mode 100644
> index 000000000000..d9633acd85f2
> --- /dev/null
> +++ b/include/linux/mdev.h
> @@ -0,0 +1,224 @@
> +/*
> + * Mediated device definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef MDEV_H
> +#define MDEV_H
> +
> +/* Common Data structures */
> +
> +struct pci_region_info {
> +	uint64_t start;
> +	uint64_t size;
> +	uint32_t flags;		/*!< VFIO region info flags */

Perhaps more clear to say "Same as vfio_region_info.flags"  Also prefix
with mdev_

What's with this comment style /*!< ?

> +};
> +
> +enum mdev_emul_space {
> +	EMUL_CONFIG_SPACE,	/*!< PCI configuration space */
> +	EMUL_IO,		/*!< I/O register space */
> +	EMUL_MMIO		/*!< Memory-mapped I/O space */
> +};
> +
> +struct phy_device;
> +
> +/*
> + * Mediated device
> + */
> +
> +struct mdev_device {
> +	struct kref		kref;
> +	struct device		dev;
> +	struct phy_device	*phy_dev;
> +	struct iommu_group	*group;
> +	void			*iommu_data;
> +	uuid_le			uuid;
> +	uint32_t		instance;
> +	void			*driver_data;
> +	struct mutex		ops_lock;
> +	struct list_head	next;
> +};

Could this be in the private header?  Seems like this should be opaque
outside of mdev core.

> +
> +
> +/**
> + * struct phy_device_ops - Structure to be registered for each physical device
> + * to register the device to mdev module.
> + *
> + * @owner:		The module owner.
> + * @dev_attr_groups:	Default attributes of the physical device.
> + * @mdev_attr_groups:	Default attributes of the mediated device.
> + * @supported_config:	Called to get information about supported types.
> + *			@dev : device structure of physical device.
> + *			@config: should return string listing supported config
> + *			Returns integer: success (0) or error (< 0)
> + * @create:		Called to allocate basic resources in physical device's
> + *			driver for a particular mediated device
> + *			@dev: physical pci device structure on which mediated
> + *			      device should be created

Not necessarily pci.

> + *			@uuid: VM's uuid for which VM it is intended to
> + *			@instance: mediated instance in that VM
> + *			@mdev_params: extra parameters required by physical
> + *			device's driver.
> + *			Returns integer: success (0) or error (< 0)
> + * @destroy:		Called to free resources in physical device's driver for
> + *			a mediated device instance of that VM.
> + *			@dev: physical device structure to which this mediated
> + *			      device points to.
> + *			@uuid: VM's uuid for which the mediated device belongs
> + *			@instance: mdev instance in that VM
> + *			Returns integer: success (0) or error (< 0)
> + *			If VM is running and destroy() is called that means the
> + *			mdev is being hotunpluged. Return error if VM is running
> + *			and driver doesn't support mediated device hotplug.
> + * @start:		Called to initiate mediated device initialization
> + *			process in physical device's driver when VM boots before
> + *			qemu starts.
> + *			@uuid: VM's UUID which is booting.
> + *			Returns integer: success (0) or error (< 0)
> + * @shutdown:		Called to teardown mediated device related resources for
> + *			the VM
> + *			@uuid: VM's UUID which is shutting down .
> + *			Returns integer: success (0) or error (< 0)
> + * @read:		Read emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: read buffer
> + *			@count: number bytes to read
> + *			@address_space: specifies for which address
> + *			space the request is: pci_config_space, IO
> + *			register space or MMIO space.

Seems like I asked before and it's no more clear in the code, how do we
handle multiple spaces for various types?  ie. a device might have
multiple MMIO spaces.

> + *			@pos: offset from base address.

What's the base address, zero?

> + *			Returns number of bytes read on success or error.
> + * @write:		Write emulation callback
> + *			@mdev: mediated device structure
> + *			@buf: write buffer
> + *			@count: number bytes to be written
> + *			@address_space: specifies for which address space the
> + *			request is: pci_config_space, IO register space or MMIO
> + *			space.
> + *			@pos: offset from base address.
> + *			Returns number of bytes written on success or error.
> + * @set_irqs:		Called to send about interrupts configuration
> + *			information that VMM sets.
> + *			@mdev: mediated device structure
> + *			@flags, index, start, count and *data : same as that of
> + *			struct vfio_irq_set of VFIO_DEVICE_SET_IRQS API.
> + * @get_region_info:	Called to get BAR size and flags of mediated device.
> + *			@mdev: mediated device structure
> + *			@region_index: VFIO region index
> + *			@region_info: output, returns size and flags of
> + *				      requested region.

I don't see how vfio regions indexes here map to read/write
address_space and position above.

> + *			Returns integer: success (0) or error (< 0)
> + * @validate_map_request: Validate remap pfn request
> + *			@mdev: mediated device structure
> + *			@virtaddr: target user address to start at
> + *			@pfn: physical address of kernel memory, vendor driver
> + *			      can change if required.
> + *			@size: size of map area, vendor driver can change the
> + *			       size of map area if desired.
> + *			@prot: page protection flags for this mapping, vendor
> + *			       driver can change, if required.
> + *			Returns integer: success (0) or error (< 0)

Still no invalidation interface?

> + *
> + * Physical device that support mediated device should be registered with mdev
> + * module with phy_device_ops structure.
> + */
> +
> +struct phy_device_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +
> +	int	(*supported_config)(struct device *dev, char *config);
> +	int     (*create)(struct device *dev, uuid_le uuid,
> +			  uint32_t instance, char *mdev_params);
> +	int     (*destroy)(struct device *dev, uuid_le uuid,
> +			   uint32_t instance);
> +	int     (*start)(uuid_le uuid);
> +	int     (*shutdown)(uuid_le uuid);
> +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> +			enum mdev_emul_space address_space, loff_t pos);
> +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> +			 enum mdev_emul_space address_space, loff_t pos);
> +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> +			    unsigned int index, unsigned int start,
> +			    unsigned int count, void *data);
> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> +				 struct pci_region_info *region_info);

*pci*_region_info??

> +	int	(*validate_map_request)(struct mdev_device *vdev,
> +					unsigned long virtaddr,
> +					unsigned long *pfn, unsigned long *size,
> +					pgprot_t *prot);
> +};
> +
> +/*
> + * Physical Device
> + */
> +struct phy_device {
> +	struct device                   *dev;
> +	const struct phy_device_ops     *ops;
> +	struct list_head                next;
> +};

I would really like to be able to use the mediated device interface to
create a purely virtual device, is the expectation that my physical
device interface would create a virtual struct device which would
become the parent and control point in sysfs for creating all the mdev
devices? Should we be calling this a host_device or mdev_parent_dev in
that case since there's really no requirement that it be a physical
device?  Thanks,

Alex

> +
> +/**
> + * struct mdev_driver - Mediated device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @match: called when new device or driver is added for this bus. Return 1 if
> + *	   given device can be handled by given driver and zero otherwise.
> + * @driver: device driver structure
> + *
> + **/
> +struct mdev_driver {
> +	const char *name;
> +	int  (*probe)(struct device *dev);
> +	void (*remove)(struct device *dev);
> +	int  (*match)(struct device *dev);
> +	struct device_driver driver;
> +};
> +
> +static inline struct mdev_driver *to_mdev_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct mdev_driver, driver) : NULL;
> +}
> +
> +static inline struct mdev_device *to_mdev_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct mdev_device, dev) : NULL;
> +}
> +
> +static inline struct mdev_device *mdev_get_device(struct mdev_device *vdev)
> +{
> +	return (vdev && get_device(&vdev->dev)) ? vdev : NULL;
> +}
> +
> +static inline  void mdev_put_device(struct mdev_device *vdev)
> +{
> +	if (vdev)
> +		put_device(&vdev->dev);
> +}
> +
> +extern struct bus_type mdev_bus_type;
> +
> +#define dev_is_mdev(d) ((d)->bus == &mdev_bus_type)
> +
> +extern int  mdev_register_device(struct device *dev,
> +				 const struct phy_device_ops *ops);
> +extern void mdev_unregister_device(struct device *dev);
> +
> +extern int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
> +extern void mdev_unregister_driver(struct mdev_driver *drv);
> +
> +extern int mdev_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
> +				uint32_t len, uint32_t flags);
> +
> +extern struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
> +
> +#endif /* MDEV_H */

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-05-25 22:39     ` [Qemu-devel] " Alex Williamson
@ 2016-05-26  9:03       ` Kirti Wankhede
  -1 siblings, 0 replies; 92+ messages in thread
From: Kirti Wankhede @ 2016-05-26  9:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

Thanks Alex.

I'll consider all the nits and fix those in next version of patch.

More below:

On 5/26/2016 4:09 AM, Alex Williamson wrote:
> On Wed, 25 May 2016 01:28:15 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>

...

>> +
>> +config MDEV
>> +    tristate "Mediated device driver framework"
>> +    depends on VFIO
>> +    default n
>> +    help
>> +        MDEV provides a framework to virtualize device without SR-IOV cap
>> +        See Documentation/mdev.txt for more details.
>
> I don't see that file anywhere in this series.

Yes, I missed this file in this patch. I'll add it in the next version.
Since the mdev module has moved into the vfio directory, should I place
this file there as Documentation/vfio/mdev.txt, or keep the mdev
documentation within vfio.txt itself?

>> +	if (phy_dev) {
>> +		mutex_lock(&phy_devices.list_lock);
>> +
>> +		/*
>> +		* If vendor driver doesn't return success that means vendor
>> +		* driver doesn't support hot-unplug
>> +		*/
>> +		if (phy_dev->ops->destroy) {
>> +			if (phy_dev->ops->destroy(phy_dev->dev, mdevice->uuid,
>> +						  mdevice->instance)) {
>> +				mutex_unlock(&phy_devices.list_lock);
>> +				return;
>> +			}
>> +		}
>> +
>> +		mdev_remove_attribute_group(&mdevice->dev,
>> +					    phy_dev->ops->mdev_attr_groups);
>> +		mdevice->phy_dev = NULL;
>> +		mutex_unlock(&phy_devices.list_lock);
>
> Locking here appears arbitrary, how does the above code interact with
> phy_devices.dev_list?
>

Sorry for not being clear about phy_devices.list_lock; I probably
shouldn't have named it 'list_lock'. This lock also synchronizes
register_device and unregister_device with the physical-device-specific
callbacks: supported_config, create, destroy, start and shutdown.
supported_config, create and destroy are per-phy_device callbacks, while
start and shutdown can refer to multiple phy_devices indirectly when
there are multiple mdev devices of the same type on different physical
devices. There could be a race between the start callback and destroy or
unregister_device. I'm revisiting this lock and will look at using a
per-phy_device lock for the phy_device-specific callbacks.

>> +struct mdev_device {
>> +	struct kref		kref;
>> +	struct device		dev;
>> +	struct phy_device	*phy_dev;
>> +	struct iommu_group	*group;
>> +	void			*iommu_data;
>> +	uuid_le			uuid;
>> +	uint32_t		instance;
>> +	void			*driver_data;
>> +	struct mutex		ops_lock;
>> +	struct list_head	next;
>> +};
>
> Could this be in the private header?  Seems like this should be opaque
> outside of mdev core.
>

No, this structure is passed to the vendor driver in the mediated device
callback functions so that the vendor driver can identify the mdev
device, similar to the pci_dev structure in the PCI bus subsystem. (I'll
remove kref, which is not used at all.)

>> + * @read:		Read emulation callback
>> + *			@mdev: mediated device structure
>> + *			@buf: read buffer
>> + *			@count: number bytes to read
>> + *			@address_space: specifies for which address
>> + *			space the request is: pci_config_space, IO
>> + *			register space or MMIO space.
>
> Seems like I asked before and it's no more clear in the code, how do we
> handle multiple spaces for various types?  ie. a device might have
> multiple MMIO spaces.
>
>> + *			@pos: offset from base address.

Sorry, I updated the code but forgot to update the comment here.
pos = base_address + offset
(it's not 'pos' anymore; I'll rename it to addr)

The vendor driver knows the base address and size of each MMIO space,
so it can identify the MMIO space from addr.
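
A minimal userspace sketch of that lookup, assuming the vendor driver
keeps a table of region base addresses and sizes.  struct mmio_region
and region_for_addr() are hypothetical names for illustration and do
not appear in the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one emulated MMIO region. */
struct mmio_region {
	uint64_t base;
	uint64_t size;
};

/*
 * Return the index of the region containing [addr, addr + count) and
 * store the offset into that region in *offset.  Returns -1 if the
 * access falls outside every region or runs past the end of one.
 */
static int region_for_addr(const struct mmio_region *r, size_t nr,
			   uint64_t addr, uint64_t count, uint64_t *offset)
{
	size_t i;

	assert(r != NULL && offset != NULL);

	for (i = 0; i < nr; i++) {
		if (addr < r[i].base || addr - r[i].base >= r[i].size)
			continue;
		/* The access starts inside region i; check that it fits. */
		if (count > r[i].size - (addr - r[i].base))
			return -1;
		*offset = addr - r[i].base;
		return (int)i;
	}
	return -1;
}
```

The comparisons are written as subtractions from the base so that
addr + count cannot overflow for regions near the top of the address
space.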

>> +/*
>> + * Physical Device
>> + */
>> +struct phy_device {
>> +	struct device                   *dev;
>> +	const struct phy_device_ops     *ops;
>> +	struct list_head                next;
>> +};
>
> I would really like to be able to use the mediated device interface to
> create a purely virtual device, is the expectation that my physical
> device interface would create a virtual struct device which would
> become the parent and control point in sysfs for creating all the mdev
> devices? Should we be calling this a host_device or mdev_parent_dev in
> that case since there's really no requirement that it be a physical
> device?

Makes sense. I'll rename it to parent_device.

Thanks,
Kirti.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver
@ 2016-05-26  9:03       ` Kirti Wankhede
  0 siblings, 0 replies; 92+ messages in thread
From: Kirti Wankhede @ 2016-05-26  9:03 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv, bjsdjshi

Thanks Alex.

I'll consider all the nits and fix those in next version of patch.

More below:

On 5/26/2016 4:09 AM, Alex Williamson wrote:
> On Wed, 25 May 2016 01:28:15 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>

...

>> +
>> +config MDEV
>> +    tristate "Mediated device driver framework"
>> +    depends on VFIO
>> +    default n
>> +    help
>> +        MDEV provides a framework to virtualize device without
SR-IOV cap
>> +        See Documentation/mdev.txt for more details.
>
> I don't see that file anywhere in this series.

Yes, missed this file in this patch. I'll add it in next version of patch.
Since mdev module is moved in vfio directory, should I place this file
in vfio directory, Documentation/vfio/mdev.txt? or keep documentation of
mdev module within vfio.txt itself?

>> +	if (phy_dev) {
>> +		mutex_lock(&phy_devices.list_lock);
>> +
>> +		/*
>> +		* If vendor driver doesn't return success that means vendor
>> +		* driver doesn't support hot-unplug
>> +		*/
>> +		if (phy_dev->ops->destroy) {
>> +			if (phy_dev->ops->destroy(phy_dev->dev, mdevice->uuid,
>> +						  mdevice->instance)) {
>> +				mutex_unlock(&phy_devices.list_lock);
>> +				return;
>> +			}
>> +		}
>> +
>> +		mdev_remove_attribute_group(&mdevice->dev,
>> +					    phy_dev->ops->mdev_attr_groups);
>> +		mdevice->phy_dev = NULL;
>> +		mutex_unlock(&phy_devices.list_lock);
>
> Locking here appears arbitrary, how does the above code interact with
> phy_devices.dev_list?
>

Sorry for not being clear about phy_devices.list_lock, probably I
shouldn't have named it 'list_lock'. This lock is also to synchronize
register_device & unregister_device and physical device specific
callbacks: supported_config, create, destroy, start and shutdown.
Although supported_config, create and destroy are per phy_device
specific callbacks while start and shutdown could refer to multiple
phy_devices indirectly when there are multiple mdev devices of same type
on different physical devices. There could be race condition in start
callback and destroy & unregister_device. I'm revisiting this lock again
and will see to use per phy device lock for phy_device specific callbacks.

>> +struct mdev_device {
>> +	struct kref		kref;
>> +	struct device		dev;
>> +	struct phy_device	*phy_dev;
>> +	struct iommu_group	*group;
>> +	void			*iommu_data;
>> +	uuid_le			uuid;
>> +	uint32_t		instance;
>> +	void			*driver_data;
>> +	struct mutex		ops_lock;
>> +	struct list_head	next;
>> +};
>
> Could this be in the private header?  Seems like this should be opaque
> outside of mdev core.
>

No, this structure is used in mediated device call back functions to
vendor driver so that vendor driver could identify mdev device, similar
to pci_dev structure in pci bus subsystem. (I'll remove kref which is
not being used at all.)

>> + * @read:		Read emulation callback
>> + *			@mdev: mediated device structure
>> + *			@buf: read buffer
>> + *			@count: number bytes to read
>> + *			@address_space: specifies for which address
>> + *			space the request is: pci_config_space, IO
>> + *			register space or MMIO space.
>
> Seems like I asked before and it's no more clear in the code, how do we
> handle multiple spaces for various types?  ie. a device might have
> multiple MMIO spaces.
>
>> + *			@pos: offset from base address.

Sorry, I updated the code but missed updating the comment here.
pos = base_address + offset
(it's not 'pos' anymore, I will rename it to addr)

The vendor driver knows the base addresses and sizes of its MMIO
spaces, so it can identify the MMIO space from addr.
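
As an illustration only, here is a minimal userspace sketch of how a
vendor driver might map addr back to one of its MMIO spaces. The
struct mmio_space type and find_mmio_space() are hypothetical names
made up for this example; they are not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one emulated MMIO space. */
struct mmio_space {
	uint64_t base;	/* base address of the space */
	uint64_t size;	/* length in bytes */
};

/*
 * Return the index of the space that fully contains the access
 * [addr, addr + count), or -1 if no single space contains it.
 */
static int find_mmio_space(const struct mmio_space *spaces, size_t nr,
			   uint64_t addr, uint64_t count)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		const struct mmio_space *s = &spaces[i];

		/* Written to avoid overflow in addr + count. */
		if (addr >= s->base && count <= s->size &&
		    addr - s->base <= s->size - count)
			return (int)i;
	}
	return -1;
}
```

An access that straddles the end of a space is rejected rather than
being attributed to the space it starts in.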

>> +/*
>> + * Physical Device
>> + */
>> +struct phy_device {
>> +	struct device                   *dev;
>> +	const struct phy_device_ops     *ops;
>> +	struct list_head                next;
>> +};
>
> I would really like to be able to use the mediated device interface to
> create a purely virtual device, is the expectation that my physical
> device interface would create a virtual struct device which would
> become the parent and control point in sysfs for creating all the mdev
> devices? Should we be calling this a host_device or mdev_parent_dev in
> that case since there's really no requirement that it be a physical
> device?

Makes sense. I'll rename it to parent_device.

Thanks,
Kirti.

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-05-26  9:03       ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-26 14:06         ` Alex Williamson
  -1 siblings, 0 replies; 92+ messages in thread
From: Alex Williamson @ 2016-05-26 14:06 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: shuai.ruan, kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel,
	pbonzini, bjsdjshi, zhiyuan.lv

On Thu, 26 May 2016 14:33:39 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Thanks Alex.
> 
> I'll consider all the nits and fix those in next version of patch.
> 
> More below:
> 
> On 5/26/2016 4:09 AM, Alex Williamson wrote:
> > On Wed, 25 May 2016 01:28:15 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >  
> 
> ...
> 
> >> +
> >> +config MDEV
> >> +    tristate "Mediated device driver framework"
> >> +    depends on VFIO
> >> +    default n
> >> +    help
> >> +        MDEV provides a framework to virtualize device without SR-IOV cap
> >> +        See Documentation/mdev.txt for more details.  
> >
> > I don't see that file anywhere in this series.  
> 
> Yes, missed this file in this patch. I'll add it in next version of patch.
> Since mdev module is moved in vfio directory, should I place this file
> in vfio directory, Documentation/vfio/mdev.txt? or keep documentation of
> mdev module within vfio.txt itself?

Maybe just call it vfio-mediated-device.txt

> >> +	if (phy_dev) {
> >> +		mutex_lock(&phy_devices.list_lock);
> >> +
> >> +		/*
> >> +		* If vendor driver doesn't return success that means vendor
> >> +		* driver doesn't support hot-unplug
> >> +		*/
> >> +		if (phy_dev->ops->destroy) {
> >> +			if (phy_dev->ops->destroy(phy_dev->dev, mdevice->uuid,
> >> +						  mdevice->instance)) {
> >> +				mutex_unlock(&phy_devices.list_lock);
> >> +				return;
> >> +			}
> >> +		}
> >> +
> >> +		mdev_remove_attribute_group(&mdevice->dev,
> >> +					    phy_dev->ops->mdev_attr_groups);
> >> +		mdevice->phy_dev = NULL;
> >> +		mutex_unlock(&phy_devices.list_lock);  
> >
> > Locking here appears arbitrary, how does the above code interact with
> > phy_devices.dev_list?
> >  
> 
> Sorry for not being clear about phy_devices.list_lock, probably I
> shouldn't have named it 'list_lock'. This lock is also to synchronize
> register_device & unregister_device and physical device specific
> callbacks: supported_config, create, destroy, start and shutdown.
> Although supported_config, create and destroy are per phy_device
> specific callbacks while start and shutdown could refer to multiple
> phy_devices indirectly when there are multiple mdev devices of same type
> on different physical devices. There could be race condition in start
> callback and destroy & unregister_device. I'm revisiting this lock again
> and will see to use per phy device lock for phy_device specific callbacks.
> 
> 
> >> +struct mdev_device {
> >> +	struct kref		kref;
> >> +	struct device		dev;
> >> +	struct phy_device	*phy_dev;
> >> +	struct iommu_group	*group;
> >> +	void			*iommu_data;
> >> +	uuid_le			uuid;
> >> +	uint32_t		instance;
> >> +	void			*driver_data;
> >> +	struct mutex		ops_lock;
> >> +	struct list_head	next;
> >> +};  
> >
> > Could this be in the private header?  Seems like this should be opaque
> > outside of mdev core.
> >  
> 
> No, this structure is used in mediated device call back functions to
> vendor driver so that vendor driver could identify mdev device, similar
> to pci_dev structure in pci bus subsystem. (I'll remove kref which is
> not being used at all.)

Personally I'd prefer to see more use of reference counting and less
locking, especially since the locking is mostly ineffective in this
version.
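
To show the kind of reference counting meant here, below is a small
userspace sketch in the spirit of the kernel's kref, written with C11
atomics. All names (mdev_obj, mdev_obj_get, mdev_obj_put,
count_release) are hypothetical and only illustrate the pattern; the
object is released by whoever drops the last reference, so no global
lock is needed to decide when teardown happens.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for a refcounted device object. */
struct mdev_obj {
	atomic_int refcount;
	void (*release)(struct mdev_obj *obj);
};

/* Example release callback that only counts how often it ran. */
static int released_count;

static void count_release(struct mdev_obj *obj)
{
	(void)obj;
	released_count++;
}

static void mdev_obj_init(struct mdev_obj *obj,
			  void (*release)(struct mdev_obj *obj))
{
	atomic_init(&obj->refcount, 1);	/* creator holds the first reference */
	obj->release = release;
}

static struct mdev_obj *mdev_obj_get(struct mdev_obj *obj)
{
	atomic_fetch_add(&obj->refcount, 1);
	return obj;
}

/* Drop a reference; return 1 if this call released the object. */
static int mdev_obj_put(struct mdev_obj *obj)
{
	if (atomic_fetch_sub(&obj->refcount, 1) == 1) {
		obj->release(obj);
		return 1;
	}
	return 0;
}
```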

> >> + * @read:		Read emulation callback
> >> + *			@mdev: mediated device structure
> >> + *			@buf: read buffer
> >> + *			@count: number bytes to read
> >> + *			@address_space: specifies for which address
> >> + *			space the request is: pci_config_space, IO
> >> + *			register space or MMIO space.  
> >
> > Seems like I asked before and it's no more clear in the code, how do we
> > handle multiple spaces for various types?  ie. a device might have
> > multiple MMIO spaces.
> >  
> >> + *			@pos: offset from base address.  
> 
> Sorry, updated the code but missed to update comment here.
> pos = base_address + offset
> (its not 'pos' anymore, will rename it to addr)
> 
> so vendor driver is aware about base addresses of multiple MMIO spaces
> and its size, they can identify MMIO space based on addr.

Why not let the vendor driver provide vfio_region_info directly,
including the offset within the device file descriptor, so that the
mediated device core simply passes read/write through without caring
what the address space is?  Thanks,

Alex
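
A rough sketch of the idea, under the assumption that the region index
is encoded in the high bits of the file offset (vfio-pci uses a similar
split). The macro and helper names below are made up for illustration
and are not an existing API:

```c
#include <stdint.h>

/* Assumed split: high bits select the region, low 40 bits are the offset. */
#define REGION_SHIFT	40
#define REGION_MASK	((UINT64_C(1) << REGION_SHIFT) - 1)

/* File offset the driver would report for region 'index'. */
static inline uint64_t region_to_fd_offset(unsigned int index)
{
	return (uint64_t)index << REGION_SHIFT;
}

/* Region selected by a read/write at file position 'pos'. */
static inline unsigned int fd_offset_to_region(uint64_t pos)
{
	return (unsigned int)(pos >> REGION_SHIFT);
}

/* Offset within that region. */
static inline uint64_t fd_offset_in_region(uint64_t pos)
{
	return pos & REGION_MASK;
}
```

With something like this the core never needs an address_space
argument: it forwards pos unchanged and the vendor driver decodes it.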
 
> >> +/*
> >> + * Physical Device
> >> + */
> >> +struct phy_device {
> >> +	struct device                   *dev;
> >> +	const struct phy_device_ops     *ops;
> >> +	struct list_head                next;
> >> +};  
> >
> > I would really like to be able to use the mediated device interface to
> > create a purely virtual device, is the expectation that my physical
> > device interface would create a virtual struct device which would
> > become the parent and control point in sysfs for creating all the mdev
> > devices? Should we be calling this a host_device or mdev_parent_dev in
> > that case since there's really no requirement that it be a physical
> > device?  
> 
> Makes sense. I'll rename it to parent_device.
> 
> Thanks,
> Kirti.
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-05-25 14:47       ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-27  9:00         ` Tian, Kevin
  -1 siblings, 0 replies; 92+ messages in thread
From: Tian, Kevin @ 2016-05-27  9:00 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi

> From: Kirti Wankhede
> Sent: Wednesday, May 25, 2016 10:47 PM
> 
> 
> >> +static struct devices_list {
> >> +	struct list_head    dev_list;
> >> +	struct mutex        list_lock;
> >> +} mdevices, phy_devices;
> >
> > phy_devices -> pdevices? and similarly we can use pdev/mdev
> > pair in other places...
> >
> 
> 'pdevices' sometimes also refers to 'pointer to devices' that's the
> reason I prefer to use phy_devices to represent 'physical devices'

Well, I think it is clear in this context that 'p' means 'physical',
just like the frequently used pdev for pci_dev in the PCI files...

> 
> 
> >> +static struct mdev_device *find_mdev_device(uuid_le uuid, int instance)
> >
> > can we just call it "struct mdev* or "mdevice"? "dev_device" looks
> > redundant.
> >
> 
> 'struct mdev_device' represents 'device structure for device created by
> mdev module'. Still that doesn't satisfy major folks, I'm open to change
> it.

IMO 'mdev' should be clear enough to represent a mediated device.

> 
> 
> > Sorry I may have to ask same question since I didn't get an answer yet.
> > what exactly does 'instance' mean here? since uuid is unique, why do
> > we need match instance too?
> >
> 
> 'uuid' could be UUID of a VM for whom it is created. To support multiple
> mediated devices for same VM, name should be unique. Hence we need a
> instance number to identify each mediated device uniquely in one VM.

If UUID alone cannot universally identify a mediated device, what's the
point of explicitly tagging it in the kernel? Either we assign a new UUID
to the mdev itself, or, possibly better, define the identifier as a
string. The management stack can then pass any ID/name in string format
that is sufficient to identify the mdev for its own purposes, and in
this framework we just do a simple string match...
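
A minimal sketch of what such a string match could look like, in plain
userspace C. The mdev_entry type, the MDEV_NAME_LEN limit and
find_mdev_by_name() are hypothetical and only serve the example:

```c
#include <stddef.h>
#include <string.h>

#define MDEV_NAME_LEN 64	/* assumed limit, not from the patch */

/* Hypothetical list node holding an opaque, caller-chosen name. */
struct mdev_entry {
	char name[MDEV_NAME_LEN];
	struct mdev_entry *next;
};

/* Return the entry whose name matches exactly, or NULL. */
static struct mdev_entry *find_mdev_by_name(struct mdev_entry *head,
					    const char *name)
{
	struct mdev_entry *e;

	for (e = head; e; e = e->next)
		if (strncmp(e->name, name, MDEV_NAME_LEN) == 0)
			return e;
	return NULL;
}
```

The framework treats the name as opaque, so a management stack could
pass "<VM UUID>-<n>" or any other scheme without a separate instance
number.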

> 
> 
> 
> >> +		if (phy_dev->ops->destroy) {
> >> +			if (phy_dev->ops->destroy(phy_dev->dev, mdevice->uuid,
> >> +						  mdevice->instance)) {
> >> +				mutex_unlock(&phy_devices.list_lock);
> >
> > a warning message is preferred. Also better to return -EBUSY here.
> >
> 
> mdev_destroy_device() is called from 2 paths, one is sysfs mdev_destroy
> and mdev_unregister_device(). For the later case, return from here will
> any ways ignored. mdev_unregister_device() is called from the remove
> function of physical device and that doesn't care about return error, it
> just removes the device from subsystem.

Regardless of whether the caller handles the error, this function itself
should return an error, since that makes sense in the other path, e.g.
sysfs, to let the user know what's happening.
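
Roughly what I have in mind, as a simplified userspace sketch. The
fake_mdev type and both functions are hypothetical stand-ins for the
real structures and callbacks, and the -EBUSY value is just an example
of what a vendor driver might return:

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical, simplified device: 'in_use' models a vendor driver
 * that refuses hot-unplug while the device is busy.
 */
struct fake_mdev {
	int in_use;
	int destroyed;
};

static int vendor_destroy(struct fake_mdev *m)
{
	return m->in_use ? -EBUSY : 0;
}

/* Warn and propagate the vendor driver's error instead of hiding it. */
static int mdev_destroy(struct fake_mdev *m)
{
	int ret = vendor_destroy(m);

	if (ret) {
		fprintf(stderr, "mdev: vendor driver refused destroy (%d)\n",
			ret);
		return ret;
	}
	m->destroyed = 1;
	return 0;
}
```

The unregister path is free to ignore the return value, while the sysfs
path can hand it back to the user.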

> 
> >> +				return;
> >> +			}
> >> +		}
> >> +
> >> +		mdev_remove_attribute_group(&mdevice->dev,
> >> +					    phy_dev->ops->mdev_attr_groups);
> >> +		mdevice->phy_dev = NULL;
> >
> > Am I missing something here? You didn't remove this mdev node from
> > the list, and below...
> >
> 
> device_unregister() calls put_device(dev) and if refcount is zero its
> release function is called, which is mdev_device_release(), that is
> hooked during device_register(). This node is removed from list from
> mdev_device_release().

I'm not sure whether there'll be a race condition here, since you
leave a defunct mdev on the list...

> 
> >> +	phy_dev->dev = dev;
> >> +	phy_dev->ops = ops;
> >> +
> >> +	mutex_lock(&phy_devices.list_lock);
> >> +	ret = mdev_create_sysfs_files(dev);
> >> +	if (ret)
> >> +		goto add_sysfs_error;
> >> +
> >> +	ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> >> +	if (ret)
> >> +		goto add_group_error;
> >
> > any reason to include sysfs operations inside the mutex which is
> > purely about phy_devices list?
> >
> 
> dev_attr_groups attribute is for physical device, hence inside
> phy_devices.list_lock.

phy_devices.list_lock should only protect the list, i.e. it should be
taken when you add a new phy_device node once the node is initialized
and ready. sysfs attribute setup is still part of initialization.
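
In other words, something along these lines. This is a userspace sketch
with a pthread mutex standing in for the kernel mutex; parent_dev,
setup_attrs() and register_parent() are hypothetical names, and
setup_attrs() merely stands in for the sysfs setup:

```c
#include <pthread.h>
#include <stddef.h>

struct parent_dev {
	int attrs_ready;		/* stands in for sysfs setup */
	struct parent_dev *next;
};

static struct parent_dev *parent_list;
static pthread_mutex_t parent_list_lock = PTHREAD_MUTEX_INITIALIZER;

static int setup_attrs(struct parent_dev *p)
{
	p->attrs_ready = 1;
	return 0;
}

static int register_parent(struct parent_dev *p)
{
	int ret;

	/* No list lock needed yet: p is not visible to anyone else. */
	ret = setup_attrs(p);
	if (ret)
		return ret;

	/* Hold the lock only while the list itself is modified. */
	pthread_mutex_lock(&parent_list_lock);
	p->next = parent_list;
	parent_list = p;
	pthread_mutex_unlock(&parent_list_lock);
	return 0;
}
```

On failure nothing has been published, so the error path does not have
to drop the lock or undo a list insertion.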


Thanks
Kevin

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [RFC PATCH v4 2/3] VFIO driver for mediated PCI device
  2016-05-25 13:04       ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-27 10:03         ` Tian, Kevin
  -1 siblings, 0 replies; 92+ messages in thread
From: Tian, Kevin @ 2016-05-27 10:03 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi

> From: Kirti Wankhede
> Sent: Wednesday, May 25, 2016 9:05 PM
> 
> 
> >> +{
> >> +	int ret = -EINVAL;
> >> +	struct phy_device *phy_dev = mdevice->phy_dev;
> >> +
> >> +	if (dev_is_pci(phy_dev->dev) && phy_dev->ops->get_region_info) {
> >> +		mutex_lock(&mdevice->ops_lock);
> >> +		ret = phy_dev->ops->get_region_info(mdevice, index,
> >> +						    vfio_region_info);
> >> +		mutex_unlock(&mdevice->ops_lock);
> >> +	}
> >> +	return ret;
> >> +}
> >> +
> >> +static int mdev_read_base(struct vfio_mdevice *vdev)
> >
> > similar as earlier comment - vdev or mdev?
> >
> 
> Here vdev is of type 'vfio_mdevice', that's why vdev, mdev doesn't suit
> here. Changing it to 'vmdev' in next patch set.
> 

'vmdev' looks more confusing... :-)

Alex, can you share your thoughts here?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support]
  2016-05-25 13:43     ` [Qemu-devel] " Alex Williamson
@ 2016-05-27 11:02       ` Tian, Kevin
  -1 siblings, 0 replies; 92+ messages in thread
From: Tian, Kevin @ 2016-05-27 11:02 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, May 25, 2016 9:44 PM
> 
> On Wed, 25 May 2016 07:13:58 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > Sent: Wednesday, May 25, 2016 3:58 AM
> > >
> > > This series adds Mediated device support to v4.6 Linux host kernel. Purpose
> > > of this series is to provide a common interface for mediated device
> > > management that can be used by different devices. This series introduces
> > > Mdev core module that create and manage mediated devices, VFIO based driver
> > > for mediated PCI devices that are created by Mdev core module and update
> > > VFIO type1 IOMMU module to support mediated devices.
> >
> > Thanks. "Mediated device" is more generic than previous one. :-)
> >
> > >
> > > What's new in v4?
> > > - Renamed 'vgpu' module to 'mdev' module that represent generic term
> > >   'Mediated device'.
> > > - Moved mdev directory to drivers/vfio directory as this is the extension
> > >   of VFIO APIs for mediated devices.
> > > - Updated mdev driver to be flexible to register multiple types of drivers
> > >   to mdev_bus_type bus.
> > > - Updated mdev core driver with mdev_put_device() and mdev_get_device() for
> > >   mediated devices.
> > >
> > >
> >
> > Just curious. In this version you move the whole mdev core under
> > VFIO now. Sorry if I missed any agreement on this change. IIRC Alex
> > doesn't want VFIO to manage mdev life-cycle directly. Instead VFIO is
> > just a mdev driver on created mediated devices....
> 
> I did originally suggest keeping them separate, but as we've progressed
> through the implementation, it's become more clear that the mediated
> device interface is very much tied to the vfio interface, acting mostly
> as a passthrough.  So I thought it made sense to pull them together.
> Still open to discussion of course.  Thanks,
> 

The main benefit of maintaining a separate mdev framework, IMHO, is
to allow better support of both KVM and Xen. Xen doesn't work with VFIO
today, because other VMs' memory is not allocated from Dom0, which
means VFIO within Dom0 doesn't have the view/permission to control
isolation for other VMs.

However, after some thought, it might not be a big problem to
combine VFIO and mdev, if we extend Xen to use VFIO just for
resource enumeration. In such a model, VFIO still behaves as the single
kernel portal for enumerating mediated devices to user space, but gives
up permission control to Qemu, which will request a secure agent - the
Xen hypervisor - to ensure isolation of VM usage on the mediated device
(including EPT/IOMMU configuration).

I'm not sure whether VFIO can support this usage today. It is somewhat
similar to channel I/O passthrough on s390, where we also rely on Qemu
to mediate ccw commands to ensure isolation. Maybe only a slight
extension is required (e.g. not assuming that some API must be invoked).
Of course the Qemu side vfio code also needs some changes. If this can
work, at least we can first use it as the enumeration interface for
mediated devices in Xen. In the future it may be extended to cover
normal Xen PCI assignment as well, instead of using sysfs to read PCI
resources as is done today.

If the above works, then we have a sound plan to enable mediated devices
based on VFIO first for KVM, and then extend to Xen with reasonable
effort.

What do you think?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support]
  2016-05-27 11:02       ` [Qemu-devel] " Tian, Kevin
@ 2016-05-27 14:54         ` Alex Williamson
  -1 siblings, 0 replies; 92+ messages in thread
From: Alex Williamson @ 2016-05-27 14:54 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi

On Fri, 27 May 2016 11:02:46 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, May 25, 2016 9:44 PM
> > 
> > On Wed, 25 May 2016 07:13:58 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > > Sent: Wednesday, May 25, 2016 3:58 AM
> > > >
> > > > This series adds Mediated device support to v4.6 Linux host kernel. Purpose
> > > > of this series is to provide a common interface for mediated device
> > > > management that can be used by different devices. This series introduces
> > > > Mdev core module that create and manage mediated devices, VFIO based driver
> > > > for mediated PCI devices that are created by Mdev core module and update
> > > > VFIO type1 IOMMU module to support mediated devices.  
> > >
> > > Thanks. "Mediated device" is more generic than previous one. :-)
> > >  
> > > >
> > > > What's new in v4?
> > > > - Renamed 'vgpu' module to 'mdev' module that represent generic term
> > > >   'Mediated device'.
> > > > - Moved mdev directory to drivers/vfio directory as this is the extension
> > > >   of VFIO APIs for mediated devices.
> > > > - Updated mdev driver to be flexible to register multiple types of drivers
> > > >   to mdev_bus_type bus.
> > > > - Updated mdev core driver with mdev_put_device() and mdev_get_device() for
> > > >   mediated devices.
> > > >
> > > >  
> > >
> > > Just curious. In this version you move the whole mdev core under
> > > VFIO now. Sorry if I missed any agreement on this change. IIRC Alex
> > > doesn't want VFIO to manage mdev life-cycle directly. Instead VFIO is
> > > just a mdev driver on created mediated devices....  
> > 
> > I did originally suggest keeping them separate, but as we've progressed
> > through the implementation, it's become more clear that the mediated
> > device interface is very much tied to the vfio interface, acting mostly
> > as a passthrough.  So I thought it made sense to pull them together.
> > Still open to discussion of course.  Thanks,
> >   
> 
> The main benefit of maintaining a separate mdev framework, IMHO, is
> to allow better support of both KVM and Xen. Xen doesn't work with VFIO
> today, because other VM's memory is not allocated from Dom0 which
> means VFIO within Dom0 doesn't has view/permission to control isolation 
> for other VMs.

Isn't this just a matter of the vfio iommu model selected?  There could
be a vfio-iommu-xen that knows how to do the grant calls.
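
As a rough illustration of what "selecting an iommu model" could look like,
here is a minimal userspace model in C. It is only a sketch: the struct,
the registry and both backends are hypothetical names invented for this
example, and "xen" stands in for the vfio-iommu-xen idea above, which does
not exist as code. The point is that callers pick a backend by name and
never see how the mapping is actually established.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical backend interface: one implementation per IOMMU model. */
struct iommu_model_ops {
	const char *name;
	/* Returns 0 on success; how the mapping is made is backend-specific. */
	int (*dma_map)(unsigned long iova, unsigned long vaddr,
		       unsigned long size);
};

#define MAX_MODELS 4
static const struct iommu_model_ops *models[MAX_MODELS];
static size_t nr_models;

/* Add a backend to the registry; fails once the table is full. */
static int register_iommu_model(const struct iommu_model_ops *ops)
{
	if (nr_models == MAX_MODELS)
		return -1;
	models[nr_models++] = ops;
	return 0;
}

/* Look a backend up by name; NULL if no such model was registered. */
static const struct iommu_model_ops *select_iommu_model(const char *name)
{
	for (size_t i = 0; i < nr_models; i++)
		if (strcmp(models[i]->name, name) == 0)
			return models[i];
	return NULL;
}

/* Stand-in for a backend that programs the host IOMMU directly. */
static int type1_dma_map(unsigned long iova, unsigned long vaddr,
			 unsigned long size)
{
	(void)iova;
	(void)vaddr;
	return size ? 0 : -1;
}

/* Stand-in for a backend that would ask the hypervisor to map memory. */
static int xen_dma_map(unsigned long iova, unsigned long vaddr,
		       unsigned long size)
{
	(void)iova;
	(void)vaddr;
	return size ? 0 : -1;
}

static const struct iommu_model_ops type1_model = { "type1", type1_dma_map };
static const struct iommu_model_ops xen_model = { "xen", xen_dma_map };
```

Both stand-in backends only validate the size; a real backend would pin
pages and program hardware or call into the hypervisor at that point.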

> However, after some thinking I think it might not be a big problem to
> combine VFIO/mdev together, if we extend Xen to just use VFIO for
> resource enumeration. In such model, VFIO still behaves as a single 
> kernel portal to enumerate mediated devices to user space, but give up 
> permission control to Qemu which will request a secure agent - Xen
> hypervisor - to ensure isolation of VM usage on mediated device (including
> EPT/IOMMU configuration).

The whole point here is to use the vfio user api and we seem to be
progressing towards using vfio-core as a conduit where the mediated
driver api is also fairly vfio-ish.  So it seems we're really headed
towards a vfio-mediated device rather than some sort of generic mediated
driver interface.  I would object to leaving permission control to
QEMU; QEMU is just a vfio user, and there are others like DPDK.  The
kernel needs to be in charge of protecting itself and users from each
other.  QEMU can't do this, which is part of the reason that KVM has
moved to vfio rather than the pci-sysfs resource interface.
 
> I'm not sure whether VFIO can support this usage today. It is somehow 
> similar to channel io passthru in s390, where we also rely on Qemu to 
> mediate ccw commands to ensure isolation. Maybe just some slight 
> extension is required (e.g. not assume some API must be invoked). Of 
> course Qemu side vfio code also need some change. If this can work, 
> at least we can first put it as the enumeration interface for mediated 
> device in Xen. In the future it may be extended to cover normal Xen 
> PCI assignment as well instead of using sysfs to read PCI resource
> today.

The channel io proposal doesn't rely on QEMU for security either.  The
mediation occurs in the host kernel, which parses the ccw command program
and translates it, replacing the guest physical addresses with
verified and pinned host physical addresses before submitting the
program to be run.  A mediated device is policed by the mediated
vendor driver in the host kernel; QEMU is untrusted, just like any
other user.
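
The translation step described above can be sketched roughly as follows.
This is not code from the channel io proposal: the command layout, the
pinned-range table and every function name are hypothetical and heavily
simplified. It only shows the shape of the check, where each
guest-supplied buffer address is accepted only if it lies entirely inside
a range the host has already pinned, and is then rewritten to the
corresponding host physical address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of a guest memory range the host has pinned. */
struct pinned_range {
	uint64_t gpa;	/* guest physical start */
	uint64_t hpa;	/* host physical start */
	uint64_t len;	/* length in bytes */
};

/* Hypothetical, simplified channel command: opcode plus data buffer. */
struct chan_cmd {
	uint8_t op;
	uint64_t addr;	/* guest physical in the guest copy, host physical after */
	uint32_t count;	/* buffer length in bytes */
};

/* Translate one buffer; fails unless it fits wholly in one pinned range. */
static bool translate_buf(const struct pinned_range *map, size_t nmap,
			  uint64_t gpa, uint32_t count, uint64_t *hpa)
{
	for (size_t i = 0; i < nmap; i++) {
		const struct pinned_range *r = &map[i];

		if (gpa < r->gpa)
			continue;
		uint64_t off = gpa - r->gpa;
		/* Compare against the remaining length to avoid overflow. */
		if (off <= r->len && count <= r->len - off) {
			*hpa = r->hpa + off;
			return true;
		}
	}
	return false;
}

/*
 * Build a host-side copy of the guest program with translated addresses.
 * Returns 0 on success, -1 as soon as one command fails validation, in
 * which case the caller must discard 'host' and not submit anything.
 */
static int translate_program(const struct pinned_range *map, size_t nmap,
			     const struct chan_cmd *guest,
			     struct chan_cmd *host, size_t ncmd)
{
	for (size_t i = 0; i < ncmd; i++) {
		uint64_t hpa;

		if (!translate_buf(map, nmap, guest[i].addr,
				   guest[i].count, &hpa))
			return -1;
		host[i] = guest[i];
		host[i].addr = hpa;
	}
	return 0;
}
```

The guest's copy of the program is never modified or run directly; only
the validated host copy would be submitted, which is why the user (QEMU or
otherwise) does not need to be trusted.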

If xen is currently using pci-sysfs for mapping device resources, then
vfio should be directly usable for that.  That leaves the IOMMU
interfaces, such as pinning and mapping user memory and making use of the
IOMMU API.  That part of vfio is fairly modular, though IOMMU groups are
a fairly fundamental concept within the core.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 2/3] VFIO driver for mediated PCI device
  2016-05-27 10:03         ` [Qemu-devel] " Tian, Kevin
@ 2016-05-27 15:13           ` Alex Williamson
  -1 siblings, 0 replies; 92+ messages in thread
From: Alex Williamson @ 2016-05-27 15:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi

On Fri, 27 May 2016 10:03:31 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Kirti Wankhede
> > Sent: Wednesday, May 25, 2016 9:05 PM
> > 
> >   
> > >> +{
> > >> +	int ret = -EINVAL;
> > >> +	struct phy_device *phy_dev = mdevice->phy_dev;
> > >> +
> > >> +	if (dev_is_pci(phy_dev->dev) && phy_dev->ops->get_region_info) {
> > >> +		mutex_lock(&mdevice->ops_lock);
> > >> +		ret = phy_dev->ops->get_region_info(mdevice, index,
> > >> +						    vfio_region_info);
> > >> +		mutex_unlock(&mdevice->ops_lock);
> > >> +	}
> > >> +	return ret;
> > >> +}
> > >> +
> > >> +static int mdev_read_base(struct vfio_mdevice *vdev)  
> > >
> > > similar as earlier comment - vdev or mdev?
> > >  
> > 
> > Here vdev is of type 'vfio_mdevice', that's why vdev, mdev doesn't suit
> > here. Changing it to 'vmdev' in next patch set.
> >   
> 
> 'vmdev' looks more confusing... :-)
> 
> Alex, can you give your thought here?

I don't see any problem with vmdev personally.  Are you unhappy with it
because it includes 'vm'?  It seems like it has a valid rationale, so
as long as it's used consistently, I'm happy.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support]
  2016-05-27 14:54         ` [Qemu-devel] " Alex Williamson
@ 2016-05-27 22:43           ` Tian, Kevin
  -1 siblings, 0 replies; 92+ messages in thread
From: Tian, Kevin @ 2016-05-27 22:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, May 27, 2016 10:55 PM
> 
> On Fri, 27 May 2016 11:02:46 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Wednesday, May 25, 2016 9:44 PM
> > >
> > > On Wed, 25 May 2016 07:13:58 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > > > Sent: Wednesday, May 25, 2016 3:58 AM
> > > > >
> > > > > This series adds Mediated device support to v4.6 Linux host kernel. Purpose
> > > > > of this series is to provide a common interface for mediated device
> > > > > management that can be used by different devices. This series introduces
> > > > > Mdev core module that create and manage mediated devices, VFIO based driver
> > > > > for mediated PCI devices that are created by Mdev core module and update
> > > > > VFIO type1 IOMMU module to support mediated devices.
> > > >
> > > > Thanks. "Mediated device" is more generic than previous one. :-)
> > > >
> > > > >
> > > > > What's new in v4?
> > > > > - Renamed 'vgpu' module to 'mdev' module that represent generic term
> > > > >   'Mediated device'.
> > > > > - Moved mdev directory to drivers/vfio directory as this is the extension
> > > > >   of VFIO APIs for mediated devices.
> > > > > - Updated mdev driver to be flexible to register multiple types of drivers
> > > > >   to mdev_bus_type bus.
> > > > > - Updated mdev core driver with mdev_put_device() and mdev_get_device() for
> > > > >   mediated devices.
> > > > >
> > > > >
> > > >
> > > > Just curious. In this version you move the whole mdev core under
> > > > VFIO now. Sorry if I missed any agreement on this change. IIRC Alex
> > > > doesn't want VFIO to manage mdev life-cycle directly. Instead VFIO is
> > > > just a mdev driver on created mediated devices....
> > >
> > > I did originally suggest keeping them separate, but as we've progressed
> > > through the implementation, it's become more clear that the mediated
> > > device interface is very much tied to the vfio interface, acting mostly
> > > as a passthrough.  So I thought it made sense to pull them together.
> > > Still open to discussion of course.  Thanks,
> > >
> >
> > The main benefit of maintaining a separate mdev framework, IMHO, is
> > to allow better support of both KVM and Xen. Xen doesn't work with VFIO
> > today, because other VM's memory is not allocated from Dom0 which
> > means VFIO within Dom0 doesn't has view/permission to control isolation
> > for other VMs.
> 
> Isn't this just a matter of the vfio iommu model selected?  There could
> be a vfio-iommu-xen that knows how to do the grant calls.
> 
> > However, after some thinking I think it might not be a big problem to
> > combine VFIO/mdev together, if we extend Xen to just use VFIO for
> > resource enumeration. In such model, VFIO still behaves as a single
> > kernel portal to enumerate mediated devices to user space, but give up
> > permission control to Qemu which will request a secure agent - Xen
> > hypervisor - to ensure isolation of VM usage on mediated device (including
> > EPT/IOMMU configuration).
> 
> The whole point here is to use the vfio user api and we seem to be
> progressing towards using vfio-core as a conduit where the mediated
> driver api is also fairly vfio-ish.  So it seems we're really headed
> towards a vfio-mediated device rather than some sort generic mediated
> driver interface.  I would object to leaving permission control to
> QEMU, QEMU is just a vfio user, there are others like DPDK.  The kernel
> needs to be in charge of protecting itself and users from each other,
> QEMU can't do this, which is part of reason that KVM has moved to vfio
> rather than the pci-sysfs resource interface.
> 
> > I'm not sure whether VFIO can support this usage today. It is somehow
> > similar to channel io passthru in s390, where we also rely on Qemu to
> > mediate ccw commands to ensure isolation. Maybe just some slight
> > extension is required (e.g. not assume some API must be invoked). Of
> > course Qemu side vfio code also need some change. If this can work,
> > at least we can first put it as the enumeration interface for mediated
> > device in Xen. In the future it may be extended to cover normal Xen
> > PCI assignment as well instead of using sysfs to read PCI resource
> > today.
> 
> The channel io proposal doesn't rely on QEMU for security either, the
> mediation occurs in the host kernel, parsing the ccw command program,
> and doing translations to replace the guest physical addresses with
> verified and pinned host physical addresses before submitting the
> program to be run.  A mediated device is policed by the mediated
> vendor driver in the host kernel, QEMU is untrusted, just like any
> other user.
> 
> If xen is currently using pci-sysfs for mapping device resources, then
> vfio should be directly usable, which leaves the IOMMU interfaces, such
> as pinning and mapping user memory and making use of the IOMMU API,
> that part of vfio is fairly modular though IOMMU groups is a fairly
> fundamental concept within the core.  Thanks,
> 

My impression was that you don't like hypervisor-specific things in VFIO,
which makes it a bit tricky to accomplish those tasks in the kernel. If we
can add Xen-specific logic directly in VFIO (like the vfio-iommu-xen you
mentioned), the whole thing would be easier.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support]
  2016-05-27 22:43           ` [Qemu-devel] " Tian, Kevin
@ 2016-05-28 14:56             ` Alex Williamson
  -1 siblings, 0 replies; 92+ messages in thread
From: Alex Williamson @ 2016-05-28 14:56 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan, bjsdjshi

On Fri, 27 May 2016 22:43:54 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, May 27, 2016 10:55 PM
> > 
> > On Fri, 27 May 2016 11:02:46 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Wednesday, May 25, 2016 9:44 PM
> > > >
> > > > On Wed, 25 May 2016 07:13:58 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >  
> > > > > > From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> > > > > > Sent: Wednesday, May 25, 2016 3:58 AM
> > > > > >
> > > > > > This series adds Mediated device support to v4.6 Linux host kernel. Purpose
> > > > > > of this series is to provide a common interface for mediated device
> > > > > > management that can be used by different devices. This series introduces
> > > > > > Mdev core module that create and manage mediated devices, VFIO based driver
> > > > > > for mediated PCI devices that are created by Mdev core module and update
> > > > > > VFIO type1 IOMMU module to support mediated devices.  
> > > > >
> > > > > Thanks. "Mediated device" is more generic than previous one. :-)
> > > > >  
> > > > > >
> > > > > > What's new in v4?
> > > > > > - Renamed 'vgpu' module to 'mdev' module that represent generic term
> > > > > >   'Mediated device'.
> > > > > > - Moved mdev directory to drivers/vfio directory as this is the extension
> > > > > >   of VFIO APIs for mediated devices.
> > > > > > - Updated mdev driver to be flexible to register multiple types of drivers
> > > > > >   to mdev_bus_type bus.
> > > > > > - Updated mdev core driver with mdev_put_device() and mdev_get_device() for
> > > > > >   mediated devices.
> > > > > >
> > > > > >  
> > > > >
> > > > > Just curious. In this version you move the whole mdev core under
> > > > > VFIO now. Sorry if I missed any agreement on this change. IIRC Alex
> > > > > doesn't want VFIO to manage mdev life-cycle directly. Instead VFIO is
> > > > > just a mdev driver on created mediated devices....  
> > > >
> > > > I did originally suggest keeping them separate, but as we've progressed
> > > > through the implementation, it's become more clear that the mediated
> > > > device interface is very much tied to the vfio interface, acting mostly
> > > > as a passthrough.  So I thought it made sense to pull them together.
> > > > Still open to discussion of course.  Thanks,
> > > >  
> > >
> > > The main benefit of maintaining a separate mdev framework, IMHO, is
> > > to allow better support of both KVM and Xen. Xen doesn't work with VFIO
> > > today, because other VM's memory is not allocated from Dom0 which
> > > means VFIO within Dom0 doesn't has view/permission to control isolation
> > > for other VMs.  
> > 
> > Isn't this just a matter of the vfio iommu model selected?  There could
> > be a vfio-iommu-xen that knows how to do the grant calls.
> >   
> > > However, after some thinking I think it might not be a big problem to
> > > combine VFIO/mdev together, if we extend Xen to just use VFIO for
> > > resource enumeration. In such model, VFIO still behaves as a single
> > > kernel portal to enumerate mediated devices to user space, but give up
> > > permission control to Qemu which will request a secure agent - Xen
> > > hypervisor - to ensure isolation of VM usage on mediated device (including
> > > EPT/IOMMU configuration).  
> > 
> > The whole point here is to use the vfio user api and we seem to be
> > progressing towards using vfio-core as a conduit where the mediated
> > driver api is also fairly vfio-ish.  So it seems we're really headed
> > towards a vfio-mediated device rather than some sort generic mediated
> > driver interface.  I would object to leaving permission control to
> > QEMU, QEMU is just a vfio user, there are others like DPDK.  The kernel
> > needs to be in charge of protecting itself and users from each other,
> > QEMU can't do this, which is part of reason that KVM has moved to vfio
> > rather than the pci-sysfs resource interface.
> >   
> > > I'm not sure whether VFIO can support this usage today. It is somehow
> > > similar to channel io passthru in s390, where we also rely on Qemu to
> > > mediate ccw commands to ensure isolation. Maybe just some slight
> > > extension is required (e.g. not assume some API must be invoked). Of
> > > course Qemu side vfio code also need some change. If this can work,
> > > at least we can first put it as the enumeration interface for mediated
> > > device in Xen. In the future it may be extended to cover normal Xen
> > > PCI assignment as well instead of using sysfs to read PCI resource
> > > today.  
> > 
> > The channel io proposal doesn't rely on QEMU for security either, the
> > mediation occurs in the host kernel, parsing the ccw command program,
> > and doing translations to replace the guest physical addresses with
> > verified and pinned host physical addresses before submitting the
> > program to be run.  A mediated device is policed by the mediated
> > vendor driver in the host kernel, QEMU is untrusted, just like any
> > other user.
> > 
> > If xen is currently using pci-sysfs for mapping device resources, then
> > vfio should be directly usable, which leaves the IOMMU interfaces, such
> > as pinning and mapping user memory and making use of the IOMMU API,
> > that part of vfio is fairly modular though IOMMU groups is a fairly
> > fundamental concept within the core.  Thanks,
> >   
> 
> My impression was that you don't like hypervisor specific thing in VFIO,
> which makes it a bit tricky to accomplish those tasks in kernel. If we 
> can add Xen specific logic directly in VFIO (like vfio-iommu-xen you 
> mentioned), the whole thing would be easier.

If vfio is hosted in dom0, then Xen is the platform and we need to
interact with the hypervisor to manage the iommu.  That said, there are
aspects of vfio that do not seem to map well to a hypervisor managed
iommu or a Xen-like hypervisor.  For instance, how does dom0 manage
iommu groups, and what's the distinction between using vfio to manage a
userspace driver in dom0 and managing a device for another domain?
In the case of kvm, vfio has no dependency on kvm; there is some minor
interaction, but we're not running on kvm and it's not appropriate to
use vfio as a gateway to interact with a hypervisor that may or may not
exist.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support]
  2016-05-28 14:56             ` [Qemu-devel] " Alex Williamson
@ 2016-05-31  2:29               ` Jike Song
  -1 siblings, 0 replies; 92+ messages in thread
From: Jike Song @ 2016-05-31  2:29 UTC (permalink / raw)
  To: Alex Williamson, Tian, Kevin
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Lv, Zhiyuan, bjsdjshi

On 05/28/2016 10:56 PM, Alex Williamson wrote:
> On Fri, 27 May 2016 22:43:54 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
>>
>> My impression was that you don't like hypervisor specific thing in VFIO,
>> which makes it a bit tricky to accomplish those tasks in kernel. If we 
>> can add Xen specific logic directly in VFIO (like vfio-iommu-xen you 
>> mentioned), the whole thing would be easier.
> 
> If vfio is hosted in dom0, then Xen is the platform and we need to
> interact with the hypervisor to manage the iommu.  That said, there are
> aspects of vfio that do not seem to map well to a hypervisor managed
> iommu or a Xen-like hypervisor.  For instance, how does dom0 manage
> iommu groups and what's the distinction of using vfio to manage a
> userspace driver in dom0 versus managing a device for another domain.
> In the case of kvm, vfio has no dependency on kvm, there is some minor
> interaction, but we're not running on kvm and it's not appropriate to
> use vfio as a gateway to interact with a hypervisor that may or may not
> exist.  Thanks,

Hi Alex,

Beyond the iommu, are there other aspects where vfio needs to interact
with Xen? E.g. to pass through MMIO, one has to call hypercalls to
establish EPT mappings.


--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support]
  2016-05-31  2:29               ` [Qemu-devel] " Jike Song
@ 2016-05-31 14:29                 ` Alex Williamson
  -1 siblings, 0 replies; 92+ messages in thread
From: Alex Williamson @ 2016-05-31 14:29 UTC (permalink / raw)
  To: Jike Song
  Cc: Ruan, Shuai, Tian, Kevin, cjia, kvm, qemu-devel, Kirti Wankhede,
	kraxel, pbonzini, bjsdjshi, Lv, Zhiyuan

On Tue, 31 May 2016 10:29:10 +0800
Jike Song <jike.song@intel.com> wrote:

> On 05/28/2016 10:56 PM, Alex Williamson wrote:
> > On Fri, 27 May 2016 22:43:54 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> >>
> >> My impression was that you don't like hypervisor specific thing in VFIO,
> >> which makes it a bit tricky to accomplish those tasks in kernel. If we 
> >> can add Xen specific logic directly in VFIO (like vfio-iommu-xen you 
> >> mentioned), the whole thing would be easier.  
> > 
> > If vfio is hosted in dom0, then Xen is the platform and we need to
> > interact with the hypervisor to manage the iommu.  That said, there are
> > aspects of vfio that do not seem to map well to a hypervisor managed
> > iommu or a Xen-like hypervisor.  For instance, how does dom0 manage
> > iommu groups and what's the distinction of using vfio to manage a
> > userspace driver in dom0 versus managing a device for another domain.
> > In the case of kvm, vfio has no dependency on kvm, there is some minor
> > interaction, but we're not running on kvm and it's not appropriate to
> > use vfio as a gateway to interact with a hypervisor that may or may not
> > exist.  Thanks,  
> 
> Hi Alex,
> 
> Beyond the iommu, are there other aspects where vfio needs to interact
> with Xen? E.g. to pass through MMIO, one has to call hypercalls to
> establish EPT mappings.

If it's part of running on a Xen platform and not trying to interact
with a VM in ways that are out of scope for vfio, I might be open to
it, I'd need to see a proposal.  This also goes back to my question of
how does vfio know whether it's configuring a device for a guest driver
or a guest VM, with kvm these are one and the same.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 3/3] VFIO Type1 IOMMU: Add support for mediated devices
  2016-05-24 19:58   ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-01  8:40     ` Dong Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Dong Jia @ 2016-06-01  8:40 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, shuai.ruan, jike.song, zhiyuan.lv

On Wed, 25 May 2016 01:28:17 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> +
> +/*
> + * Pin a set of guest PFNs and return their associated host PFNs for API
> + * supported domain only.
> + * @vaddr [in]: array of guest PFNs
> + * @npage [in]: count of array elements
> + * @prot [in] : protection flags
> + * @pfn_base[out] : array of host PFNs
> + */
> +long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
> +		   int prot, dma_addr_t *pfn_base)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL;
> +	int i = 0, ret = 0;
> +	long retpage;
> +	unsigned long remote_vaddr = 0;
> +	dma_addr_t *pfn = pfn_base;
> +	struct vfio_dma *dma;
> +
> +	if (!iommu || !vaddr || !pfn_base)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->lock);
> +
> +	if (!iommu->mediated_domain) {
> +		ret = -EINVAL;
> +		goto pin_done;
> +	}
> +
> +	domain = iommu->mediated_domain;
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_pfn *p, *lpfn;
> +		unsigned long tpfn;
> +		dma_addr_t iova;
> +		long pg_cnt = 1;
> +
> +		iova = vaddr[i] << PAGE_SHIFT;
Dear Kirti:

I have one question about the vaddr-to-iova conversion here.
Is this a common rule that can be applied to all architectures?
AFAIK, this is wrong for the s390 case, unless I am missing something...

If the answer to the above question is 'no', should we introduce a new
argument to pass in the iovas? Say 'dma_addr_t *iova'.

> +
> +		dma = vfio_find_dma(iommu, iova, 0 /*  size */);
> +		if (!dma) {
> +			ret = -EINVAL;
> +			goto pin_done;
> +		}
> +
> +		remote_vaddr = dma->vaddr + iova - dma->iova;
> +
> +		retpage = vfio_pin_pages_internal(domain, remote_vaddr,
> +						  pg_cnt, prot, &tpfn);
> +		if (retpage <= 0) {
> +			WARN_ON(!retpage);
> +			ret = (int)retpage;
> +			goto pin_done;
> +		}
> +
> +		pfn[i] = tpfn;
> +
> +		/* search if pfn exist */
> +		p = vfio_find_pfn(domain, tpfn);
> +		if (p) {
> +			atomic_inc(&p->ref_count);
> +			continue;
> +		}
> +
> +		/* add to pfn_list */
> +		lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
> +		if (!lpfn) {
> +			ret = -ENOMEM;
> +			goto pin_done;
> +		}
> +		lpfn->vaddr = remote_vaddr;
> +		lpfn->iova = iova;
> +		lpfn->pfn = pfn[i];
> +		lpfn->npage = 1;
> +		lpfn->prot = prot;
> +		atomic_inc(&lpfn->ref_count);
> +		vfio_link_pfn(domain, lpfn);
> +	}
> +
> +	ret = i;
> +
> +pin_done:
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL(vfio_pin_pages);


--------
Dong Jia


^ permalink raw reply	[flat|nested] 92+ messages in thread
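
The address arithmetic under discussion in the quoted loop can be modeled
in userspace as follows. This is a minimal sketch, not the patch itself:
struct sketch_dma and the sketch_* helpers are hypothetical stand-ins for
struct vfio_dma and vfio_find_dma(), a 4 KiB page size is assumed, and the
lookup is a linear scan for brevity. sketch_gfn_to_vaddr() mirrors the
quoted "iova = vaddr[i] << PAGE_SHIFT" step, while sketch_iova_to_vaddr()
shows the alternative raised in the review, where the caller supplies the
iova directly.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for one user DMA mapping (iova -> process vaddr). */
struct sketch_dma {
	uint64_t iova;	/* start of the I/O virtual range */
	uint64_t vaddr;	/* start of the backing user virtual range */
	uint64_t size;	/* length in bytes */
};

/* Find the mapping that covers iova, or NULL if none does. */
static const struct sketch_dma *
sketch_find_dma(const struct sketch_dma *maps, size_t n, uint64_t iova)
{
	for (size_t i = 0; i < n; i++)
		if (iova >= maps[i].iova && iova - maps[i].iova < maps[i].size)
			return &maps[i];
	return NULL;
}

/* Variant where the caller passes the iova itself. Returns 0 or -1. */
static int sketch_iova_to_vaddr(const struct sketch_dma *maps, size_t n,
				uint64_t iova, uint64_t *vaddr)
{
	const struct sketch_dma *dma = sketch_find_dma(maps, n, iova);

	if (!dma)
		return -1;
	/* Same offset arithmetic as the quoted remote_vaddr computation. */
	*vaddr = dma->vaddr + iova - dma->iova;
	return 0;
}

/* Variant matching the quoted loop: the iova is derived from a guest PFN. */
static int sketch_gfn_to_vaddr(const struct sketch_dma *maps, size_t n,
			       uint64_t gfn, uint64_t *vaddr)
{
	return sketch_iova_to_vaddr(maps, n, gfn << SKETCH_PAGE_SHIFT, vaddr);
}
```

The PFN-based variant only works when the iova of a page is always its
guest PFN shifted by the page size, which is the assumption being
questioned above; the iova-based variant leaves that choice to the caller.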

* Re: [RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support]
  2016-05-31 14:29                 ` [Qemu-devel] " Alex Williamson
@ 2016-06-02  2:11                   ` Jike Song
  -1 siblings, 0 replies; 92+ messages in thread
From: Jike Song @ 2016-06-02  2:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel,
	kvm, Ruan, Shuai, Lv, Zhiyuan, bjsdjshi

On 05/31/2016 10:29 PM, Alex Williamson wrote:
> On Tue, 31 May 2016 10:29:10 +0800
> Jike Song <jike.song@intel.com> wrote:
> 
>> On 05/28/2016 10:56 PM, Alex Williamson wrote:
>>> On Fri, 27 May 2016 22:43:54 +0000
>>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>>   
>>>>
>>>> My impression was that you don't like hypervisor specific thing in VFIO,
>>>> which makes it a bit tricky to accomplish those tasks in kernel. If we 
>>>> can add Xen specific logic directly in VFIO (like vfio-iommu-xen you 
>>>> mentioned), the whole thing would be easier.  
>>>
>>> If vfio is hosted in dom0, then Xen is the platform and we need to
>>> interact with the hypervisor to manage the iommu.  That said, there are
>>> aspects of vfio that do not seem to map well to a hypervisor managed
>>> iommu or a Xen-like hypervisor.  For instance, how does dom0 manage
>>> iommu groups and what's the distinction of using vfio to manage a
>>> userspace driver in dom0 versus managing a device for another domain.
>>> In the case of kvm, vfio has no dependency on kvm, there is some minor
>>> interaction, but we're not running on kvm and it's not appropriate to
>>> use vfio as a gateway to interact with a hypervisor that may or may not
>>> exist.  Thanks,  
>>
>> Hi Alex,
>>
>> Beyond the iommu, are there other aspects where vfio needs to interact
>> with Xen? E.g. to pass through MMIO, one has to call hypercalls to
>> establish EPT mappings.
> 
> If it's part of running on a Xen platform and not trying to interact
> with a VM in ways that are out of scope for vfio, I might be open to
> it, I'd need to see a proposal.  This also goes back to my question of
> how does vfio know whether it's configuring a device for a guest driver
> or a guest VM, with kvm these are one and the same.  Thanks,


Yes, this brings us back to Kevin's suggestion:

> I'm not sure whether VFIO can support this usage today. It is somehow 
> similar to channel io passthru in s390, where we also rely on Qemu to 
> mediate ccw commands to ensure isolation. Maybe just some slight 
> extension is required (e.g. not assume some API must be invoked). Of 
> course Qemu side vfio code also need some change. If this can work, 
> at least we can first put it as the enumeration interface for mediated 
> device in Xen. In the future it may be extended to cover normal Xen 
> PCI assignment as well instead of using sysfs to read PCI resource
> today.
> 
> If above works, then we have a sound plan to enable mediated devices 
> based on VFIO first for KVM, and then extend to Xen with reasonable 
> effort.

We'll work on the proposal, thanks!

--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 3/3] VFIO Type1 IOMMU: Add support for mediated devices
  2016-06-01  8:40     ` [Qemu-devel] " Dong Jia
@ 2016-06-02  7:56       ` Neo Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Neo Jia @ 2016-06-02  7:56 UTC (permalink / raw)
  To: Dong Jia
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, qemu-devel,
	kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv

On Wed, Jun 01, 2016 at 04:40:19PM +0800, Dong Jia wrote:
> On Wed, 25 May 2016 01:28:17 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > +
> > +/*
> > + * Pin a set of guest PFNs and return their associated host PFNs for API
> > + * supported domain only.
> > + * @vaddr [in]: array of guest PFNs
> > + * @npage [in]: count of array elements
> > + * @prot [in] : protection flags
> > + * @pfn_base[out] : array of host PFNs
> > + */
> > +long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
> > +		   int prot, dma_addr_t *pfn_base)
> > +{
> > +	struct vfio_iommu *iommu = iommu_data;
> > +	struct vfio_domain *domain = NULL;
> > +	int i = 0, ret = 0;
> > +	long retpage;
> > +	unsigned long remote_vaddr = 0;
> > +	dma_addr_t *pfn = pfn_base;
> > +	struct vfio_dma *dma;
> > +
> > +	if (!iommu || !vaddr || !pfn_base)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&iommu->lock);
> > +
> > +	if (!iommu->mediated_domain) {
> > +		ret = -EINVAL;
> > +		goto pin_done;
> > +	}
> > +
> > +	domain = iommu->mediated_domain;
> > +
> > +	for (i = 0; i < npage; i++) {
> > +		struct vfio_pfn *p, *lpfn;
> > +		unsigned long tpfn;
> > +		dma_addr_t iova;
> > +		long pg_cnt = 1;
> > +
> > +		iova = vaddr[i] << PAGE_SHIFT;
> Dear Kirti:
> 
> I have one question about the vaddr-to-iova conversion here.
> Is this a common rule that can be applied to all architectures?
> AFAIK, this is wrong for the s390 case, unless I am missing something...

I need more details about the "wrong" part. 
IIUC, you are thinking about the guest iommu case?

Thanks,
Neo

> 
> If the answer to the above question is 'no', should we introduce a new
> argument to pass in the iovas? Say 'dma_addr_t *iova'.
> 
> > +
> > +		dma = vfio_find_dma(iommu, iova, 0 /*  size */);
> > +		if (!dma) {
> > +			ret = -EINVAL;
> > +			goto pin_done;
> > +		}
> > +
> > +		remote_vaddr = dma->vaddr + iova - dma->iova;
> > +
> > +		retpage = vfio_pin_pages_internal(domain, remote_vaddr,
> > +						  pg_cnt, prot, &tpfn);
> > +		if (retpage <= 0) {
> > +			WARN_ON(!retpage);
> > +			ret = (int)retpage;
> > +			goto pin_done;
> > +		}
> > +
> > +		pfn[i] = tpfn;
> > +
> > +		/* search if pfn exist */
> > +		p = vfio_find_pfn(domain, tpfn);
> > +		if (p) {
> > +			atomic_inc(&p->ref_count);
> > +			continue;
> > +		}
> > +
> > +		/* add to pfn_list */
> > +		lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
> > +		if (!lpfn) {
> > +			ret = -ENOMEM;
> > +			goto pin_done;
> > +		}
> > +		lpfn->vaddr = remote_vaddr;
> > +		lpfn->iova = iova;
> > +		lpfn->pfn = pfn[i];
> > +		lpfn->npage = 1;
> > +		lpfn->prot = prot;
> > +		atomic_inc(&lpfn->ref_count);
> > +		vfio_link_pfn(domain, lpfn);
> > +	}
> > +
> > +	ret = i;
> > +
> > +pin_done:
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL(vfio_pin_pages);
> 
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread
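
The second half of the quoted loop keeps a per-domain list of pinned PFNs:
if the PFN is already tracked its reference count is bumped, otherwise a
new entry is allocated and linked. The following is a minimal userspace
sketch of that pattern only. struct sketch_pfn and both helpers are
hypothetical, a singly linked list stands in for whatever vfio_find_pfn()
and vfio_link_pfn() use internally, locking is omitted, and the unpin side
is an assumed counterpart that does not appear in the quoted hunk.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical tracking entry, loosely modeled on the quoted struct vfio_pfn. */
struct sketch_pfn {
	uint64_t pfn;
	unsigned int ref_count;
	struct sketch_pfn *next;
};

/*
 * Record one more pin of pfn. Returns the new reference count,
 * or 0 if a new entry could not be allocated.
 */
static unsigned int sketch_track_pin(struct sketch_pfn **head, uint64_t pfn)
{
	struct sketch_pfn *p;

	/* Already tracked: just take another reference. */
	for (p = *head; p; p = p->next)
		if (p->pfn == pfn)
			return ++p->ref_count;

	/* First pin of this pfn: allocate and link a new entry. */
	p = calloc(1, sizeof(*p));
	if (!p)
		return 0;
	p->pfn = pfn;
	p->ref_count = 1;
	p->next = *head;
	*head = p;
	return 1;
}

/*
 * Drop one pin of pfn. Returns the remaining reference count (the entry
 * is freed when it reaches 0), or -1 if pfn was not tracked.
 */
static int sketch_track_unpin(struct sketch_pfn **head, uint64_t pfn)
{
	for (struct sketch_pfn **pp = head; *pp; pp = &(*pp)->next) {
		struct sketch_pfn *p = *pp;

		if (p->pfn != pfn)
			continue;
		if (--p->ref_count == 0) {
			*pp = p->next;
			free(p);
			return 0;
		}
		return (int)p->ref_count;
	}
	return -1;
}
```

A caller would invoke sketch_track_pin() once per successfully pinned page
and sketch_track_unpin() once per release, unpinning the underlying page
only when the returned count reaches 0.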

* Re: [RFC PATCH v4 3/3] VFIO Type1 IOMMU: Add support for mediated devices
  2016-06-02  7:56       ` [Qemu-devel] " Neo Jia
@ 2016-06-03  8:32         ` Dong Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Dong Jia @ 2016-06-03  8:32 UTC (permalink / raw)
  To: Neo Jia
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, qemu-devel,
	kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv

On Thu, 2 Jun 2016 00:56:47 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Wed, Jun 01, 2016 at 04:40:19PM +0800, Dong Jia wrote:
> > On Wed, 25 May 2016 01:28:17 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > 
> > > +
> > > +/*
> > > + * Pin a set of guest PFNs and return their associated host PFNs for API
> > > + * supported domain only.
> > > + * @vaddr [in]: array of guest PFNs
> > > + * @npage [in]: count of array elements
> > > + * @prot [in] : protection flags
> > > + * @pfn_base[out] : array of host PFNs
> > > + */
> > > +long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
> > > +		   int prot, dma_addr_t *pfn_base)
> > > +{
> > > +	struct vfio_iommu *iommu = iommu_data;
> > > +	struct vfio_domain *domain = NULL;
> > > +	int i = 0, ret = 0;
> > > +	long retpage;
> > > +	unsigned long remote_vaddr = 0;
> > > +	dma_addr_t *pfn = pfn_base;
> > > +	struct vfio_dma *dma;
> > > +
> > > +	if (!iommu || !vaddr || !pfn_base)
> > > +		return -EINVAL;
> > > +
> > > +	mutex_lock(&iommu->lock);
> > > +
> > > +	if (!iommu->mediated_domain) {
> > > +		ret = -EINVAL;
> > > +		goto pin_done;
> > > +	}
> > > +
> > > +	domain = iommu->mediated_domain;
> > > +
> > > +	for (i = 0; i < npage; i++) {
> > > +		struct vfio_pfn *p, *lpfn;
> > > +		unsigned long tpfn;
> > > +		dma_addr_t iova;
> > > +		long pg_cnt = 1;
> > > +
> > > +		iova = vaddr[i] << PAGE_SHIFT;
> > Dear Kirti:
> > 
> > Got one question for the vaddr-iova conversion here.
> > Is this a common rule that can be applied to all architectures?
> > AFAIK, this is wrong for the s390 case. Or I must miss something...
> 
> I need more details about the "wrong" part. 
> IIUC, you are thinking about the guest iommu case?
> 
Dear Neo:

Sorry for the mistake I made. When I saw 'vaddr', I intuitively thought
it was a user-space virtual address. Now I see the comment, which says it
is the "array of guest PFNs".

After I modified my patches according to the right usage of this
argument, they worked fine. :>
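
For reference, the corrected reading of the quoted loop can be sketched in
user space. This is an illustration only, not code from the patch: the
sketch_* names, the flat array standing in for the type1 DMA list, and the
4 KiB page size are all assumptions. Each input element is a guest PFN, it
is shifted into an IOVA, and the IOVA is then translated to the host
virtual address through the covering mapping
(dma->vaddr + iova - dma->iova).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical stand-in for one entry of the type1 DMA mapping list. */
struct sketch_dma {
	uint64_t iova;  /* start of the mapped IOVA range */
	uint64_t vaddr; /* host user-space address backing that range */
	uint64_t size;  /* length of the range in bytes */
};

/* A guest PFN becomes an IOVA by shifting, as in the quoted loop. */
static uint64_t sketch_gfn_to_iova(uint64_t gfn)
{
	return gfn << SKETCH_PAGE_SHIFT;
}

/*
 * Find the mapping that covers iova and return the matching host virtual
 * address, or 0 if no mapping covers it.
 */
static uint64_t sketch_iova_to_vaddr(const struct sketch_dma *maps, size_t n,
				     uint64_t iova)
{
	for (size_t i = 0; i < n; i++) {
		const struct sketch_dma *dma = &maps[i];

		if (iova >= dma->iova && iova - dma->iova < dma->size)
			return dma->vaddr + (iova - dma->iova);
	}
	return 0;
}
```

So a caller that already holds a user-space virtual address has nothing
useful to pass in here; only the guest PFN is expected.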

> Thanks,
> Neo
> 
> > 
> > If the answer to the above question is 'no', should we introduce a new
> > argument to pass in the iovas? Say 'dma_addr_t *iova'.
> > 
> > > +
> > > +		dma = vfio_find_dma(iommu, iova, 0 /*  size */);
> > > +		if (!dma) {
> > > +			ret = -EINVAL;
> > > +			goto pin_done;
> > > +		}
> > > +
> > > +		remote_vaddr = dma->vaddr + iova - dma->iova;
> > > +
> > > +		retpage = vfio_pin_pages_internal(domain, remote_vaddr,
> > > +						  pg_cnt, prot, &tpfn);
> > > +		if (retpage <= 0) {
> > > +			WARN_ON(!retpage);
> > > +			ret = (int)retpage;
> > > +			goto pin_done;
> > > +		}
> > > +
> > > +		pfn[i] = tpfn;
> > > +
> > > +		/* search if pfn exist */
> > > +		p = vfio_find_pfn(domain, tpfn);
> > > +		if (p) {
> > > +			atomic_inc(&p->ref_count);
> > > +			continue;
> > > +		}
> > > +
> > > +		/* add to pfn_list */
> > > +		lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
> > > +		if (!lpfn) {
> > > +			ret = -ENOMEM;
> > > +			goto pin_done;
> > > +		}
> > > +		lpfn->vaddr = remote_vaddr;
> > > +		lpfn->iova = iova;
> > > +		lpfn->pfn = pfn[i];
> > > +		lpfn->npage = 1;
> > > +		lpfn->prot = prot;
> > > +		atomic_inc(&lpfn->ref_count);
> > > +		vfio_link_pfn(domain, lpfn);
> > > +	}
> > > +
> > > +	ret = i;
> > > +
> > > +pin_done:
> > > +	mutex_unlock(&iommu->lock);
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(vfio_pin_pages);
> > 
> > 
> > --------
> > Dong Jia
> > 
> 



--------
Dong Jia


^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [RFC PATCH v4 3/3] VFIO Type1 IOMMU: Add support for mediated devices
  2016-06-03  8:32         ` [Qemu-devel] " Dong Jia
@ 2016-06-03  8:37           ` Tian, Kevin
  -1 siblings, 0 replies; 92+ messages in thread
From: Tian, Kevin @ 2016-06-03  8:37 UTC (permalink / raw)
  To: Dong Jia, Neo Jia
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, qemu-devel,
	kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan

> From: Dong Jia [mailto:bjsdjshi@linux.vnet.ibm.com]
> Sent: Friday, June 03, 2016 4:32 PM
> 
> On Thu, 2 Jun 2016 00:56:47 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > On Wed, Jun 01, 2016 at 04:40:19PM +0800, Dong Jia wrote:
> > > On Wed, 25 May 2016 01:28:17 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >
> > > > +
> > > > +/*
> > > > + * Pin a set of guest PFNs and return their associated host PFNs for API
> > > > + * supported domain only.
> > > > + * @vaddr [in]: array of guest PFNs
> > > > + * @npage [in]: count of array elements
> > > > + * @prot [in] : protection flags
> > > > + * @pfn_base[out] : array of host PFNs
> > > > + */
> > > > +long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
> > > > +		   int prot, dma_addr_t *pfn_base)
> > > > +{
> > > > +	struct vfio_iommu *iommu = iommu_data;
> > > > +	struct vfio_domain *domain = NULL;
> > > > +	int i = 0, ret = 0;
> > > > +	long retpage;
> > > > +	unsigned long remote_vaddr = 0;
> > > > +	dma_addr_t *pfn = pfn_base;
> > > > +	struct vfio_dma *dma;
> > > > +
> > > > +	if (!iommu || !vaddr || !pfn_base)
> > > > +		return -EINVAL;
> > > > +
> > > > +	mutex_lock(&iommu->lock);
> > > > +
> > > > +	if (!iommu->mediated_domain) {
> > > > +		ret = -EINVAL;
> > > > +		goto pin_done;
> > > > +	}
> > > > +
> > > > +	domain = iommu->mediated_domain;
> > > > +
> > > > +	for (i = 0; i < npage; i++) {
> > > > +		struct vfio_pfn *p, *lpfn;
> > > > +		unsigned long tpfn;
> > > > +		dma_addr_t iova;
> > > > +		long pg_cnt = 1;
> > > > +
> > > > +		iova = vaddr[i] << PAGE_SHIFT;
> > > Dear Kirti:
> > >
> > > Got one question for the vaddr-iova conversion here.
> > > Is this a common rule that can be applied to all architectures?
> > > AFAIK, this is wrong for the s390 case. Or I must miss something...
> >
> > I need more details about the "wrong" part.
> > IIUC, you are thinking about the guest iommu case?
> >
> Dear Neo:
> 
> Sorry for the mistake I made. When I saw 'vaddr', I intuitively thought
> it is an user-space virtual address. Now I saw the comment which says it
> is the "array of guest PFNs".
> 
> After I modify my patches according to the right usage of this
> argument, they worked fine. :>
> 

Maybe renaming 'vaddr' to 'iova_array' would be clearer, given that the
elements of this array are called 'iova' later, and the new name would
distinguish it from 'remote_vaddr', which is a real virtual address.
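
A rough user-space sketch of how the loop might read after such a rename
(illustrative only; sketch_fill_iovas, sketch_dma_addr_t and the 4 KiB
page size are assumptions, and the pinning itself is left out):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

typedef uint64_t sketch_dma_addr_t; /* stand-in for the kernel's dma_addr_t */

/*
 * Same loop shape as the quoted vfio_pin_pages(), with the input renamed:
 * iova_array[i] is a guest PFN and iova is the byte address derived from it.
 * Returns the number of entries processed, or -1 on bad arguments.
 */
static long sketch_fill_iovas(const sketch_dma_addr_t *iova_array, long npage,
			      sketch_dma_addr_t *iova_out)
{
	long i;

	if (!iova_array || !iova_out || npage < 0)
		return -1;

	for (i = 0; i < npage; i++) {
		sketch_dma_addr_t iova = iova_array[i] << SKETCH_PAGE_SHIFT;

		iova_out[i] = iova;
	}
	return i;
}
```

With this naming, nothing in the function body is called 'vaddr' except
the value that really is a virtual address.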

Thanks
Kevin

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-05-24 19:58   ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-03  8:57     ` Dong Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Dong Jia @ 2016-06-03  8:57 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, shuai.ruan, jike.song, zhiyuan.lv

On Wed, 25 May 2016 01:28:15 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:


...snip...

> +struct phy_device_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **mdev_attr_groups;
> +
> +	int	(*supported_config)(struct device *dev, char *config);
> +	int     (*create)(struct device *dev, uuid_le uuid,
> +			  uint32_t instance, char *mdev_params);
> +	int     (*destroy)(struct device *dev, uuid_le uuid,
> +			   uint32_t instance);
> +	int     (*start)(uuid_le uuid);
> +	int     (*shutdown)(uuid_le uuid);
> +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> +			enum mdev_emul_space address_space, loff_t pos);
> +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> +			 enum mdev_emul_space address_space, loff_t pos);
> +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> +			    unsigned int index, unsigned int start,
> +			    unsigned int count, void *data);
> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> +				 struct pci_region_info *region_info);
> +	int	(*validate_map_request)(struct mdev_device *vdev,
> +					unsigned long virtaddr,
> +					unsigned long *pfn, unsigned long *size,
> +					pgprot_t *prot);
> +};

Dear Kirti:

When I rebased my vfio-ccw patches on this series, I found that I need an
extra 'ioctl' callback in phy_device_ops.

The ccw physical device supports only one ccw mediated device, and I
have two new ioctl commands for the ccw mediated device. One hot-resets
the resource in the physical device that is allocated for the mediated
device; the other translates an I/O instruction and performs the I/O
operation on the physical device. I found that the existing callbacks
could not meet my requirements.

Something like the following would be fine for my case:
	int (*ioctl)(struct mdev_device *vdev,
		     unsigned int cmd,
		     unsigned long arg);

What do you think about this?
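
To make the proposal concrete, here is a small user-space sketch of how
such a callback could be dispatched. Everything in it is hypothetical:
the sketch_* types, the two command numbers, and the choice of -ENOTTY for
an unhandled command are assumptions for illustration, not part of this
series or of vfio-ccw.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical mediated-device state, just enough to observe the calls. */
struct sketch_mdev {
	int resets;
	int io_requests;
};

/* Hypothetical command numbers; the real values are not given in this thread. */
enum {
	SKETCH_CMD_HOT_RESET = 1,
	SKETCH_CMD_IO_REQUEST = 2,
};

/* Only the proposed callback is shown; the other ops are omitted. */
struct sketch_phy_device_ops {
	int (*ioctl)(struct sketch_mdev *vdev, unsigned int cmd,
		     unsigned long arg);
};

/* What a ccw-style vendor driver might plug in. */
static int sketch_ccw_ioctl(struct sketch_mdev *vdev, unsigned int cmd,
			    unsigned long arg)
{
	(void)arg; /* unused in this sketch */

	switch (cmd) {
	case SKETCH_CMD_HOT_RESET:
		vdev->resets++;
		return 0;
	case SKETCH_CMD_IO_REQUEST:
		vdev->io_requests++;
		return 0;
	default:
		return -ENOTTY;
	}
}

/* Core side: forward the command if the vendor driver provides ioctl. */
static int sketch_mdev_ioctl(const struct sketch_phy_device_ops *ops,
			     struct sketch_mdev *vdev, unsigned int cmd,
			     unsigned long arg)
{
	if (!ops || !ops->ioctl)
		return -ENOTTY;
	return ops->ioctl(vdev, cmd, arg);
}
```

A driver that leaves the callback NULL would simply see every
device-specific command rejected.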

--------
Dong Jia


^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-03  8:57     ` [Qemu-devel] " Dong Jia
@ 2016-06-03  9:40       ` Tian, Kevin
  -1 siblings, 0 replies; 92+ messages in thread
From: Tian, Kevin @ 2016-06-03  9:40 UTC (permalink / raw)
  To: Dong Jia, Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan

> From: Dong Jia [mailto:bjsdjshi@linux.vnet.ibm.com]
> Sent: Friday, June 03, 2016 4:58 PM
> 
> 
> ...snip...
> 
> > +struct phy_device_ops {
> > +	struct module   *owner;
> > +	const struct attribute_group **dev_attr_groups;
> > +	const struct attribute_group **mdev_attr_groups;
> > +
> > +	int	(*supported_config)(struct device *dev, char *config);
> > +	int     (*create)(struct device *dev, uuid_le uuid,
> > +			  uint32_t instance, char *mdev_params);
> > +	int     (*destroy)(struct device *dev, uuid_le uuid,
> > +			   uint32_t instance);
> > +	int     (*start)(uuid_le uuid);
> > +	int     (*shutdown)(uuid_le uuid);
> > +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> > +			enum mdev_emul_space address_space, loff_t pos);
> > +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> > +			 enum mdev_emul_space address_space, loff_t pos);
> > +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> > +			    unsigned int index, unsigned int start,
> > +			    unsigned int count, void *data);
> > +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> > +				 struct pci_region_info *region_info);
> > +	int	(*validate_map_request)(struct mdev_device *vdev,
> > +					unsigned long virtaddr,
> > +					unsigned long *pfn, unsigned long *size,
> > +					pgprot_t *prot);
> > +};
> 
> Dear Kirti:
> 
> When I rebased my vfio-ccw patches on this series, I found I need an
> extra 'ioctl' callback in phy_device_ops.
> 
> The ccw physical device only supports one ccw mediated device. And I
> have two new ioctl commands for the ccw mediated device. One is
> to hot-reset the resource in the physical device that allocated for
> the mediated device, the other is to do an I/O instruction translation
> and perform an I/O operation on the physical device. I found the
> existing callbacks could not meet my requirements.
> 
> Something like the following would be fine for my case:
> 	int (*ioctl)(struct mdev_device *vdev,
> 		     unsigned int cmd,
> 		     unsigned long arg);
> 
> What do you think about this?
> 

'reset' should be generic. It would be better to define an individual
callback for it (then we can also expose a node under the vgpu path in
sysfs).
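
As a rough illustration of that split (a user-space sketch with
hypothetical sketch_* names and an assumed -EINVAL for a missing callback,
not code from this series), reset gets its own slot in the ops table so
that any entry point, whether an ioctl handler or a sysfs store routine,
can reach it through one helper:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical mediated-device state, just enough to observe a reset. */
struct sketch_mdev {
	int resets;
};

struct sketch_phy_device_ops {
	/* Generic: meaningful for every mediated device type. */
	int (*reset)(struct sketch_mdev *vdev);
	/* Device-specific commands would stay behind a separate hook. */
	int (*ioctl)(struct sketch_mdev *vdev, unsigned int cmd,
		     unsigned long arg);
};

/* Example vendor implementation of the dedicated reset callback. */
static int sketch_vendor_reset(struct sketch_mdev *vdev)
{
	vdev->resets++;
	return 0;
}

/* Single helper that every caller (ioctl path, sysfs node) would share. */
static int sketch_mdev_reset(const struct sketch_phy_device_ops *ops,
			     struct sketch_mdev *vdev)
{
	if (!ops || !ops->reset)
		return -EINVAL;
	return ops->reset(vdev);
}
```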

Thanks
Kevin

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-03  9:40       ` [Qemu-devel] " Tian, Kevin
@ 2016-06-06  2:24         ` Dong Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Dong Jia @ 2016-06-06  2:24 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia,
	qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Fri, 3 Jun 2016 09:40:16 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Dong Jia [mailto:bjsdjshi@linux.vnet.ibm.com]
> > Sent: Friday, June 03, 2016 4:58 PM
> > 
> > 
> > ...snip...
> > 
> > > +struct phy_device_ops {
> > > +	struct module   *owner;
> > > +	const struct attribute_group **dev_attr_groups;
> > > +	const struct attribute_group **mdev_attr_groups;
> > > +
> > > +	int	(*supported_config)(struct device *dev, char *config);
> > > +	int     (*create)(struct device *dev, uuid_le uuid,
> > > +			  uint32_t instance, char *mdev_params);
> > > +	int     (*destroy)(struct device *dev, uuid_le uuid,
> > > +			   uint32_t instance);
> > > +	int     (*start)(uuid_le uuid);
> > > +	int     (*shutdown)(uuid_le uuid);
> > > +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> > > +			enum mdev_emul_space address_space, loff_t pos);
> > > +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> > > +			 enum mdev_emul_space address_space, loff_t pos);
> > > +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> > > +			    unsigned int index, unsigned int start,
> > > +			    unsigned int count, void *data);
> > > +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> > > +				 struct pci_region_info *region_info);
> > > +	int	(*validate_map_request)(struct mdev_device *vdev,
> > > +					unsigned long virtaddr,
> > > +					unsigned long *pfn, unsigned long *size,
> > > +					pgprot_t *prot);
> > > +};
> > 
> > Dear Kirti:
> > 
> > When I rebased my vfio-ccw patches on this series, I found I need an
> > extra 'ioctl' callback in phy_device_ops.
> > 
> > The ccw physical device only supports one ccw mediated device. And I
> > have two new ioctl commands for the ccw mediated device. One is
> > to hot-reset the resource in the physical device that allocated for
> > the mediated device, the other is to do an I/O instruction translation
> > and perform an I/O operation on the physical device. I found the
> > existing callbacks could not meet my requirements.
> > 
> > Something like the following would be fine for my case:
> > 	int (*ioctl)(struct mdev_device *vdev,
> > 		     unsigned int cmd,
> > 		     unsigned long arg);
> > 
> > What do you think about this?
> > 
> 
> 'reset' should be generic. better to define an individual callback
> for it (then we can also expose a node under vgpu path in sysfs).
> 
Sounds reasonable to me. :>

> Thanks
> Kevin
> 



--------
Dong Jia


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-03  8:57     ` [Qemu-devel] " Dong Jia
@ 2016-06-06  5:27       ` Kirti Wankhede
  -1 siblings, 0 replies; 92+ messages in thread
From: Kirti Wankhede @ 2016-06-06  5:27 UTC (permalink / raw)
  To: Dong Jia
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, shuai.ruan, jike.song, zhiyuan.lv



On 6/3/2016 2:27 PM, Dong Jia wrote:
> On Wed, 25 May 2016 01:28:15 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> 
> ...snip...
> 
>> +struct phy_device_ops {
>> +	struct module   *owner;
>> +	const struct attribute_group **dev_attr_groups;
>> +	const struct attribute_group **mdev_attr_groups;
>> +
>> +	int	(*supported_config)(struct device *dev, char *config);
>> +	int     (*create)(struct device *dev, uuid_le uuid,
>> +			  uint32_t instance, char *mdev_params);
>> +	int     (*destroy)(struct device *dev, uuid_le uuid,
>> +			   uint32_t instance);
>> +	int     (*start)(uuid_le uuid);
>> +	int     (*shutdown)(uuid_le uuid);
>> +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
>> +			enum mdev_emul_space address_space, loff_t pos);
>> +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
>> +			 enum mdev_emul_space address_space, loff_t pos);
>> +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
>> +			    unsigned int index, unsigned int start,
>> +			    unsigned int count, void *data);
>> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
>> +				 struct pci_region_info *region_info);
>> +	int	(*validate_map_request)(struct mdev_device *vdev,
>> +					unsigned long virtaddr,
>> +					unsigned long *pfn, unsigned long *size,
>> +					pgprot_t *prot);
>> +};
> 
> Dear Kirti:
> 
> When I rebased my vfio-ccw patches on this series, I found I need an
> extra 'ioctl' callback in phy_device_ops.
> 

Thanks for taking a closer look. As far as I know, ccw is not a PCI
device, right? Correct me if I'm wrong. I'm curious to know: are you
planning to write a driver (vfio-mccw) for the mediated ccw device?

Thanks,
Kirti

> The ccw physical device only supports one ccw mediated device. And I
> have two new ioctl commands for the ccw mediated device. One is 
> to hot-reset the resource in the physical device that allocated for
> the mediated device, the other is to do an I/O instruction translation
> and perform an I/O operation on the physical device. I found the
> existing callbacks could not meet my requirements.
> 
> Something like the following would be fine for my case:
> 	int (*ioctl)(struct mdev_device *vdev,
> 		     unsigned int cmd,
> 		     unsigned long arg);
> 
> What do you think about this?
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-06  5:27       ` [Qemu-devel] " Kirti Wankhede
@ 2016-06-06  6:01         ` Dong Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Dong Jia @ 2016-06-06  6:01 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, shuai.ruan, jike.song, zhiyuan.lv

On Mon, 6 Jun 2016 10:57:49 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> 
> 
> On 6/3/2016 2:27 PM, Dong Jia wrote:
> > On Wed, 25 May 2016 01:28:15 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > 
> > 
> > ...snip...
> > 
> >> +struct phy_device_ops {
> >> +	struct module   *owner;
> >> +	const struct attribute_group **dev_attr_groups;
> >> +	const struct attribute_group **mdev_attr_groups;
> >> +
> >> +	int	(*supported_config)(struct device *dev, char *config);
> >> +	int     (*create)(struct device *dev, uuid_le uuid,
> >> +			  uint32_t instance, char *mdev_params);
> >> +	int     (*destroy)(struct device *dev, uuid_le uuid,
> >> +			   uint32_t instance);
> >> +	int     (*start)(uuid_le uuid);
> >> +	int     (*shutdown)(uuid_le uuid);
> >> +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> >> +			enum mdev_emul_space address_space, loff_t pos);
> >> +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> >> +			 enum mdev_emul_space address_space, loff_t pos);
> >> +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> >> +			    unsigned int index, unsigned int start,
> >> +			    unsigned int count, void *data);
> >> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> >> +				 struct pci_region_info *region_info);
> >> +	int	(*validate_map_request)(struct mdev_device *vdev,
> >> +					unsigned long virtaddr,
> >> +					unsigned long *pfn, unsigned long *size,
> >> +					pgprot_t *prot);
> >> +};
> > 
> > Dear Kirti:
> > 
> > When I rebased my vfio-ccw patches on this series, I found I need an
> > extra 'ioctl' callback in phy_device_ops.
> > 
> 
> Thanks for taking closer look. As per my knowledge ccw is not PCI
> device, right? Correct me if I'm wrong.
Dear Kirti:

You are right. CCW is different from PCI. The official term is 'Channel
I/O device'. Such devices use 'channels' (co-processors) and CCWs
(channel command words) to handle I/O operations.

> I'm curious to know. Are you planning to write a driver (vfio-mccw) for
> mediated ccw device?
I wrote two drivers:
1. A vfio-pccw driver for the physical ccw device, which registers
the device and its callbacks with the mdev framework. With this, I can
then create a mediated ccw device for the physical one.
2. A vfio-mccw driver for the mediated ccw device, which adds
itself to a vfio_group, mimicking what vfio-mpci does.

The problem is that vfio-mccw needs to implement new ioctls besides the
existing ones (VFIO_DEVICE_GET_INFO, etc.), and these ioctls really need
the physical device's help to be handled.
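
For illustration only, the forwarding described above could look roughly
like the following userspace sketch. It is not code from the series: the
demo_* names, the cut-down ops table and the command numbers are all
hypothetical, and only the shape of the proposed 'ioctl' callback is
taken from this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical command numbers, for illustration only. */
#define DEMO_CCW_HOT_RESET   1u
#define DEMO_CCW_CMD_REQUEST 2u

struct demo_mdev;

/* Cut-down stand-in for phy_device_ops with the proposed extra callback. */
struct demo_phy_ops {
	int (*ioctl)(struct demo_mdev *mdev, unsigned int cmd,
		     unsigned long arg);
};

struct demo_mdev {
	const struct demo_phy_ops *ops;	/* registered by the physical driver */
	unsigned int resets;		/* resets handled so far (demo only) */
	unsigned long last_arg;		/* argument of the last command request */
};

/* Physical driver side (the vfio-pccw role): device-specific handling. */
static int demo_pccw_ioctl(struct demo_mdev *mdev, unsigned int cmd,
			   unsigned long arg)
{
	switch (cmd) {
	case DEMO_CCW_HOT_RESET:
		mdev->resets++;
		return 0;
	case DEMO_CCW_CMD_REQUEST:
		mdev->last_arg = arg;
		return 0;
	default:
		return -ENOTTY;	/* unknown command */
	}
}

static const struct demo_phy_ops demo_pccw_ops = {
	.ioctl = demo_pccw_ioctl,
};

/* Mediated driver side (the vfio-mccw role): forward to the physical driver. */
static int demo_mccw_ioctl(struct demo_mdev *mdev, unsigned int cmd,
			   unsigned long arg)
{
	assert(mdev != NULL);
	if (!mdev->ops || !mdev->ops->ioctl)
		return -ENOTTY;	/* physical driver offers no ioctl callback */
	return mdev->ops->ioctl(mdev, cmd, arg);
}
```

The mediated driver stays generic here: it only checks that a callback
was registered and passes everything else through, so device-specific
commands live entirely in the physical driver.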

> 
> Thanks,
> Kirti
> 
> > The ccw physical device only supports one ccw mediated device. And I
> > have two new ioctl commands for the ccw mediated device. One is 
> > to hot-reset the resource in the physical device that allocated for
> > the mediated device, the other is to do an I/O instruction translation
> > and perform an I/O operation on the physical device. I found the
> > existing callbacks could not meet my requirements.
> > 
> > Something like the following would be fine for my case:
> > 	int (*ioctl)(struct mdev_device *vdev,
> > 		     unsigned int cmd,
> > 		     unsigned long arg);
> > 
> > What do you think about this?
> > 
> > --------
> > Dong Jia
> > 
> 

--------
Dong Jia


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-06  6:01         ` [Qemu-devel] " Dong Jia
@ 2016-06-06  6:27           ` Neo Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Neo Jia @ 2016-06-06  6:27 UTC (permalink / raw)
  To: Dong Jia
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, qemu-devel,
	kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv

On Mon, Jun 06, 2016 at 02:01:48PM +0800, Dong Jia wrote:
> On Mon, 6 Jun 2016 10:57:49 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > 
> > 
> > On 6/3/2016 2:27 PM, Dong Jia wrote:
> > > On Wed, 25 May 2016 01:28:15 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > 
> > > 
> > > ...snip...
> > > 
> > >> +struct phy_device_ops {
> > >> +	struct module   *owner;
> > >> +	const struct attribute_group **dev_attr_groups;
> > >> +	const struct attribute_group **mdev_attr_groups;
> > >> +
> > >> +	int	(*supported_config)(struct device *dev, char *config);
> > >> +	int     (*create)(struct device *dev, uuid_le uuid,
> > >> +			  uint32_t instance, char *mdev_params);
> > >> +	int     (*destroy)(struct device *dev, uuid_le uuid,
> > >> +			   uint32_t instance);
> > >> +	int     (*start)(uuid_le uuid);
> > >> +	int     (*shutdown)(uuid_le uuid);
> > >> +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> > >> +			enum mdev_emul_space address_space, loff_t pos);
> > >> +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> > >> +			 enum mdev_emul_space address_space, loff_t pos);
> > >> +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> > >> +			    unsigned int index, unsigned int start,
> > >> +			    unsigned int count, void *data);
> > >> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> > >> +				 struct pci_region_info *region_info);
> > >> +	int	(*validate_map_request)(struct mdev_device *vdev,
> > >> +					unsigned long virtaddr,
> > >> +					unsigned long *pfn, unsigned long *size,
> > >> +					pgprot_t *prot);
> > >> +};
> > > 
> > > Dear Kirti:
> > > 
> > > When I rebased my vfio-ccw patches on this series, I found I need an
> > > extra 'ioctl' callback in phy_device_ops.
> > > 
> > 
> > Thanks for taking closer look. As per my knowledge ccw is not PCI
> > device, right? Correct me if I'm wrong.
> Dear Kirti:
> 
> You are right. CCW is different to PCI. The official term is 'Channel
> I/O device'. They use 'Channels' (co-processors) and CCWs (channel
> command words) to handle I/O operations.
> 
> > I'm curious to know. Are you planning to write a driver (vfio-mccw) for
> > mediated ccw device?
> I wrote two drivers:
> 1. A vfio-pccw driver for the physical ccw device, which will reigister
> the device and callbacks to mdev framework. With this, I could create
> a mediated ccw device for the physical one then.
> 2. A vfio-mccw driver for the mediated ccw device, which will add
> itself to a vfio_group, mimiced what vfio-mpci did.
> 
> The problem is, vfio-mccw need to implement new ioctls besides the
> existing ones (VFIO_DEVICE_GET_INFO, etc). And these ioctls really need
> the physical device help to handle.

Hi Dong,

Could you please help us understand a bit more about the new VFIO ioctl?
Since it is a new ioctl, it is sent down by QEMU in this case, right?
Could you share more details?

Thanks,
Neo

> 
> > 
> > Thanks,
> > Kirti
> > 
> > > The ccw physical device only supports one ccw mediated device. And I
> > > have two new ioctl commands for the ccw mediated device. One is 
> > > to hot-reset the resource in the physical device that allocated for
> > > the mediated device, the other is to do an I/O instruction translation
> > > and perform an I/O operation on the physical device. I found the
> > > existing callbacks could not meet my requirements.
> > > 
> > > Something like the following would be fine for my case:
> > > 	int (*ioctl)(struct mdev_device *vdev,
> > > 		     unsigned int cmd,
> > > 		     unsigned long arg);
> > > 
> > > What do you think about this?
> > > 
> > > --------
> > > Dong Jia
> > > 
> > 
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-06  6:27           ` [Qemu-devel] " Neo Jia
@ 2016-06-06  8:29             ` Dong Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Dong Jia @ 2016-06-06  8:29 UTC (permalink / raw)
  To: Neo Jia
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, qemu-devel,
	kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv, Dong Jia

On Sun, 5 Jun 2016 23:27:42 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Mon, Jun 06, 2016 at 02:01:48PM +0800, Dong Jia wrote:
> > On Mon, 6 Jun 2016 10:57:49 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > 
> > > 
> > > 
> > > On 6/3/2016 2:27 PM, Dong Jia wrote:
> > > > On Wed, 25 May 2016 01:28:15 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > 
> > > > 
> > > > ...snip...
> > > > 
> > > >> +struct phy_device_ops {
> > > >> +	struct module   *owner;
> > > >> +	const struct attribute_group **dev_attr_groups;
> > > >> +	const struct attribute_group **mdev_attr_groups;
> > > >> +
> > > >> +	int	(*supported_config)(struct device *dev, char *config);
> > > >> +	int     (*create)(struct device *dev, uuid_le uuid,
> > > >> +			  uint32_t instance, char *mdev_params);
> > > >> +	int     (*destroy)(struct device *dev, uuid_le uuid,
> > > >> +			   uint32_t instance);
> > > >> +	int     (*start)(uuid_le uuid);
> > > >> +	int     (*shutdown)(uuid_le uuid);
> > > >> +	ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
> > > >> +			enum mdev_emul_space address_space, loff_t pos);
> > > >> +	ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
> > > >> +			 enum mdev_emul_space address_space, loff_t pos);
> > > >> +	int     (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
> > > >> +			    unsigned int index, unsigned int start,
> > > >> +			    unsigned int count, void *data);
> > > >> +	int	(*get_region_info)(struct mdev_device *vdev, int region_index,
> > > >> +				 struct pci_region_info *region_info);
> > > >> +	int	(*validate_map_request)(struct mdev_device *vdev,
> > > >> +					unsigned long virtaddr,
> > > >> +					unsigned long *pfn, unsigned long *size,
> > > >> +					pgprot_t *prot);
> > > >> +};
> > > > 
> > > > Dear Kirti:
> > > > 
> > > > When I rebased my vfio-ccw patches on this series, I found I need an
> > > > extra 'ioctl' callback in phy_device_ops.
> > > > 
> > > 
> > > Thanks for taking closer look. As per my knowledge ccw is not PCI
> > > device, right? Correct me if I'm wrong.
> > Dear Kirti:
> > 
> > You are right. CCW is different to PCI. The official term is 'Channel
> > I/O device'. They use 'Channels' (co-processors) and CCWs (channel
> > command words) to handle I/O operations.
> > 
> > > I'm curious to know. Are you planning to write a driver (vfio-mccw) for
> > > mediated ccw device?
> > I wrote two drivers:
> > 1. A vfio-pccw driver for the physical ccw device, which will reigister
> > the device and callbacks to mdev framework. With this, I could create
> > a mediated ccw device for the physical one then.
> > 2. A vfio-mccw driver for the mediated ccw device, which will add
> > itself to a vfio_group, mimiced what vfio-mpci did.
> > 
> > The problem is, vfio-mccw need to implement new ioctls besides the
> > existing ones (VFIO_DEVICE_GET_INFO, etc). And these ioctls really need
> > the physical device help to handle.
> 
> Hi Dong,
> 
> Could you please help us understand a bit more about the new VFIO ioctl?
Dear Neo,

Sure, with pleasure.
Since I tried not to bring too many ccw-specific technical details
here, I wrote quite briefly. Please feel free to ask me for
more. :>

> Since it is a new ioctl it is send down by QEMU in this case right?
Right.

> More details?
As mentioned in my earlier emails, I have currently added two new ioctl
commands.

1. VFIO_DEVICE_CCW_HOT_RESET
Both the name and the purpose of this command look much the same as
those of VFIO_DEVICE_PCI_HOT_RESET.
Since Kevin proposed an individual callback for hot-reset handling, I
believe this will not be a problem (provided you accept the proposal).

2. VFIO_DEVICE_CCW_CMD_REQUEST
This is intended to handle an intercepted channel I/O instruction. It
basically needs to do the following:
  a. Copy the raw data of the CCW program (a group of chained CCWs) from
     user space into kernel space buffers.
  b. Translate the CCW program based on the raw data to get a CCW
     program that can run on the real device. We'd pin pages for those
     CCWs which have memory space pointers for their offload, and update
     the CCW program with the pinned results (phys).
  c. Issue the translated CCW program to the real device to perform the
     I/O operation, and wait for the I/O result interrupt.
  d. Once we get the I/O result, copy it back to user space, and
     unpin the pages.

Step c can only be done by the physical device driver, since that is
the driver the int_handler belongs to.
Steps b and d should also be done by the physical device driver. Or
should we pin/unpin pages in the mediated device driver?

That's why I asked for the new callback.
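
For illustration only, steps a to d can be walked through in a small
userspace sketch. Nothing here is from the actual vfio-ccw patches: the
demo_* names are hypothetical, struct demo_ccw is not the real CCW
layout, and "pinning" is faked by adding a fixed offset to the guest
address. A real driver would copy from and to user space (for example
with copy_from_user()) and wait for the I/O interrupt in step c.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_CCWS    8
#define DEMO_HOST_OFFSET 0x100000000ULL	/* toy guest-to-host translation */

/* Simplified stand-in for a channel command word; not the real layout. */
struct demo_ccw {
	uint8_t  cmd;
	uint16_t count;
	uint64_t addr;	/* guest address in, "host" address after translation */
};

struct demo_request {
	struct demo_ccw prog[DEMO_MAX_CCWS];	/* kernel-side copy */
	size_t nr;
	int pinned;				/* pages currently pinned */
};

/* Toy pin: count the page and return the translated address. */
static uint64_t demo_pin(struct demo_request *req, uint64_t guest_addr)
{
	req->pinned++;
	return guest_addr + DEMO_HOST_OFFSET;
}

static void demo_unpin_all(struct demo_request *req)
{
	req->pinned = 0;
}

/* Stand-in for step c: "run" the program; the fake status is the CCW count. */
static int demo_issue(const struct demo_request *req)
{
	return (int)req->nr;
}

static int demo_cmd_request(struct demo_request *req,
			    const struct demo_ccw *user_prog, size_t nr,
			    int *user_result)
{
	size_t i;

	if (nr == 0 || nr > DEMO_MAX_CCWS)
		return -EINVAL;

	/* a. copy the raw CCW program into our own buffer */
	memcpy(req->prog, user_prog, nr * sizeof(*user_prog));
	req->nr = nr;
	req->pinned = 0;

	/* b. translate: pin each data page and patch the address */
	for (i = 0; i < nr; i++)
		req->prog[i].addr = demo_pin(req, req->prog[i].addr);

	/* c. issue the translated program and collect the result */
	*user_result = demo_issue(req);

	/* d. the result has been handed back; now unpin the pages */
	demo_unpin_all(req);
	assert(req->pinned == 0);
	return 0;
}
```

In this sketch the translation works on a private copy, so the caller's
program is left untouched, and every page pinned in step b is released
in step d before returning.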

> 
> Thanks,
> Neo
> 
> > 
> > > 
> > > Thanks,
> > > Kirti
> > > 
> > > > The ccw physical device only supports one ccw mediated device. And I
> > > > have two new ioctl commands for the ccw mediated device. One is 
> > > > to hot-reset the resource in the physical device that allocated for
> > > > the mediated device, the other is to do an I/O instruction translation
> > > > and perform an I/O operation on the physical device. I found the
> > > > existing callbacks could not meet my requirements.
> > > > 
> > > > Something like the following would be fine for my case:
> > > > 	int (*ioctl)(struct mdev_device *vdev,
> > > > 		     unsigned int cmd,
> > > > 		     unsigned long arg);
> > > > 
> > > > What do you think about this?
> > > > 
> > > > --------
> > > > Dong Jia
> > > > 
> > > 
> > 
> > --------
> > Dong Jia
> > 
> 



--------
Dong Jia


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-06  8:29             ` [Qemu-devel] " Dong Jia
@ 2016-06-06 17:44               ` Neo Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Neo Jia @ 2016-06-06 17:44 UTC (permalink / raw)
  To: Dong Jia
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, qemu-devel,
	kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv

On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:
> On Sun, 5 Jun 2016 23:27:42 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> 2. VFIO_DEVICE_CCW_CMD_REQUEST
> This intends to handle an intercepted channel I/O instruction. It
> basically need to do the following thing:

May I ask how and when QEMU knows that it needs to issue such a VFIO ioctl
in the first place?

Thanks,
Neo

>   a. Copy the raw data of the CCW program (a group of chained CCWs) from
>      user into kernel space buffers.
>   b. Do CCW program translation based on the raw data to get a
>      real-device runnable CCW program. We'd pin pages for those CCWs
>      which have memory space pointers for their offload, and update the
>      CCW program with the pinned results (phys).
>   c. Issue the translated CCW program to a real-device to perform the
>      I/O operation, and wait for the I/O result interrupt.
>   d. Once we got the I/O result, copy the result back to user, and
>      unpin the pages.
> 
> Step c could only be done by the physical device driver, since it's it
> that the int_handler belongs to.
> Step b and d should be done by the physical device driver. Or we'd
> pin/unpin pages in the mediated device driver?
> 
> That's why I asked for the new callback.
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-06 17:44               ` [Qemu-devel] " Neo Jia
@ 2016-06-06 19:31                 ` Alex Williamson
  -1 siblings, 0 replies; 92+ messages in thread
From: Alex Williamson @ 2016-06-06 19:31 UTC (permalink / raw)
  To: Neo Jia
  Cc: Dong Jia, Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm,
	kevin.tian, shuai.ruan, jike.song, zhiyuan.lv

On Mon, 6 Jun 2016 10:44:25 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:
> > On Sun, 5 Jun 2016 23:27:42 -0700
> > Neo Jia <cjia@nvidia.com> wrote:
> > 
> > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > This intends to handle an intercepted channel I/O instruction. It
> > basically need to do the following thing:  
> 
> May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> first place?

Yep, this is my question as well.  It sounds a bit like there's an
emulated device in QEMU that's trying to tell the mediated device when
to start an operation when we probably should be passing through
whatever i/o operations indicate that status directly to the mediated
device. Thanks,

Alex

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-06 19:31                 ` [Qemu-devel] " Alex Williamson
@ 2016-06-07  3:03                   ` Tian, Kevin
  -1 siblings, 0 replies; 92+ messages in thread
From: Tian, Kevin @ 2016-06-07  3:03 UTC (permalink / raw)
  To: Alex Williamson, Neo Jia
  Cc: Dong Jia, Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm,
	Ruan, Shuai, Song, Jike, Lv, Zhiyuan

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Tuesday, June 07, 2016 3:31 AM
> 
> On Mon, 6 Jun 2016 10:44:25 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:
> > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > Neo Jia <cjia@nvidia.com> wrote:
> > >
> > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > This intends to handle an intercepted channel I/O instruction. It
> > > basically need to do the following thing:
> >
> > May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> > first place?
> 
> Yep, this is my question as well.  It sounds a bit like there's an
> emulated device in QEMU that's trying to tell the mediated device when
> to start an operation when we probably should be passing through
> whatever i/o operations indicate that status directly to the mediated
> device. Thanks,
> 
> Alex

Below is copied from Dong's earlier post, which makes clear that
a guest cmd submission triggers the whole flow:

----
Explanation:
Q1-Q4: Qemu side process.
K1-K6: Kernel side process.

Q1. Intercept a ssch instruction.
Q2. Translate the guest ccw program to a user space ccw program
    (u_ccwchain).
Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
    K1. Copy from u_ccwchain to kernel (k_ccwchain).
    K2. Translate the user space ccw program to a kernel space ccw
        program, which becomes runnable for a real device.
    K3. With the necessary information contained in the orb passed in
        by Qemu, issue the k_ccwchain to the device, and wait event q
        for the I/O result.
    K4. Interrupt handler gets the I/O result, and wakes up the wait q.
    K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
        update the user space irb.
    K6. Copy irb and scsw back to user space.
Q4. Update the irb for the guest.
----
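
The kernel-side steps K1-K6 above can be sketched in plain C as follows.
This is a user-space model under stated assumptions, not the vfio-ccw
code: the struct layouts, MAX_CCWS, translate_addr() and issue_and_wait()
are hypothetical stand-ins, and memcpy() takes the place of the
copy_from_user()/copy_to_user() calls a real kernel handler would use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_CCWS 8	/* hypothetical limit on the chain length */

/* Simplified stand-ins; the real ccw and irb layouts are not modelled. */
struct ccw_sketch {
	uint8_t  cmd_code;
	uint8_t  flags;
	uint16_t count;
	uint64_t addr;
};

struct irb_sketch {
	int status;
	unsigned int ccws_done;
};

/* K2 stand-in: pinning plus translation, modelled as a fixed offset. */
static uint64_t translate_addr(uint64_t user_addr)
{
	return user_addr + 0x100000ULL;
}

/* K3/K4 stand-in: a real driver would start the I/O and sleep on a wait
 * queue until the interrupt handler wakes it up. */
static int issue_and_wait(const struct ccw_sketch *chain, unsigned int n,
			  struct irb_sketch *irb)
{
	(void)chain;
	irb->status = 0;
	irb->ccws_done = n;
	return 0;
}

static int cmd_request(const struct ccw_sketch *u_ccwchain, unsigned int n,
		       struct irb_sketch *u_irb)
{
	struct ccw_sketch k_ccwchain[MAX_CCWS];
	struct irb_sketch k_irb;
	unsigned int i;
	int ret;

	assert(u_ccwchain != NULL && u_irb != NULL);
	if (n == 0 || n > MAX_CCWS)
		return -1;

	/* K1: copy the user space chain into a kernel buffer. */
	memcpy(k_ccwchain, u_ccwchain, n * sizeof(*u_ccwchain));

	/* K2: rewrite every data address so the real device can use it. */
	for (i = 0; i < n; i++)
		k_ccwchain[i].addr = translate_addr(k_ccwchain[i].addr);

	/* K3/K4: issue the translated chain and wait for the result. */
	ret = issue_and_wait(k_ccwchain, n, &k_irb);
	if (ret)
		return ret;

	/* K5/K6: hand the result back to the caller's irb. */
	memcpy(u_irb, &k_irb, sizeof(k_irb));
	return 0;
}
```

The caller's chain is never modified; only the kernel copy carries the
translated addresses, which is the point of K1.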

My understanding is that this belongs to how the device is mediated
(so it is device driver specific), rather than something to be abstracted
in VFIO, which manages resources but doesn't care how they are used.

Actually, we have the same requirement in the vGPU case: a guest driver
needs to submit GPU commands through some MMIO register. The vGPU device
model will intercept the submission request (in its own way), do the
necessary scan/audit to ensure correctness/security, and then submit
to the physical GPU through a vendor-specific interface.

No difference with channel I/O here.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-07  3:03                   ` [Qemu-devel] " Tian, Kevin
@ 2016-06-07 22:42                     ` Alex Williamson
  -1 siblings, 0 replies; 92+ messages in thread
From: Alex Williamson @ 2016-06-07 22:42 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Neo Jia, Dong Jia, Kirti Wankhede, pbonzini, kraxel, qemu-devel,
	kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Tue, 7 Jun 2016 03:03:32 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, June 07, 2016 3:31 AM
> > 
> > On Mon, 6 Jun 2016 10:44:25 -0700
> > Neo Jia <cjia@nvidia.com> wrote:
> >   
> > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:  
> > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > Neo Jia <cjia@nvidia.com> wrote:
> > > >
> > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > This intends to handle an intercepted channel I/O instruction. It
> > > > basically need to do the following thing:  
> > >
> > > May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> > > first place?  
> > 
> > Yep, this is my question as well.  It sounds a bit like there's an
> > emulated device in QEMU that's trying to tell the mediated device when
> > to start an operation when we probably should be passing through
> > whatever i/o operations indicate that status directly to the mediated
> > device. Thanks,
> > 
> > Alex  
> 
> Below is copied from Dong's earlier post which said clear that
> a guest cmd submission will trigger the whole flow:
> 
> ----
> Explanation:
> Q1-Q4: Qemu side process.
> K1-K6: Kernel side process.
> 
> Q1. Intercept a ssch instruction.
> Q2. Translate the guest ccw program to a user space ccw program
>     (u_ccwchain).
> Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
>     K1. Copy from u_ccwchain to kernel (k_ccwchain).
>     K2. Translate the user space ccw program to a kernel space ccw
>         program, which becomes runnable for a real device.
>     K3. With the necessary information contained in the orb passed in
>         by Qemu, issue the k_ccwchain to the device, and wait event q
>         for the I/O result.
>     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
>     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
>         update the user space irb.
>     K6. Copy irb and scsw back to user space.
> Q4. Update the irb for the guest.
> ----

Right, but this was the pre-mediated device approach, now we no longer
need step Q2 so we really only need Q1 and therefore Q3 to exist in
QEMU if those are operations that are not visible to the mediated
device; which they very well might be, since it's described as an
instruction rather than an i/o operation.  It's not terrible if that's
the case, vfio-pci has its own ioctl for doing a hot reset.
 
> My understanding is that such thing belongs to how device is mediated
> (so device driver specific), instead of something to be abstracted in 
> VFIO which manages resource but doesn't care how resource is used.
> 
> Actually we have same requirement in vGPU case, that a guest driver 
> needs submit GPU commands through some MMIO register. vGPU device 
> model will intercept the submission request (in its own way), do its 
> necessary scan/audit to ensure correctness/security, and then submit 
> to physical GPU through vendor specific interface. 
> 
> No difference with channel I/O here.

Well, if the GPU command is submitted through an MMIO register, is that
MMIO register part of the mediated device?  If so, could the mediated
device recognize the command and do the scan/audit itself?  QEMU must
not be the point at which mediation occurs for security purposes, QEMU
is userspace and userspace is not to be trusted.  I'm still open to
ioctls where it makes sense; as above, we already have PCI-specific ioctls,
but we need to evaluate each one, why it needs to exist, and
whether we can skip it if the mediated device can trigger the action on
its own.  After all, that's why we're using the vfio api, so we can
re-use much of the existing infrastructure, especially for a vGPU that
exposes itself as a PCI device.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 92+ messages in thread

* RE: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-07 22:42                     ` [Qemu-devel] " Alex Williamson
@ 2016-06-08  1:18                       ` Tian, Kevin
  -1 siblings, 0 replies; 92+ messages in thread
From: Tian, Kevin @ 2016-06-08  1:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Neo Jia, Dong Jia, Kirti Wankhede, pbonzini, kraxel, qemu-devel,
	kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, June 08, 2016 6:42 AM
> 
> On Tue, 7 Jun 2016 03:03:32 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Tuesday, June 07, 2016 3:31 AM
> > >
> > > On Mon, 6 Jun 2016 10:44:25 -0700
> > > Neo Jia <cjia@nvidia.com> wrote:
> > >
> > > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:
> > > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > >
> > > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > > This intends to handle an intercepted channel I/O instruction. It
> > > > > basically need to do the following thing:
> > > >
> > > > May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> > > > first place?
> > >
> > > Yep, this is my question as well.  It sounds a bit like there's an
> > > emulated device in QEMU that's trying to tell the mediated device when
> > > to start an operation when we probably should be passing through
> > > whatever i/o operations indicate that status directly to the mediated
> > > device. Thanks,
> > >
> > > Alex
> >
> > Below is copied from Dong's earlier post which said clear that
> > a guest cmd submission will trigger the whole flow:
> >
> > ----
> > Explanation:
> > Q1-Q4: Qemu side process.
> > K1-K6: Kernel side process.
> >
> > Q1. Intercept a ssch instruction.
> > Q2. Translate the guest ccw program to a user space ccw program
> >     (u_ccwchain).
> > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> >     K2. Translate the user space ccw program to a kernel space ccw
> >         program, which becomes runnable for a real device.
> >     K3. With the necessary information contained in the orb passed in
> >         by Qemu, issue the k_ccwchain to the device, and wait event q
> >         for the I/O result.
> >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> >         update the user space irb.
> >     K6. Copy irb and scsw back to user space.
> > Q4. Update the irb for the guest.
> > ----
> 
> Right, but this was the pre-mediated device approach, now we no longer
> need step Q2 so we really only need Q1 and therefore Q3 to exist in
> QEMU if those are operations that are not visible to the mediated
> device; which they very well might be, since it's described as an
> instruction rather than an i/o operation.  It's not terrible if that's
> the case, vfio-pci has its own ioctl for doing a hot reset.



> 
> > My understanding is that such thing belongs to how device is mediated
> > (so device driver specific), instead of something to be abstracted in
> > VFIO which manages resource but doesn't care how resource is used.
> >
> > Actually we have same requirement in vGPU case, that a guest driver
> > needs submit GPU commands through some MMIO register. vGPU device
> > model will intercept the submission request (in its own way), do its
> > necessary scan/audit to ensure correctness/security, and then submit
> > to physical GPU through vendor specific interface.
> >
> > No difference with channel I/O here.
> 
> Well, if the GPU command is submitted through an MMIO register, is that
> MMIO register part of the mediated device?  If so, could the mediated
> device recognize the command and do the scan/audit itself?  QEMU must
> not be the point at which mediation occurs for security purposes, QEMU
> is userspace and userspace is not to be trusted.  I'm still open to
> ioctls where it makes sense, as above, we have PCI specific ioctls and
> already, but we need to evaluate each one, why it needs to exist, and
> whether we can skip it if the mediated device can trigger the action on
> its own.  After all, that's why we're using the vfio api, so we can
> re-use much of the existing infrastructure, especially for a vGPU that
> exposes itself as a PCI device.  Thanks,
> 

My point is that a guest submission on vGPU is just a normal trapped
register write, which is forwarded from Qemu to VFIO through the pwrite
interface and then hits the mediated vGPU device. The mediated device
will recognize this register write as a submission request, do the
necessary scan (it looks like we are saying the same thing), and then
submit to the physical device driver. If loading ccw cmds on channel I/O
also goes through some I/O registers, it can be implemented the same way
w/o introducing a new ioctl. The r/w handler of the mediated device can
figure out whether it's a ccw submission or not. But my understanding
might be wrong here.
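
To make the above concrete, here is a rough user-space sketch of a write
handler that recognizes a submission by register offset. The register
layout, REG_SUBMIT, audit_command() and the structure are all
hypothetical; a real vendor driver would define its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REG_SPACE_SIZE 0x100u	/* hypothetical size of the register region */
#define REG_SUBMIT     0x40u	/* hypothetical submission register offset */

struct vdev_sketch {
	uint8_t regs[REG_SPACE_SIZE];
	unsigned int submissions;
	uint32_t last_submitted;
};

/* Hypothetical audit policy: reject an all-zero command word. */
static int audit_command(uint32_t cmd)
{
	return cmd != 0 ? 0 : -1;
}

/* Write handler of the mediated device, reached through pwrite on the
 * VFIO device fd. Returns the number of bytes written, or -1 on error. */
static long vdev_write(struct vdev_sketch *vdev, const void *buf,
		       size_t count, uint64_t pos)
{
	uint32_t val;

	assert(vdev != NULL && buf != NULL);
	if (pos >= REG_SPACE_SIZE || count > REG_SPACE_SIZE - pos)
		return -1;

	/* A 4-byte write to REG_SUBMIT is treated as a submission request:
	 * audit it in the kernel before anything reaches real hardware. */
	if (pos == REG_SUBMIT && count == sizeof(val)) {
		memcpy(&val, buf, sizeof(val));
		if (audit_command(val))
			return -1;
		vdev->submissions++;
		vdev->last_submitted = val;
	}

	/* Every accepted write also updates the virtual register state. */
	memcpy(&vdev->regs[pos], buf, count);
	return (long)count;
}
```

No new ioctl is involved here: the audit happens in the kernel-side
handler, so userspace does not have to be trusted to request it.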

Thanks
Kevin

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-08  1:18                       ` [Qemu-devel] " Tian, Kevin
@ 2016-06-08  1:39                         ` Alex Williamson
  -1 siblings, 0 replies; 92+ messages in thread
From: Alex Williamson @ 2016-06-08  1:39 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Neo Jia, Dong Jia, Kirti Wankhede, pbonzini, kraxel, qemu-devel,
	kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Wed, 8 Jun 2016 01:18:42 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, June 08, 2016 6:42 AM
> > 
> > On Tue, 7 Jun 2016 03:03:32 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Tuesday, June 07, 2016 3:31 AM
> > > >
> > > > On Mon, 6 Jun 2016 10:44:25 -0700
> > > > Neo Jia <cjia@nvidia.com> wrote:
> > > >  
> > > > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:  
> > > > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > >
> > > > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > > > This intends to handle an intercepted channel I/O instruction. It
> > > > > > basically need to do the following thing:  
> > > > >
> > > > > May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> > > > > first place?  
> > > >
> > > > Yep, this is my question as well.  It sounds a bit like there's an
> > > > emulated device in QEMU that's trying to tell the mediated device when
> > > > to start an operation when we probably should be passing through
> > > > whatever i/o operations indicate that status directly to the mediated
> > > > device. Thanks,
> > > >
> > > > Alex  
> > >
> > > Below is copied from Dong's earlier post which said clear that
> > > a guest cmd submission will trigger the whole flow:
> > >
> > > ----
> > > Explanation:
> > > Q1-Q4: Qemu side process.
> > > K1-K6: Kernel side process.
> > >
> > > Q1. Intercept a ssch instruction.
> > > Q2. Translate the guest ccw program to a user space ccw program
> > >     (u_ccwchain).
> > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > >     K2. Translate the user space ccw program to a kernel space ccw
> > >         program, which becomes runnable for a real device.
> > >     K3. With the necessary information contained in the orb passed in
> > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > >         for the I/O result.
> > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > >         update the user space irb.
> > >     K6. Copy irb and scsw back to user space.
> > > Q4. Update the irb for the guest.
> > > ----  
> > 
> > Right, but this was the pre-mediated device approach, now we no longer
> > need step Q2 so we really only need Q1 and therefore Q3 to exist in
> > QEMU if those are operations that are not visible to the mediated
> > device; which they very well might be, since it's described as an
> > instruction rather than an i/o operation.  It's not terrible if that's
> > the case, vfio-pci has its own ioctl for doing a hot reset.  
> 
> 
> 
> >   
> > > My understanding is that such thing belongs to how device is mediated
> > > (so device driver specific), instead of something to be abstracted in
> > > VFIO which manages resource but doesn't care how resource is used.
> > >
> > > Actually we have same requirement in vGPU case, that a guest driver
> > > needs submit GPU commands through some MMIO register. vGPU device
> > > model will intercept the submission request (in its own way), do its
> > > necessary scan/audit to ensure correctness/security, and then submit
> > > to physical GPU through vendor specific interface.
> > >
> > > No difference with channel I/O here.  
> > 
> > Well, if the GPU command is submitted through an MMIO register, is that
> > MMIO register part of the mediated device?  If so, could the mediated
> > device recognize the command and do the scan/audit itself?  QEMU must
> > not be the point at which mediation occurs for security purposes, QEMU
> > is userspace and userspace is not to be trusted.  I'm still open to
> > ioctls where it makes sense, as above, we have PCI specific ioctls and
> > already, but we need to evaluate each one, why it needs to exist, and
> > whether we can skip it if the mediated device can trigger the action on
> > its own.  After all, that's why we're using the vfio api, so we can
> > re-use much of the existing infrastructure, especially for a vGPU that
> > exposes itself as a PCI device.  Thanks,
> >   
> 
> My point is that a guest submission on vGPU is just a normal trapped 
> register write, which is forwarded from Qemu to VFIO through pwrite 
> interface and then hit mediated vGPU device. The mediated device
> will recognize this register write as a submission request and then do
> necessary scan (looks we are saying same thing) and then submit to
> physical device driver. If loading ccw cmds on channel i/o are also 
> through some I/O registers, it can be implemented same way w/o
> introducing new ioctl. The r/w handler of mediated device can figure
> out whether it's a ccw submission or not. But my understanding might 
> be wrong here.

I think we're in violent agreement ;)
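
For reference, the trapped-write path described above can be sketched as
a small standalone C model. This is only an illustration under stated
assumptions, not the mdev API from this series: the names mdev_sketch,
mdev_sketch_write, audit_cmd, the register offset and the audit rule are
all hypothetical. It shows a write handler that treats one offset as a
submission doorbell, audits the command in the mediated device itself,
and only then counts it as submitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MDEV_SKETCH_BAR_SIZE   0x100u
#define MDEV_SKETCH_SUBMIT_REG 0x40u  /* hypothetical doorbell offset */

/* Hypothetical per-device state for the model. */
struct mdev_sketch {
    uint8_t  regs[MDEV_SKETCH_BAR_SIZE]; /* emulated register file */
    unsigned submitted;                  /* commands passed the audit */
    unsigned rejected;                   /* commands refused by the audit */
};

/* Placeholder audit rule: accept only commands whose top byte is zero. */
int audit_cmd(uint32_t cmd)
{
    return (cmd & 0xff000000u) == 0;
}

/*
 * Models the write handler reached via pwrite() on the device fd.
 * Returns the number of bytes written, or -1 on a bad range or a
 * rejected submission.
 */
long mdev_sketch_write(struct mdev_sketch *m, const void *buf,
                       size_t count, uint64_t pos)
{
    assert(m != NULL && buf != NULL);

    if (pos >= MDEV_SKETCH_BAR_SIZE || count > MDEV_SKETCH_BAR_SIZE - pos)
        return -1;

    /* A 32-bit write to the doorbell is recognized as a submission. */
    if (pos == MDEV_SKETCH_SUBMIT_REG && count == sizeof(uint32_t)) {
        uint32_t cmd;

        memcpy(&cmd, buf, sizeof(cmd));
        if (!audit_cmd(cmd)) {
            m->rejected++;
            return -1;
        }
        m->submitted++; /* a real driver would hand off to hardware here */
    }

    /* Every accepted write also updates the emulated register file. */
    memcpy(&m->regs[pos], buf, count);
    return (long)count;
}
```

The point of the model is that userspace only forwards the write; the
decision about whether it is a submission, and whether it is safe, stays
in the mediated device.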

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-08  1:39                         ` [Qemu-devel] " Alex Williamson
@ 2016-06-08  3:18                           ` Dong Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Dong Jia @ 2016-06-08  3:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Neo Jia, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan, Dong Jia

On Tue, 7 Jun 2016 19:39:21 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 8 Jun 2016 01:18:42 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Wednesday, June 08, 2016 6:42 AM
> > > 
> > > On Tue, 7 Jun 2016 03:03:32 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >   
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Tuesday, June 07, 2016 3:31 AM
> > > > >
> > > > > On Mon, 6 Jun 2016 10:44:25 -0700
> > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > >  
> > > > > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:  
> > > > > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > > >
> > > > > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > > > > This intends to handle an intercepted channel I/O instruction. It
> > > > > > > basically need to do the following thing:  
> > > > > >
> > > > > > May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> > > > > > first place?  
> > > > >
> > > > > Yep, this is my question as well.  It sounds a bit like there's an
> > > > > emulated device in QEMU that's trying to tell the mediated device when
> > > > > to start an operation when we probably should be passing through
> > > > > whatever i/o operations indicate that status directly to the mediated
> > > > > device. Thanks,
> > > > >
> > > > > Alex  
> > > >
> > > > Below is copied from Dong's earlier post which said clear that
> > > > a guest cmd submission will trigger the whole flow:
> > > >
> > > > ----
> > > > Explanation:
> > > > Q1-Q4: Qemu side process.
> > > > K1-K6: Kernel side process.
> > > >
> > > > Q1. Intercept a ssch instruction.
> > > > Q2. Translate the guest ccw program to a user space ccw program
> > > >     (u_ccwchain).
> > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > >         program, which becomes runnable for a real device.
> > > >     K3. With the necessary information contained in the orb passed in
> > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > >         for the I/O result.
> > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > >         update the user space irb.
> > > >     K6. Copy irb and scsw back to user space.
> > > > Q4. Update the irb for the guest.
> > > > ----  
> > > 
> > > Right, but this was the pre-mediated device approach, now we no longer
> > > need step Q2 so we really only need Q1 and therefore Q3 to exist in
> > > QEMU if those are operations that are not visible to the mediated
> > > device; which they very well might be, since it's described as an
> > > instruction rather than an i/o operation.  It's not terrible if that's
> > > the case, vfio-pci has its own ioctl for doing a hot reset.  
Dear Alex, Kevin and Neo,

'ssch' is a privileged I/O instruction, which should be finally issued
to the dedicated subchannel of the physical device.

BTW, I did remove step Q2 with all of the user-space translation code,
according to your comments in another thread.

> > 
> > 
> > >   
> > > > My understanding is that such thing belongs to how device is mediated
> > > > (so device driver specific), instead of something to be abstracted in
> > > > VFIO which manages resource but doesn't care how resource is used.
> > > >
> > > > Actually we have same requirement in vGPU case, that a guest driver
> > > > needs submit GPU commands through some MMIO register. vGPU device
> > > > model will intercept the submission request (in its own way), do its
> > > > necessary scan/audit to ensure correctness/security, and then submit
> > > > to physical GPU through vendor specific interface.
> > > >
> > > > No difference with channel I/O here.  
> > > 
> > > Well, if the GPU command is submitted through an MMIO register, is that
> > > MMIO register part of the mediated device?  If so, could the mediated
> > > device recognize the command and do the scan/audit itself?  QEMU must
> > > not be the point at which mediation occurs for security purposes, QEMU
> > > is userspace and userspace is not to be trusted.  I'm still open to
> > > ioctls where it makes sense, as above, we have PCI specific ioctls and
> > > already, but we need to evaluate each one, why it needs to exist, and
> > > whether we can skip it if the mediated device can trigger the action on
> > > its own.  After all, that's why we're using the vfio api, so we can
> > > re-use much of the existing infrastructure, especially for a vGPU that
> > > exposes itself as a PCI device.  Thanks,
> > >   
> > 
> > My point is that a guest submission on vGPU is just a normal trapped 
> > register write, which is forwarded from Qemu to VFIO through pwrite 
> > interface and then hit mediated vGPU device. The mediated device
> > will recognize this register write as a submission request and then do
> > necessary scan (looks we are saying same thing) and then submit to
> > physical device driver. If loading ccw cmds on channel i/o are also 
> > through some I/O registers, it can be implemented same way w/o
> > introducing new ioctl.
We are different here. The target of an I/O instruction is the
subchannel. CCW devices don't have this kind of register. The mediated
ccw device cannot recognize such a submission by its own capabilities.

Neither the physical nor the mediated CCW device has such registers to
sense or recognize the submission request. It's the CPU that recognizes
the submission by intercepting the guest ssch instruction.

The CPU cannot tell whether the instruction was issued by a
passed-through device driver or a virtio device driver in the guest. So
it has to exit to QEMU and let QEMU take over.

Once QEMU identifies that the target subchannel is serving a
passed-through device, it uses the ioctl to pass the instruction
parameters into the kernel, through the mediated driver and the physical
driver, down to the subchannel, which performs the I/O operation.
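
To make the flow above concrete, here is a minimal C sketch of the
QEMU-side decision, written as a standalone model rather than actual
QEMU or kernel code. All names (subch_sketch, route_ssch,
fake_cmd_request, and the simplified orb/irb layouts) are hypothetical,
and the cmd_request callback merely stands in for issuing
VFIO_DEVICE_CCW_CMD_REQUEST on the device fd.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical backend kinds for a guest-visible subchannel. */
enum subch_backend { SUBCH_EMULATED, SUBCH_PASSTHROUGH };

/* Hypothetical result codes for the routing decision. */
enum ssch_route {
    SSCH_FAILED    = -2, /* forwarding to the kernel failed */
    SSCH_NO_SUBCH  = -1, /* no such subchannel */
    SSCH_EMULATED  = 0,  /* handled by an emulated device in userspace */
    SSCH_FORWARDED = 1   /* passed down via the (modelled) ioctl */
};

/* Simplified stand-ins for the real s390 orb and irb structures. */
struct orb_sketch { uint32_t intparm; uint32_t cpa; };
struct irb_sketch { uint32_t intparm; uint32_t status; };

struct subch_sketch {
    uint16_t schid;
    enum subch_backend backend;
    /* Stand-in for the ioctl issued on the vfio device fd. */
    int (*cmd_request)(const struct orb_sketch *orb,
                       struct irb_sketch *irb);
};

/* Fake "kernel side": echoes intparm back and reports success. */
int fake_cmd_request(const struct orb_sketch *orb, struct irb_sketch *irb)
{
    irb->intparm = orb->intparm;
    irb->status = 0;
    return 0;
}

/* Called after the CPU intercepted a guest ssch and exited to userspace. */
int route_ssch(struct subch_sketch *table, size_t n, uint16_t schid,
               const struct orb_sketch *orb, struct irb_sketch *irb)
{
    assert(table != NULL && orb != NULL && irb != NULL);

    for (size_t i = 0; i < n; i++) {
        if (table[i].schid != schid)
            continue;

        if (table[i].backend == SUBCH_PASSTHROUGH) {
            /* Only userspace knows this subchannel is passed through. */
            if (table[i].cmd_request == NULL ||
                table[i].cmd_request(orb, irb) != 0)
                return SSCH_FAILED;
            return SSCH_FORWARDED;
        }

        /* Placeholder for the emulated (e.g. virtio-ccw) path. */
        irb->intparm = orb->intparm;
        irb->status = 0;
        return SSCH_EMULATED;
    }
    return SSCH_NO_SUBCH;
}
```

The lookup by subchannel id is the step that has no equivalent on the
device side, since there is no register write for the mediated device to
observe.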

> > The r/w handler of mediated device can figure
> > out whether it's a ccw submission or not. But my understanding might 
> > be wrong here.
We don't have registers to sense an instruction or operation.

> 
> I think we're in violent agreement ;)
> 

--------
Dong Jia


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-08  3:18                           ` [Qemu-devel] " Dong Jia
@ 2016-06-08  3:48                             ` Neo Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Neo Jia @ 2016-06-08  3:48 UTC (permalink / raw)
  To: Dong Jia
  Cc: Alex Williamson, Tian, Kevin, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Wed, Jun 08, 2016 at 11:18:42AM +0800, Dong Jia wrote:
> On Tue, 7 Jun 2016 19:39:21 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Wed, 8 Jun 2016 01:18:42 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > 
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Wednesday, June 08, 2016 6:42 AM
> > > > 
> > > > On Tue, 7 Jun 2016 03:03:32 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >   
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Tuesday, June 07, 2016 3:31 AM
> > > > > >
> > > > > > On Mon, 6 Jun 2016 10:44:25 -0700
> > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > >  
> > > > > > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:  
> > > > > > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > > > > > This intends to handle an intercepted channel I/O instruction. It
> > > > > > > > basically need to do the following thing:  
> > > > > > >
> > > > > > > May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> > > > > > > first place?  
> > > > > >
> > > > > > Yep, this is my question as well.  It sounds a bit like there's an
> > > > > > emulated device in QEMU that's trying to tell the mediated device when
> > > > > > to start an operation when we probably should be passing through
> > > > > > whatever i/o operations indicate that status directly to the mediated
> > > > > > device. Thanks,
> > > > > >
> > > > > > Alex  
> > > > >
> > > > > Below is copied from Dong's earlier post which said clear that
> > > > > a guest cmd submission will trigger the whole flow:
> > > > >
> > > > > ----
> > > > > Explanation:
> > > > > Q1-Q4: Qemu side process.
> > > > > K1-K6: Kernel side process.
> > > > >
> > > > > Q1. Intercept a ssch instruction.
> > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > >     (u_ccwchain).
> > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > > >         program, which becomes runnable for a real device.
> > > > >     K3. With the necessary information contained in the orb passed in
> > > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > >         for the I/O result.
> > > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > >         update the user space irb.
> > > > >     K6. Copy irb and scsw back to user space.
> > > > > Q4. Update the irb for the guest.
> > > > > ----  
> > > > 
> > > > Right, but this was the pre-mediated device approach, now we no longer
> > > > need step Q2 so we really only need Q1 and therefore Q3 to exist in
> > > > QEMU if those are operations that are not visible to the mediated
> > > > device; which they very well might be, since it's described as an
> > > > instruction rather than an i/o operation.  It's not terrible if that's
> > > > the case, vfio-pci has its own ioctl for doing a hot reset.  
> Dear Alex, Kevin and Neo,
> 
> 'ssch' is a privileged I/O instruction, which should be finally issued
> to the dedicated subchannel of the physical device.
> 
> BTW, I did remove step Q2 with all of the user-space translation code,
> according to your comments in another thread.
> 
> > > 
> > > 
> > > >   
> > > > > My understanding is that such thing belongs to how device is mediated
> > > > > (so device driver specific), instead of something to be abstracted in
> > > > > VFIO which manages resource but doesn't care how resource is used.
> > > > >
> > > > > Actually we have same requirement in vGPU case, that a guest driver
> > > > > needs submit GPU commands through some MMIO register. vGPU device
> > > > > model will intercept the submission request (in its own way), do its
> > > > > necessary scan/audit to ensure correctness/security, and then submit
> > > > > to physical GPU through vendor specific interface.
> > > > >
> > > > > No difference with channel I/O here.  
> > > > 
> > > > Well, if the GPU command is submitted through an MMIO register, is that
> > > > MMIO register part of the mediated device?  If so, could the mediated
> > > > device recognize the command and do the scan/audit itself?  QEMU must
> > > > not be the point at which mediation occurs for security purposes, QEMU
> > > > is userspace and userspace is not to be trusted.  I'm still open to
> > > > ioctls where it makes sense, as above, we have PCI specific ioctls and
> > > > already, but we need to evaluate each one, why it needs to exist, and
> > > > whether we can skip it if the mediated device can trigger the action on
> > > > its own.  After all, that's why we're using the vfio api, so we can
> > > > re-use much of the existing infrastructure, especially for a vGPU that
> > > > exposes itself as a PCI device.  Thanks,
> > > >   
> > > 
> > > My point is that a guest submission on vGPU is just a normal trapped 
> > > register write, which is forwarded from Qemu to VFIO through pwrite 
> > > interface and then hit mediated vGPU device. The mediated device
> > > will recognize this register write as a submission request and then do
> > > necessary scan (looks we are saying same thing) and then submit to
> > > physical device driver. If loading ccw cmds on channel i/o are also 
> > > through some I/O registers, it can be implemented same way w/o
> > > introducing new ioctl.
> We are different here. The target of an I/O instruction is the
> subchannel. CCW devices don't have this kind of register. The mediated
> ccw device cannot recognize such a submission by its own capabilities.
> 
> Neither the physical nor the mediated CCW device has such registers to
> sense or recognize the submission request. It's the CPU that recognizes
> the submission by intercepting the guest ssch instruction.
> 
> The CPU cannot tell whether the instruction was issued by a
> passed-through device driver or a virtio device driver in the guest. So
> it has to exit to QEMU and let QEMU take over.

Hi Dong,

What actually triggered the VM_EXIT to QEMU for that vCPU? Is it an MMIO
access to the "virtual device" inside the guest?

Thanks,
Neo

> 
> Once QEMU identifies that the target subchannel is serving a
> passed-through device, it uses the ioctl to pass the instruction
> parameters into the kernel, through the mediated driver and the physical
> driver, down to the subchannel, which performs the I/O operation.
> 
> > > The r/w handler of mediated device can figure
> > > out whether it's a ccw submission or not. But my understanding might 
> > > be wrong here.
> We don't have registers to sense an instruction or operation.
> 
> > 
> > I think we're in violent agreement ;)
> > 
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver
@ 2016-06-08  3:48                             ` Neo Jia
  0 siblings, 0 replies; 92+ messages in thread
From: Neo Jia @ 2016-06-08  3:48 UTC (permalink / raw)
  To: Dong Jia
  Cc: Alex Williamson, Tian, Kevin, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Wed, Jun 08, 2016 at 11:18:42AM +0800, Dong Jia wrote:
> On Tue, 7 Jun 2016 19:39:21 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Wed, 8 Jun 2016 01:18:42 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > 
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Wednesday, June 08, 2016 6:42 AM
> > > > 
> > > > On Tue, 7 Jun 2016 03:03:32 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >   
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Tuesday, June 07, 2016 3:31 AM
> > > > > >
> > > > > > On Mon, 6 Jun 2016 10:44:25 -0700
> > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > >  
> > > > > > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:  
> > > > > > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > > > > > This intends to handle an intercepted channel I/O instruction. It
> > > > > > > > basically need to do the following thing:  
> > > > > > >
> > > > > > > May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> > > > > > > first place?  
> > > > > >
> > > > > > Yep, this is my question as well.  It sounds a bit like there's an
> > > > > > emulated device in QEMU that's trying to tell the mediated device when
> > > > > > to start an operation when we probably should be passing through
> > > > > > whatever i/o operations indicate that status directly to the mediated
> > > > > > device. Thanks,
> > > > > >
> > > > > > Alex  
> > > > >
> > > > > Below is copied from Dong's earlier post which said clear that
> > > > > a guest cmd submission will trigger the whole flow:
> > > > >
> > > > > ----
> > > > > Explanation:
> > > > > Q1-Q4: Qemu side process.
> > > > > K1-K6: Kernel side process.
> > > > >
> > > > > Q1. Intercept a ssch instruction.
> > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > >     (u_ccwchain).
> > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > > >         program, which becomes runnable for a real device.
> > > > >     K3. With the necessary information contained in the orb passed in
> > > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > >         for the I/O result.
> > > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > >         update the user space irb.
> > > > >     K6. Copy irb and scsw back to user space.
> > > > > Q4. Update the irb for the guest.
> > > > > ----  
> > > > 
> > > > Right, but this was the pre-mediated device approach, now we no longer
> > > > need step Q2 so we really only need Q1 and therefore Q3 to exist in
> > > > QEMU if those are operations that are not visible to the mediated
> > > > device; which they very well might be, since it's described as an
> > > > instruction rather than an i/o operation.  It's not terrible if that's
> > > > the case, vfio-pci has its own ioctl for doing a hot reset.  
> Dear Alex, Kevin and Neo,
> 
> 'ssch' is a privileged I/O instruction, which should be finally issued
> to the dedicated subchannel of the physical device.
> 
> BTW, I did remove step Q2 with all of the user-space translation code,
> according to your comments in another thread.
> 
> > > 
> > > 
> > > >   
> > > > > My understanding is that such thing belongs to how device is mediated
> > > > > (so device driver specific), instead of something to be abstracted in
> > > > > VFIO which manages resource but doesn't care how resource is used.
> > > > >
> > > > > Actually we have same requirement in vGPU case, that a guest driver
> > > > > needs submit GPU commands through some MMIO register. vGPU device
> > > > > model will intercept the submission request (in its own way), do its
> > > > > necessary scan/audit to ensure correctness/security, and then submit
> > > > > to physical GPU through vendor specific interface.
> > > > >
> > > > > No difference with channel I/O here.  
> > > > 
> > > > Well, if the GPU command is submitted through an MMIO register, is that
> > > > MMIO register part of the mediated device?  If so, could the mediated
> > > > device recognize the command and do the scan/audit itself?  QEMU must
> > > > not be the point at which mediation occurs for security purposes, QEMU
> > > > is userspace and userspace is not to be trusted.  I'm still open to
> > > > ioctls where it makes sense, as above, we have PCI specific ioctls and
> > > > already, but we need to evaluate each one, why it needs to exist, and
> > > > whether we can skip it if the mediated device can trigger the action on
> > > > its own.  After all, that's why we're using the vfio api, so we can
> > > > re-use much of the existing infrastructure, especially for a vGPU that
> > > > exposes itself as a PCI device.  Thanks,
> > > >   
> > > 
> > > My point is that a guest submission on vGPU is just a normal trapped 
> > > register write, which is forwarded from Qemu to VFIO through pwrite 
> > > interface and then hit mediated vGPU device. The mediated device
> > > will recognize this register write as a submission request and then do
> > > necessary scan (looks we are saying same thing) and then submit to
> > > physical device driver. If loading ccw cmds on channel i/o are also 
> > > through some I/O registers, it can be implemented same way w/o
> > > introducing new ioctl.
> We are different here. The target of an I/O instruction is the
> subchannel. CCW devices don't have these kind of registers. The mediated
> ccw device can not recognize such an submission by its own capbilities. 
> 
> A CCW device does not have such registers in both the physical and the
> mediated devices to sense or recognize the submission request. It's the
> CPU that recognizes the submission by intercepting the guest ssch
> instruction.
> 
> CPU can not tell if it is issued from a passed thru device driver or a
> virtio device driver from the guest. So it has to exit to QEMU, and let
> QEMU take over.

Hi Dong,

What actually triggered the VM_EXIT to QEMU for that vCPU? Is it an MMIO
access to the "virtual device" inside the guest?

Thanks,
Neo

> 
> Once QEMU identifies the target subchannel is serving a passed thru
> device, it uses the ioctl to pass the instruction parameters into the
> kernel all the way along the mediated driver to the physical driver to
> the subchannel to perform the I/O operation.
> 
> > > The r/w handler of mediated device can figure
> > > out whether it's a ccw submission or not. But my understanding might 
> > > be wrong here.
> We don't have registers to sense an instruction or operation.
> 
> > 
> > I think we're in violent agreement ;)
> > 
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-08  3:18                           ` [Qemu-devel] " Dong Jia
@ 2016-06-08  4:29                             ` Alex Williamson
  -1 siblings, 0 replies; 92+ messages in thread
From: Alex Williamson @ 2016-06-08  4:29 UTC (permalink / raw)
  To: Dong Jia
  Cc: Tian, Kevin, Neo Jia, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Wed, 8 Jun 2016 11:18:42 +0800
Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:

> On Tue, 7 Jun 2016 19:39:21 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Wed, 8 Jun 2016 01:18:42 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Wednesday, June 08, 2016 6:42 AM
> > > > 
> > > > On Tue, 7 Jun 2016 03:03:32 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >     
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Tuesday, June 07, 2016 3:31 AM
> > > > > >
> > > > > > On Mon, 6 Jun 2016 10:44:25 -0700
> > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > >    
> > > > > > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:    
> > > > > > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > > > > > This intends to handle an intercepted channel I/O instruction. It
> > > > > > > > basically need to do the following thing:    
> > > > > > >
> > > > > > > May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> > > > > > > first place?    
> > > > > >
> > > > > > Yep, this is my question as well.  It sounds a bit like there's an
> > > > > > emulated device in QEMU that's trying to tell the mediated device when
> > > > > > to start an operation when we probably should be passing through
> > > > > > whatever i/o operations indicate that status directly to the mediated
> > > > > > device. Thanks,
> > > > > >
> > > > > > Alex    
> > > > >
> > > > > Below is copied from Dong's earlier post which said clear that
> > > > > a guest cmd submission will trigger the whole flow:
> > > > >
> > > > > ----
> > > > > Explanation:
> > > > > Q1-Q4: Qemu side process.
> > > > > K1-K6: Kernel side process.
> > > > >
> > > > > Q1. Intercept a ssch instruction.
> > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > >     (u_ccwchain).
> > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > > >         program, which becomes runnable for a real device.
> > > > >     K3. With the necessary information contained in the orb passed in
> > > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > >         for the I/O result.
> > > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > >         update the user space irb.
> > > > >     K6. Copy irb and scsw back to user space.
> > > > > Q4. Update the irb for the guest.
> > > > > ----    
> > > > 
> > > > Right, but this was the pre-mediated device approach, now we no longer
> > > > need step Q2 so we really only need Q1 and therefore Q3 to exist in
> > > > QEMU if those are operations that are not visible to the mediated
> > > > device; which they very well might be, since it's described as an
> > > > instruction rather than an i/o operation.  It's not terrible if that's
> > > > the case, vfio-pci has its own ioctl for doing a hot reset.    
> Dear Alex, Kevin and Neo,
> 
> 'ssch' is a privileged I/O instruction, which should be finally issued
> to the dedicated subchannel of the physical device.
> 
> BTW, I did remove step Q2 with all of the user-space translation code,
> according to your comments in another thread.
> 
> > > 
> > >   
> > > >     
> > > > > My understanding is that such thing belongs to how device is mediated
> > > > > (so device driver specific), instead of something to be abstracted in
> > > > > VFIO which manages resource but doesn't care how resource is used.
> > > > >
> > > > > Actually we have same requirement in vGPU case, that a guest driver
> > > > > needs submit GPU commands through some MMIO register. vGPU device
> > > > > model will intercept the submission request (in its own way), do its
> > > > > necessary scan/audit to ensure correctness/security, and then submit
> > > > > to physical GPU through vendor specific interface.
> > > > >
> > > > > No difference with channel I/O here.    
> > > > 
> > > > Well, if the GPU command is submitted through an MMIO register, is that
> > > > MMIO register part of the mediated device?  If so, could the mediated
> > > > device recognize the command and do the scan/audit itself?  QEMU must
> > > > not be the point at which mediation occurs for security purposes, QEMU
> > > > is userspace and userspace is not to be trusted.  I'm still open to
> > > > ioctls where it makes sense, as above, we have PCI specific ioctls and
> > > > already, but we need to evaluate each one, why it needs to exist, and
> > > > whether we can skip it if the mediated device can trigger the action on
> > > > its own.  After all, that's why we're using the vfio api, so we can
> > > > re-use much of the existing infrastructure, especially for a vGPU that
> > > > exposes itself as a PCI device.  Thanks,
> > > >     
> > > 
> > > My point is that a guest submission on vGPU is just a normal trapped 
> > > register write, which is forwarded from Qemu to VFIO through pwrite 
> > > interface and then hit mediated vGPU device. The mediated device
> > > will recognize this register write as a submission request and then do
> > > necessary scan (looks we are saying same thing) and then submit to
> > > physical device driver. If loading ccw cmds on channel i/o are also 
> > > through some I/O registers, it can be implemented same way w/o
> > > introducing new ioctl.  
> We are different here. The target of an I/O instruction is the
> subchannel. CCW devices don't have these kind of registers. The mediated
> ccw device can not recognize such an submission by its own capbilities. 
> 
> A CCW device does not have such registers in both the physical and the
> mediated devices to sense or recognize the submission request. It's the
> CPU that recognizes the submission by intercepting the guest ssch
> instruction.
> 
> CPU can not tell if it is issued from a passed thru device driver or a
> virtio device driver from the guest. So it has to exit to QEMU, and let
> QEMU take over.
> 
> Once QEMU identifies the target subchannel is serving a passed thru
> device, it uses the ioctl to pass the instruction parameters into the
> kernel all the way along the mediated driver to the physical driver to
> the subchannel to perform the I/O operation.
> 
> > > The r/w handler of mediated device can figure
> > > out whether it's a ccw submission or not. But my understanding might 
> > > be wrong here.  
> We don't have registers to sense an instruction or operation.

Ok, so it seems we need to create some sort of interface to initiate
the ccw program, but I suppose I'm not yet convinced that it needs a
new ioctl.  For instance, if you only need to "kick" the device to tell
it when to begin translation and execution, we could create a virtual
interrupt into the mediated device with an irqfd.  QEMU writes to the
irqfd (eventfd), the mediated driver receives this kick and begins
processing.  Another virtual interrupt out to the user might indicate
completion.  On the other hand, if the ioctl was intended to write the
ccw program itself to the device, we have vfio device regions that can
do this.  Simply define within the vfio-ccw API that one of the regions
is a virtual program buffer, and define, as the API between the mediated
driver and the user, the sequence of writes that load the program state,
initiate the program, and return the result.
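
To make the "kick" idea concrete, here is a minimal userspace sketch of
the eventfd semantics it relies on.  It only models the signalling: in
the design discussed here the receiver would be the mediated driver in
the kernel, not a read(2) caller, and the helper names kick_create(),
kick_signal() and kick_wait() are hypothetical, not part of any vfio API.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create the kick channel; returns a file descriptor or -1 on error. */
static int kick_create(void)
{
    return eventfd(0, EFD_CLOEXEC);
}

/* Sender side (QEMU in this discussion): add 1 to the eventfd counter. */
static int kick_signal(int fd)
{
    uint64_t one = 1;

    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Receiver side: block until at least one kick is pending, then fetch
 * the number of pending kicks and reset the counter to zero.
 */
static int kick_wait(int fd, uint64_t *kicks)
{
    return read(fd, kicks, sizeof(*kicks)) == (ssize_t)sizeof(*kicks) ? 0 : -1;
}
```

Note that kicks coalesce: several writes before one read are reported as
a single wakeup with a count, so the receiver has to find out from other
state (for example a region) how much work is actually pending.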

The vfio API already has a very extensible mechanism for communicating
with a device through regions and interrupts, not all of which
necessarily need to match physical attributes of the device.  ioctls
can be added, but let's exhaust the mechanisms we already have through
the vfio api first.  Thanks,

Alex
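
The region-based alternative above could look roughly like the sketch
below.  Everything in it is an assumption made for illustration: the
offsets, struct prog_region, and the region_write()/region_read()
handlers are hypothetical, and run_program() is a stand-in that just sums
the bytes where a real mediated driver would translate and issue the ccw
program.  It is meant to show the shape of a "load, start, read result"
protocol over a region, not an actual vfio-ccw layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of a "virtual program buffer" region. */
#define PROG_OFF    0u      /* program bytes are written starting here */
#define PROG_MAX    256u    /* size of the program area in bytes */
#define START_OFF   256u    /* writing 1 here initiates the program */
#define RESULT_OFF  260u    /* the result word is read back from here */

struct prog_region {
    uint8_t  prog[PROG_MAX];
    size_t   prog_len;
    uint32_t result;
    int      started;
};

/* Stand-in for translation and execution: sums the program bytes. */
static uint32_t run_program(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum += p[i];
    return sum;
}

/* Write handler: the offset tells a program load from a start request. */
static long region_write(struct prog_region *r, const void *buf,
                         size_t count, uint64_t off)
{
    if (off < PROG_MAX) {
        if (count > PROG_MAX - off)
            return -1;
        memcpy(r->prog + off, buf, count);
        if ((size_t)off + count > r->prog_len)
            r->prog_len = (size_t)off + count;
        return (long)count;
    }
    if (off == START_OFF && count == sizeof(uint32_t)) {
        uint32_t v;

        memcpy(&v, buf, sizeof(v));
        if (v == 1) {
            r->result = run_program(r->prog, r->prog_len);
            r->started = 1;
        }
        return (long)count;
    }
    return -1;
}

/* Read handler: only the result word is readable, and only after a start. */
static long region_read(const struct prog_region *r, void *buf,
                        size_t count, uint64_t off)
{
    if (off != RESULT_OFF || count != sizeof(uint32_t) || !r->started)
        return -1;
    memcpy(buf, &r->result, sizeof(r->result));
    return (long)count;
}
```

From the user's side this would be a pwrite() of the program, a pwrite()
of 1 to the start offset, and a pread() of the result, all on the device
fd at the region's offset.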

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver
@ 2016-06-08  4:29                             ` Alex Williamson
  0 siblings, 0 replies; 92+ messages in thread
From: Alex Williamson @ 2016-06-08  4:29 UTC (permalink / raw)
  To: Dong Jia
  Cc: Tian, Kevin, Neo Jia, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Wed, 8 Jun 2016 11:18:42 +0800
Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:

> On Tue, 7 Jun 2016 19:39:21 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Wed, 8 Jun 2016 01:18:42 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Wednesday, June 08, 2016 6:42 AM
> > > > 
> > > > On Tue, 7 Jun 2016 03:03:32 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >     
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Tuesday, June 07, 2016 3:31 AM
> > > > > >
> > > > > > On Mon, 6 Jun 2016 10:44:25 -0700
> > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > >    
> > > > > > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:    
> > > > > > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > > > >
> > > > > > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > > > > > This intends to handle an intercepted channel I/O instruction. It
> > > > > > > > basically need to do the following thing:    
> > > > > > >
> > > > > > > May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> > > > > > > first place?    
> > > > > >
> > > > > > Yep, this is my question as well.  It sounds a bit like there's an
> > > > > > emulated device in QEMU that's trying to tell the mediated device when
> > > > > > to start an operation when we probably should be passing through
> > > > > > whatever i/o operations indicate that status directly to the mediated
> > > > > > device. Thanks,
> > > > > >
> > > > > > Alex    
> > > > >
> > > > > Below is copied from Dong's earlier post which said clear that
> > > > > a guest cmd submission will trigger the whole flow:
> > > > >
> > > > > ----
> > > > > Explanation:
> > > > > Q1-Q4: Qemu side process.
> > > > > K1-K6: Kernel side process.
> > > > >
> > > > > Q1. Intercept a ssch instruction.
> > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > >     (u_ccwchain).
> > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > > >         program, which becomes runnable for a real device.
> > > > >     K3. With the necessary information contained in the orb passed in
> > > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > >         for the I/O result.
> > > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > >         update the user space irb.
> > > > >     K6. Copy irb and scsw back to user space.
> > > > > Q4. Update the irb for the guest.
> > > > > ----    
> > > > 
> > > > Right, but this was the pre-mediated device approach, now we no longer
> > > > need step Q2 so we really only need Q1 and therefore Q3 to exist in
> > > > QEMU if those are operations that are not visible to the mediated
> > > > device; which they very well might be, since it's described as an
> > > > instruction rather than an i/o operation.  It's not terrible if that's
> > > > the case, vfio-pci has its own ioctl for doing a hot reset.    
> Dear Alex, Kevin and Neo,
> 
> 'ssch' is a privileged I/O instruction, which should be finally issued
> to the dedicated subchannel of the physical device.
> 
> BTW, I did remove step Q2 with all of the user-space translation code,
> according to your comments in another thread.
> 
> > > 
> > >   
> > > >     
> > > > > My understanding is that such thing belongs to how device is mediated
> > > > > (so device driver specific), instead of something to be abstracted in
> > > > > VFIO which manages resource but doesn't care how resource is used.
> > > > >
> > > > > Actually we have same requirement in vGPU case, that a guest driver
> > > > > needs submit GPU commands through some MMIO register. vGPU device
> > > > > model will intercept the submission request (in its own way), do its
> > > > > necessary scan/audit to ensure correctness/security, and then submit
> > > > > to physical GPU through vendor specific interface.
> > > > >
> > > > > No difference with channel I/O here.    
> > > > 
> > > > Well, if the GPU command is submitted through an MMIO register, is that
> > > > MMIO register part of the mediated device?  If so, could the mediated
> > > > device recognize the command and do the scan/audit itself?  QEMU must
> > > > not be the point at which mediation occurs for security purposes, QEMU
> > > > is userspace and userspace is not to be trusted.  I'm still open to
> > > > ioctls where it makes sense, as above, we have PCI specific ioctls and
> > > > already, but we need to evaluate each one, why it needs to exist, and
> > > > whether we can skip it if the mediated device can trigger the action on
> > > > its own.  After all, that's why we're using the vfio api, so we can
> > > > re-use much of the existing infrastructure, especially for a vGPU that
> > > > exposes itself as a PCI device.  Thanks,
> > > >     
> > > 
> > > My point is that a guest submission on vGPU is just a normal trapped 
> > > register write, which is forwarded from Qemu to VFIO through pwrite 
> > > interface and then hit mediated vGPU device. The mediated device
> > > will recognize this register write as a submission request and then do
> > > necessary scan (looks we are saying same thing) and then submit to
> > > physical device driver. If loading ccw cmds on channel i/o are also 
> > > through some I/O registers, it can be implemented same way w/o
> > > introducing new ioctl.  
> We are different here. The target of an I/O instruction is the
> subchannel. CCW devices don't have these kind of registers. The mediated
> ccw device can not recognize such an submission by its own capbilities. 
> 
> A CCW device does not have such registers in both the physical and the
> mediated devices to sense or recognize the submission request. It's the
> CPU that recognizes the submission by intercepting the guest ssch
> instruction.
> 
> CPU can not tell if it is issued from a passed thru device driver or a
> virtio device driver from the guest. So it has to exit to QEMU, and let
> QEMU take over.
> 
> Once QEMU identifies the target subchannel is serving a passed thru
> device, it uses the ioctl to pass the instruction parameters into the
> kernel all the way along the mediated driver to the physical driver to
> the subchannel to perform the I/O operation.
> 
> > > The r/w handler of mediated device can figure
> > > out whether it's a ccw submission or not. But my understanding might 
> > > be wrong here.  
> We don't have registers to sense an instruction or operation.

Ok, so it seems we need to create some sort of interface to initiate
the ccw program, but I suppose I'm not yet convinced that it needs a
new ioctl.  For instance, if you only need to "kick" the device to tell
it when to begin translation and execution, we could create a virtual
interrupt into the mediated device with an irqfd.  QEMU writes to the
irqfd (eventfd), the mediated driver receives this kick and begins
processing.  Another virtual interrupt out to the user might indicate
completion.  On the other hand, if the ioctl was intended to write the
ccw program itself to the device, we have vfio device regions that can
do this.  Simply define within the vfio-ccw API that one of the regions
is a virtual program buffer, and define, as the API between the mediated
driver and the user, the sequence of writes that load the program state,
initiate the program, and return the result.

The vfio API already has a very extensible mechanism for communicating
with a device through regions and interrupts, not all of which
necessarily need to match physical attributes of the device.  ioctls
can be added, but let's exhaust the mechanisms we already have through
the vfio api first.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-08  3:48                             ` [Qemu-devel] " Neo Jia
@ 2016-06-08  6:13                               ` Dong Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Dong Jia @ 2016-06-08  6:13 UTC (permalink / raw)
  To: Neo Jia
  Cc: Alex Williamson, Tian, Kevin, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan, Dong Jia

On Tue, 7 Jun 2016 20:48:42 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Wed, Jun 08, 2016 at 11:18:42AM +0800, Dong Jia wrote:
> > On Tue, 7 Jun 2016 19:39:21 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > > On Wed, 8 Jun 2016 01:18:42 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > 
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Wednesday, June 08, 2016 6:42 AM
> > > > > 
> > > > > On Tue, 7 Jun 2016 03:03:32 +0000
> > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > >   
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > Sent: Tuesday, June 07, 2016 3:31 AM
> > > > > > >
> > > > > > > On Mon, 6 Jun 2016 10:44:25 -0700
> > > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > > >  
> > > > > > > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:  
> > > > > > > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > > > > > > This intends to handle an intercepted channel I/O instruction. It
> > > > > > > > > basically need to do the following thing:  
> > > > > > > >
> > > > > > > > May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> > > > > > > > first place?  
> > > > > > >
> > > > > > > Yep, this is my question as well.  It sounds a bit like there's an
> > > > > > > emulated device in QEMU that's trying to tell the mediated device when
> > > > > > > to start an operation when we probably should be passing through
> > > > > > > whatever i/o operations indicate that status directly to the mediated
> > > > > > > device. Thanks,
> > > > > > >
> > > > > > > Alex  
> > > > > >
> > > > > > Below is copied from Dong's earlier post which said clear that
> > > > > > a guest cmd submission will trigger the whole flow:
> > > > > >
> > > > > > ----
> > > > > > Explanation:
> > > > > > Q1-Q4: Qemu side process.
> > > > > > K1-K6: Kernel side process.
> > > > > >
> > > > > > Q1. Intercept a ssch instruction.
> > > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > >     (u_ccwchain).
> > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > > > >         program, which becomes runnable for a real device.
> > > > > >     K3. With the necessary information contained in the orb passed in
> > > > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > > >         for the I/O result.
> > > > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > > >         update the user space irb.
> > > > > >     K6. Copy irb and scsw back to user space.
> > > > > > Q4. Update the irb for the guest.
> > > > > > ----  
> > > > > 
> > > > > Right, but this was the pre-mediated device approach, now we no longer
> > > > > need step Q2 so we really only need Q1 and therefore Q3 to exist in
> > > > > QEMU if those are operations that are not visible to the mediated
> > > > > device; which they very well might be, since it's described as an
> > > > > instruction rather than an i/o operation.  It's not terrible if that's
> > > > > the case, vfio-pci has its own ioctl for doing a hot reset.  
> > Dear Alex, Kevin and Neo,
> > 
> > 'ssch' is a privileged I/O instruction, which should be finally issued
> > to the dedicated subchannel of the physical device.
> > 
> > BTW, I did remove step Q2 with all of the user-space translation code,
> > according to your comments in another thread.
> > 
> > > > 
> > > > 
> > > > >   
> > > > > > My understanding is that such thing belongs to how device is mediated
> > > > > > (so device driver specific), instead of something to be abstracted in
> > > > > > VFIO which manages resource but doesn't care how resource is used.
> > > > > >
> > > > > > Actually we have same requirement in vGPU case, that a guest driver
> > > > > > needs submit GPU commands through some MMIO register. vGPU device
> > > > > > model will intercept the submission request (in its own way), do its
> > > > > > necessary scan/audit to ensure correctness/security, and then submit
> > > > > > to physical GPU through vendor specific interface.
> > > > > >
> > > > > > No difference with channel I/O here.  
> > > > > 
> > > > > Well, if the GPU command is submitted through an MMIO register, is that
> > > > > MMIO register part of the mediated device?  If so, could the mediated
> > > > > device recognize the command and do the scan/audit itself?  QEMU must
> > > > > not be the point at which mediation occurs for security purposes, QEMU
> > > > > is userspace and userspace is not to be trusted.  I'm still open to
> > > > > ioctls where it makes sense, as above, we have PCI specific ioctls and
> > > > > already, but we need to evaluate each one, why it needs to exist, and
> > > > > whether we can skip it if the mediated device can trigger the action on
> > > > > its own.  After all, that's why we're using the vfio api, so we can
> > > > > re-use much of the existing infrastructure, especially for a vGPU that
> > > > > exposes itself as a PCI device.  Thanks,
> > > > >   
> > > > 
> > > > My point is that a guest submission on vGPU is just a normal trapped 
> > > > register write, which is forwarded from Qemu to VFIO through pwrite 
> > > > interface and then hit mediated vGPU device. The mediated device
> > > > will recognize this register write as a submission request and then do
> > > > necessary scan (looks we are saying same thing) and then submit to
> > > > physical device driver. If loading ccw cmds on channel i/o are also 
> > > > through some I/O registers, it can be implemented same way w/o
> > > > introducing new ioctl.
> > We are different here. The target of an I/O instruction is the
> > subchannel. CCW devices don't have these kind of registers. The mediated
> > ccw device can not recognize such an submission by its own capbilities. 
> > 
> > A CCW device does not have such registers in both the physical and the
> > mediated devices to sense or recognize the submission request. It's the
> > CPU that recognizes the submission by intercepting the guest ssch
> > instruction.
> > 
> > CPU can not tell if it is issued from a passed thru device driver or a
> > virtio device driver from the guest. So it has to exit to QEMU, and let
> > QEMU take over.
> 
> Hi Dong,
> 
> What actually has triggered the VM_EXIT to QEMU of that vCPU? Is it an MMIO
> access of the "virtual device" inside guest?
Dear Neo,

It's not an MMIO access, but an I/O instruction.

Our CPU has a mode (like VT-x in the x86 world, I guess) to oversee
the execution of programs in a virtual machine environment. Once the CPU
enters this mode, it commences execution of the guest program. It can
handle many aspects of a virtual machine by itself; for instructions
where such handling is not provided, the CPU exits from this mode. The
I/O instruction 'ssch' is one of the instructions that this CPU mode
cannot handle. So a ssch issued from the guest triggers an exit from
this CPU mode with an exit_reason, and then the vcpu gets the reason and
exits to QEMU.
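
As a rough sketch of that flow (all names here are hypothetical and only
for illustration, not the actual KVM or QEMU interfaces): the exit
carries a reason, and the user space side then looks up the target
subchannel to decide whether the instruction belongs to an emulated
device or has to be passed into the kernel for a passed-through one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical exit reasons reported when the CPU leaves guest mode. */
enum exit_reason { EXIT_OTHER, EXIT_IO_INSTRUCTION };

/* Where an intercepted I/O instruction ends up being handled. */
enum io_route { ROUTE_NONE, ROUTE_EMULATED, ROUTE_PASSTHROUGH };

struct subchannel {
    uint16_t id;
    bool     passthrough;   /* true if backed by a physical subchannel */
};

/* Find the subchannel addressed by the intercepted instruction. */
static const struct subchannel *find_subchannel(const struct subchannel *tbl,
                                                size_t n, uint16_t id)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].id == id)
            return &tbl[i];
    return NULL;
}

/*
 * Decide the route for one exit.  The CPU cannot tell which kind of
 * device the guest addressed, so the decision is made here, after the
 * exit, based on the subchannel table.
 */
static enum io_route route_exit(enum exit_reason why,
                                const struct subchannel *tbl, size_t n,
                                uint16_t target)
{
    const struct subchannel *sch;

    if (why != EXIT_IO_INSTRUCTION)
        return ROUTE_NONE;
    sch = find_subchannel(tbl, n, target);
    if (!sch)
        return ROUTE_NONE;
    return sch->passthrough ? ROUTE_PASSTHROUGH : ROUTE_EMULATED;
}
```

Only the ROUTE_PASSTHROUGH case would need to hand the instruction
parameters to the mediated driver.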

> 
> Thanks,
> Neo
> 
> > 
> > Once QEMU identifies the target subchannel is serving a passed thru
> > device, it uses the ioctl to pass the instruction parameters into the
> > kernel all the way along the mediated driver to the physical driver to
> > the subchannel to perform the I/O operation.
> > 
> > > > The r/w handler of mediated device can figure
> > > > out whether it's a ccw submission or not. But my understanding might 
> > > > be wrong here.
> > We don't have registers to sense an instruction or operation.
> > 
> > > 
> > > I think we're in violent agreement ;)
> > > 
> > 
> > --------
> > Dong Jia
> > 
> 



--------
Dong Jia


^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver
@ 2016-06-08  6:13                               ` Dong Jia
  0 siblings, 0 replies; 92+ messages in thread
From: Dong Jia @ 2016-06-08  6:13 UTC (permalink / raw)
  To: Neo Jia
  Cc: Alex Williamson, Tian, Kevin, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan, Dong Jia

On Tue, 7 Jun 2016 20:48:42 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Wed, Jun 08, 2016 at 11:18:42AM +0800, Dong Jia wrote:
> > On Tue, 7 Jun 2016 19:39:21 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > > On Wed, 8 Jun 2016 01:18:42 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > 
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Wednesday, June 08, 2016 6:42 AM
> > > > > 
> > > > > On Tue, 7 Jun 2016 03:03:32 +0000
> > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > >   
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > Sent: Tuesday, June 07, 2016 3:31 AM
> > > > > > >
> > > > > > > On Mon, 6 Jun 2016 10:44:25 -0700
> > > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > > >  
> > > > > > > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:  
> > > > > > > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > > > > > > This intends to handle an intercepted channel I/O instruction. It
> > > > > > > > > basically need to do the following thing:  
> > > > > > > >
> > > > > > > > May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> > > > > > > > first place?  
> > > > > > >
> > > > > > > Yep, this is my question as well.  It sounds a bit like there's an
> > > > > > > emulated device in QEMU that's trying to tell the mediated device when
> > > > > > > to start an operation when we probably should be passing through
> > > > > > > whatever i/o operations indicate that status directly to the mediated
> > > > > > > device. Thanks,
> > > > > > >
> > > > > > > Alex  
> > > > > >
> > > > > > Below is copied from Dong's earlier post which said clear that
> > > > > > a guest cmd submission will trigger the whole flow:
> > > > > >
> > > > > > ----
> > > > > > Explanation:
> > > > > > Q1-Q4: Qemu side process.
> > > > > > K1-K6: Kernel side process.
> > > > > >
> > > > > > Q1. Intercept a ssch instruction.
> > > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > >     (u_ccwchain).
> > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > > > >         program, which becomes runnable for a real device.
> > > > > >     K3. With the necessary information contained in the orb passed in
> > > > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > > >         for the I/O result.
> > > > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > > >         update the user space irb.
> > > > > >     K6. Copy irb and scsw back to user space.
> > > > > > Q4. Update the irb for the guest.
> > > > > > ----  
> > > > > 
> > > > > Right, but this was the pre-mediated device approach, now we no longer
> > > > > need step Q2 so we really only need Q1 and therefore Q3 to exist in
> > > > > QEMU if those are operations that are not visible to the mediated
> > > > > device; which they very well might be, since it's described as an
> > > > > instruction rather than an i/o operation.  It's not terrible if that's
> > > > > the case, vfio-pci has its own ioctl for doing a hot reset.  
> > Dear Alex, Kevin and Neo,
> > 
> > 'ssch' is a privileged I/O instruction, which should be finally issued
> > to the dedicated subchannel of the physical device.
> > 
> > BTW, I did remove step Q2 with all of the user-space translation code,
> > according to your comments in another thread.
> > 
> > > > 
> > > > 
> > > > >   
> > > > > > My understanding is that such thing belongs to how device is mediated
> > > > > > (so device driver specific), instead of something to be abstracted in
> > > > > > VFIO which manages resource but doesn't care how resource is used.
> > > > > >
> > > > > > Actually we have same requirement in vGPU case, that a guest driver
> > > > > > needs submit GPU commands through some MMIO register. vGPU device
> > > > > > model will intercept the submission request (in its own way), do its
> > > > > > necessary scan/audit to ensure correctness/security, and then submit
> > > > > > to physical GPU through vendor specific interface.
> > > > > >
> > > > > > No difference with channel I/O here.  
> > > > > 
> > > > > Well, if the GPU command is submitted through an MMIO register, is that
> > > > > MMIO register part of the mediated device?  If so, could the mediated
> > > > > device recognize the command and do the scan/audit itself?  QEMU must
> > > > > not be the point at which mediation occurs for security purposes, QEMU
> > > > > is userspace and userspace is not to be trusted.  I'm still open to
> > > > > ioctls where it makes sense, as above, we have PCI specific ioctls and
> > > > > already, but we need to evaluate each one, why it needs to exist, and
> > > > > whether we can skip it if the mediated device can trigger the action on
> > > > > its own.  After all, that's why we're using the vfio api, so we can
> > > > > re-use much of the existing infrastructure, especially for a vGPU that
> > > > > exposes itself as a PCI device.  Thanks,
> > > > >   
> > > > 
> > > > My point is that a guest submission on vGPU is just a normal trapped 
> > > > register write, which is forwarded from Qemu to VFIO through pwrite 
> > > > interface and then hit mediated vGPU device. The mediated device
> > > > will recognize this register write as a submission request and then do
> > > > necessary scan (looks we are saying same thing) and then submit to
> > > > physical device driver. If loading ccw cmds on channel i/o are also 
> > > > through some I/O registers, it can be implemented same way w/o
> > > > introducing new ioctl.
> > We are different here. The target of an I/O instruction is the
> > subchannel. CCW devices don't have these kind of registers. The mediated
> > ccw device can not recognize such an submission by its own capbilities. 
> > 
> > A CCW device does not have such registers in both the physical and the
> > mediated devices to sense or recognize the submission request. It's the
> > CPU that recognizes the submission by intercepting the guest ssch
> > instruction.
> > 
> > CPU can not tell if it is issued from a passed thru device driver or a
> > virtio device driver from the guest. So it has to exit to QEMU, and let
> > QEMU take over.
> 
> Hi Dong,
> 
> What actually has triggered the VM_EXIT to QEMU of that vCPU? Is it an MMIO
> access of the "virtual device" inside guest?
Dear Neo,

It's not an MMIO access, but an I/O instruction.

Our CPU has a mode (like VT-x in the x86 world, I guess) to oversee
the execution of programs in a virtual machine environment. Once the CPU
enters this mode, it commences execution of the guest program. It can
handle many aspects of a virtual machine by itself, but for
instructions it cannot handle, the CPU exits from this mode. The I/O
instruction 'ssch' is one of the instructions that this CPU mode cannot
handle. So an ssch issued from the guest triggers an exit from this CPU
mode with an exit reason, and the vcpu then takes that reason and exits
to QEMU.

> 
> Thanks,
> Neo
> 
> > 
> > Once QEMU identifies the target subchannel is serving a passed thru
> > device, it uses the ioctl to pass the instruction parameters into the
> > kernel all the way along the mediated driver to the physical driver to
> > the subchannel to perform the I/O operation.
> > 
> > > > The r/w handler of mediated device can figure
> > > > out whether it's a ccw submission or not. But my understanding might 
> > > > be wrong here.
> > We don't have registers to sense an instruction or operation.
> > 
> > > 
> > > I think we're in violent agreement ;)
> > > 
> > 
> > --------
> > Dong Jia
> > 
> 



--------
Dong Jia

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-08  6:13                               ` [Qemu-devel] " Dong Jia
@ 2016-06-08  6:22                                 ` Neo Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Neo Jia @ 2016-06-08  6:22 UTC (permalink / raw)
  To: Dong Jia
  Cc: Alex Williamson, Tian, Kevin, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Wed, Jun 08, 2016 at 02:13:49PM +0800, Dong Jia wrote:
> On Tue, 7 Jun 2016 20:48:42 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > On Wed, Jun 08, 2016 at 11:18:42AM +0800, Dong Jia wrote:
> > > On Tue, 7 Jun 2016 19:39:21 -0600
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > 
> > > > On Wed, 8 Jun 2016 01:18:42 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > 
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Wednesday, June 08, 2016 6:42 AM
> > > > > > 
> > > > > > On Tue, 7 Jun 2016 03:03:32 +0000
> > > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > > >   
> > > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > > Sent: Tuesday, June 07, 2016 3:31 AM
> > > > > > > >
> > > > > > > > On Mon, 6 Jun 2016 10:44:25 -0700
> > > > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > > > >  
> > > > > > > > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:  
> > > > > > > > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > > > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > > > > > >
> > > > > > > > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > > > > > > > This intends to handle an intercepted channel I/O instruction. It
> > > > > > > > > > basically need to do the following thing:  
> > > > > > > > >
> > > > > > > > > May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> > > > > > > > > first place?  
> > > > > > > >
> > > > > > > > Yep, this is my question as well.  It sounds a bit like there's an
> > > > > > > > emulated device in QEMU that's trying to tell the mediated device when
> > > > > > > > to start an operation when we probably should be passing through
> > > > > > > > whatever i/o operations indicate that status directly to the mediated
> > > > > > > > device. Thanks,
> > > > > > > >
> > > > > > > > Alex  
> > > > > > >
> > > > > > > Below is copied from Dong's earlier post which said clear that
> > > > > > > a guest cmd submission will trigger the whole flow:
> > > > > > >
> > > > > > > ----
> > > > > > > Explanation:
> > > > > > > Q1-Q4: Qemu side process.
> > > > > > > K1-K6: Kernel side process.
> > > > > > >
> > > > > > > Q1. Intercept a ssch instruction.
> > > > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > > >     (u_ccwchain).
> > > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > > > > >         program, which becomes runnable for a real device.
> > > > > > >     K3. With the necessary information contained in the orb passed in
> > > > > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > > > >         for the I/O result.
> > > > > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > > > >         update the user space irb.
> > > > > > >     K6. Copy irb and scsw back to user space.
> > > > > > > Q4. Update the irb for the guest.
> > > > > > > ----  
> > > > > > 
> > > > > > Right, but this was the pre-mediated device approach, now we no longer
> > > > > > need step Q2 so we really only need Q1 and therefore Q3 to exist in
> > > > > > QEMU if those are operations that are not visible to the mediated
> > > > > > device; which they very well might be, since it's described as an
> > > > > > instruction rather than an i/o operation.  It's not terrible if that's
> > > > > > the case, vfio-pci has its own ioctl for doing a hot reset.  
> > > Dear Alex, Kevin and Neo,
> > > 
> > > 'ssch' is a privileged I/O instruction, which should be finally issued
> > > to the dedicated subchannel of the physical device.
> > > 
> > > BTW, I did remove step Q2 with all of the user-space translation code,
> > > according to your comments in another thread.
> > > 
> > > > > 
> > > > > 
> > > > > >   
> > > > > > > My understanding is that such thing belongs to how device is mediated
> > > > > > > (so device driver specific), instead of something to be abstracted in
> > > > > > > VFIO which manages resource but doesn't care how resource is used.
> > > > > > >
> > > > > > > Actually we have same requirement in vGPU case, that a guest driver
> > > > > > > needs submit GPU commands through some MMIO register. vGPU device
> > > > > > > model will intercept the submission request (in its own way), do its
> > > > > > > necessary scan/audit to ensure correctness/security, and then submit
> > > > > > > to physical GPU through vendor specific interface.
> > > > > > >
> > > > > > > No difference with channel I/O here.  
> > > > > > 
> > > > > > Well, if the GPU command is submitted through an MMIO register, is that
> > > > > > MMIO register part of the mediated device?  If so, could the mediated
> > > > > > device recognize the command and do the scan/audit itself?  QEMU must
> > > > > > not be the point at which mediation occurs for security purposes, QEMU
> > > > > > is userspace and userspace is not to be trusted.  I'm still open to
> > > > > > ioctls where it makes sense, as above, we have PCI specific ioctls and
> > > > > > already, but we need to evaluate each one, why it needs to exist, and
> > > > > > whether we can skip it if the mediated device can trigger the action on
> > > > > > its own.  After all, that's why we're using the vfio api, so we can
> > > > > > re-use much of the existing infrastructure, especially for a vGPU that
> > > > > > exposes itself as a PCI device.  Thanks,
> > > > > >   
> > > > > 
> > > > > My point is that a guest submission on vGPU is just a normal trapped 
> > > > > register write, which is forwarded from Qemu to VFIO through pwrite 
> > > > > interface and then hit mediated vGPU device. The mediated device
> > > > > will recognize this register write as a submission request and then do
> > > > > necessary scan (looks we are saying same thing) and then submit to
> > > > > physical device driver. If loading ccw cmds on channel i/o are also 
> > > > > through some I/O registers, it can be implemented same way w/o
> > > > > introducing new ioctl.
> > > We are different here. The target of an I/O instruction is the
> > > subchannel. CCW devices don't have these kind of registers. The mediated
> > > ccw device can not recognize such an submission by its own capbilities. 
> > > 
> > > A CCW device does not have such registers in both the physical and the
> > > mediated devices to sense or recognize the submission request. It's the
> > > CPU that recognizes the submission by intercepting the guest ssch
> > > instruction.
> > > 
> > > CPU can not tell if it is issued from a passed thru device driver or a
> > > virtio device driver from the guest. So it has to exit to QEMU, and let
> > > QEMU take over.
> > 
> > Hi Dong,
> > 
> > What actually has triggered the VM_EXIT to QEMU of that vCPU? Is it an MMIO
> > access of the "virtual device" inside guest?
> Dear Neo,
> 
> It's not a MMIO access, but an I/O instruction.
> 
> Our cpu has a mode (like vt-x in the x86 world? I guess...) to oversee
> the execution of programs in a virtual machine environment. Once the cpu
> enters this mode, it commence execution of the guest program. It could
> handle many aspects of an virtual machine, or, when for some
> instructions if such handling is not provided, cpu will exit from this
> mode. The I/O instruction 'ssch' is one kind of the instructions that
> this cpu mode could not handle. So a ssch issued from the guest will
> trigger the exit of this cpu mode with the exit_reason, and then the
> vcpu gets the reason and exit to QEMU.

Hi Dong,

Thanks for the details.

Can you claim an MMIO region for your virtual device? If yes, then the
VM_EXIT triggered by the I/O instruction can be forwarded to your device
by a pwrite from QEMU through this new region.
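
A rough sketch in C of what that forwarding could look like from the
QEMU side, assuming the device fd and the region offset are already
known from the usual VFIO setup (for example via
VFIO_DEVICE_GET_REGION_INFO). The request layout and the function name
are hypothetical; nothing in this thread defines them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical request layout, for illustration only. */
struct ccw_request {
    uint32_t schid;     /* target subchannel */
    uint32_t reserved;  /* explicit padding, keeps the layout fixed */
    uint64_t orb_addr;  /* guest address of the operation request block */
};

/*
 * Write one request at the start of the device region with pwrite(2).
 * Returns 0 on success, -1 on error or short write.
 */
static int forward_request(int device_fd, off_t region_offset,
                           const struct ccw_request *req)
{
    assert(req != NULL);
    ssize_t n = pwrite(device_fd, req, sizeof(*req), region_offset);
    return n == (ssize_t)sizeof(*req) ? 0 : -1;
}
```

The write handler of the mediated device would then see the request as
an ordinary region write, the same way a vGPU sees a trapped register
write.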

Thanks,
Neo

> 
> > 
> > Thanks,
> > Neo
> > 
> > > 
> > > Once QEMU identifies the target subchannel is serving a passed thru
> > > device, it uses the ioctl to pass the instruction parameters into the
> > > kernel all the way along the mediated driver to the physical driver to
> > > the subchannel to perform the I/O operation.
> > > 
> > > > > The r/w handler of mediated device can figure
> > > > > out whether it's a ccw submission or not. But my understanding might 
> > > > > be wrong here.
> > > We don't have registers to sense an instruction or operation.
> > > 
> > > > 
> > > > I think we're in violent agreement ;)
> > > > 
> > > 
> > > --------
> > > Dong Jia
> > > 
> > 
> 
> 
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 92+ messages in thread

* Re: [RFC PATCH v4 1/3] Mediated device Core driver
  2016-06-08  4:29                             ` [Qemu-devel] " Alex Williamson
@ 2016-06-15  6:37                               ` Dong Jia
  -1 siblings, 0 replies; 92+ messages in thread
From: Dong Jia @ 2016-06-15  6:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Neo Jia, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan, Dong Jia

On Tue, 7 Jun 2016 22:29:30 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 8 Jun 2016 11:18:42 +0800
> Dong Jia <bjsdjshi@linux.vnet.ibm.com> wrote:
> 
> > On Tue, 7 Jun 2016 19:39:21 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > > On Wed, 8 Jun 2016 01:18:42 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >   
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Wednesday, June 08, 2016 6:42 AM
> > > > > 
> > > > > On Tue, 7 Jun 2016 03:03:32 +0000
> > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > >     
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > Sent: Tuesday, June 07, 2016 3:31 AM
> > > > > > >
> > > > > > > On Mon, 6 Jun 2016 10:44:25 -0700
> > > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > > >    
> > > > > > > > On Mon, Jun 06, 2016 at 04:29:11PM +0800, Dong Jia wrote:    
> > > > > > > > > On Sun, 5 Jun 2016 23:27:42 -0700
> > > > > > > > > Neo Jia <cjia@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > 2. VFIO_DEVICE_CCW_CMD_REQUEST
> > > > > > > > > This intends to handle an intercepted channel I/O instruction. It
> > > > > > > > > basically need to do the following thing:    
> > > > > > > >
> > > > > > > > May I ask how and when QEMU knows that he needs to issue such VFIO ioctl at
> > > > > > > > first place?    
> > > > > > >
> > > > > > > Yep, this is my question as well.  It sounds a bit like there's an
> > > > > > > emulated device in QEMU that's trying to tell the mediated device when
> > > > > > > to start an operation when we probably should be passing through
> > > > > > > whatever i/o operations indicate that status directly to the mediated
> > > > > > > device. Thanks,
> > > > > > >
> > > > > > > Alex    
> > > > > >
> > > > > > Below is copied from Dong's earlier post which said clear that
> > > > > > a guest cmd submission will trigger the whole flow:
> > > > > >
> > > > > > ----
> > > > > > Explanation:
> > > > > > Q1-Q4: Qemu side process.
> > > > > > K1-K6: Kernel side process.
> > > > > >
> > > > > > Q1. Intercept a ssch instruction.
> > > > > > Q2. Translate the guest ccw program to a user space ccw program
> > > > > >     (u_ccwchain).
> > > > > > Q3. Call VFIO_DEVICE_CCW_CMD_REQUEST (u_ccwchain, orb, irb).
> > > > > >     K1. Copy from u_ccwchain to kernel (k_ccwchain).
> > > > > >     K2. Translate the user space ccw program to a kernel space ccw
> > > > > >         program, which becomes runnable for a real device.
> > > > > >     K3. With the necessary information contained in the orb passed in
> > > > > >         by Qemu, issue the k_ccwchain to the device, and wait event q
> > > > > >         for the I/O result.
> > > > > >     K4. Interrupt handler gets the I/O result, and wakes up the wait q.
> > > > > >     K5. CMD_REQUEST ioctl gets the I/O result, and uses the result to
> > > > > >         update the user space irb.
> > > > > >     K6. Copy irb and scsw back to user space.
> > > > > > Q4. Update the irb for the guest.
> > > > > > ----    
> > > > > 
> > > > > Right, but this was the pre-mediated device approach, now we no longer
> > > > > need step Q2 so we really only need Q1 and therefore Q3 to exist in
> > > > > QEMU if those are operations that are not visible to the mediated
> > > > > device; which they very well might be, since it's described as an
> > > > > instruction rather than an i/o operation.  It's not terrible if that's
> > > > > the case, vfio-pci has its own ioctl for doing a hot reset.    
> > Dear Alex, Kevin and Neo,
> > 
> > 'ssch' is a privileged I/O instruction, which should be finally issued
> > to the dedicated subchannel of the physical device.
> > 
> > BTW, I did remove step Q2 with all of the user-space translation code,
> > according to your comments in another thread.
> > 
> > > > 
> > > >   
> > > > >     
> > > > > > My understanding is that such thing belongs to how device is mediated
> > > > > > (so device driver specific), instead of something to be abstracted in
> > > > > > VFIO which manages resource but doesn't care how resource is used.
> > > > > >
> > > > > > Actually we have same requirement in vGPU case, that a guest driver
> > > > > > needs submit GPU commands through some MMIO register. vGPU device
> > > > > > model will intercept the submission request (in its own way), do its
> > > > > > necessary scan/audit to ensure correctness/security, and then submit
> > > > > > to physical GPU through vendor specific interface.
> > > > > >
> > > > > > No difference with channel I/O here.    
> > > > > 
> > > > > Well, if the GPU command is submitted through an MMIO register, is that
> > > > > MMIO register part of the mediated device?  If so, could the mediated
> > > > > device recognize the command and do the scan/audit itself?  QEMU must
> > > > > not be the point at which mediation occurs for security purposes, QEMU
> > > > > is userspace and userspace is not to be trusted.  I'm still open to
> > > > > ioctls where it makes sense, as above, we have PCI specific ioctls and
> > > > > already, but we need to evaluate each one, why it needs to exist, and
> > > > > whether we can skip it if the mediated device can trigger the action on
> > > > > its own.  After all, that's why we're using the vfio api, so we can
> > > > > re-use much of the existing infrastructure, especially for a vGPU that
> > > > > exposes itself as a PCI device.  Thanks,
> > > > >     
> > > > 
> > > > > My point is that a guest submission on vGPU is just a normal trapped
> > > > > register write, which is forwarded from Qemu to VFIO through the pwrite
> > > > > interface and then hits the mediated vGPU device. The mediated device
> > > > > will recognize this register write as a submission request, do the
> > > > > necessary scan (it looks like we are saying the same thing), and then
> > > > > submit to the physical device driver. If ccw cmds for channel i/o are
> > > > > also loaded through some I/O registers, it can be implemented the same
> > > > > way w/o introducing a new ioctl.
> > We are different here. The target of an I/O instruction is the
> > subchannel. CCW devices don't have this kind of register. The mediated
> > ccw device cannot recognize such a submission by its own capabilities.
> > 
> > Neither the physical nor the mediated CCW device has such registers to
> > sense or recognize the submission request. It's the CPU that recognizes
> > the submission by intercepting the guest ssch instruction.
> > 
> > The CPU cannot tell whether it was issued by a passed-through device
> > driver or a virtio device driver in the guest. So it has to exit to QEMU,
> > and let QEMU take over.
> > 
> > Once QEMU identifies that the target subchannel is serving a
> > passed-through device, it uses the ioctl to pass the instruction
> > parameters into the kernel, all the way through the mediated driver and
> > the physical driver to the subchannel, to perform the I/O operation.
> > 
> > > > The r/w handler of mediated device can figure
> > > > out whether it's a ccw submission or not. But my understanding might 
> > > > be wrong here.  
> > We don't have registers to sense an instruction or operation.
> 
> Ok, so it seems we need to create some sort of interface to initiate
> the ccw program, but I suppose I'm not yet convinced that it needs a
> new ioctl.  For instance if you only need to "kick" the device to tell
> it when to begin translation and execution, we could create a virtual
> interrupt into the mediated device with an irqfd.  QEMU writes to the
> irqfd (eventfd), the mediated driver receives this kick and begins
> processing.  Another virtual interrupt out to the user might indicate
> completion. On the other hand if the ioctl was intended to write the
> ccw program itself to the device, we have vfio device regions that can
> do this.  Simply define within the vfio-ccw API that one of the regions
> is a virtual program buffer, and define as the API between the mediated
> driver and the user the sequence of writes that load the program state,
> initiate the program, and return the result.
> 
> The vfio API already has a very extensible mechanism for communicating
> with a device through regions and interrupts, not all of which
> necessarily need to match physical attributes of the device.  ioctls
> can be added, but let's exhaust the mechanisms we already have through
> the vfio api first.  Thanks,
Dear Alex and Neo,

I tried what you suggested (adding an MMIO region to the device), and
it works fine. It's an interesting and elegant approach. I like it. :>
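
To make the region idea concrete, here is a minimal user-space sketch of
a "virtual program buffer" region and its write sequence. It is only an
illustration under stated assumptions: struct ccw_io_region, CMD_START,
struct mdev_state, region_write(), region_read() and submit() are
hypothetical names, the field sizes are illustrative, and the "driver"
is simulated in memory rather than issuing anything to a subchannel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of the virtual program-buffer region (sizes illustrative). */
struct ccw_io_region {
    uint8_t  orb[32];   /* request block, written by the user (e.g. QEMU) */
    uint8_t  irb[64];   /* result block, filled in by the mediated driver */
    uint32_t command;   /* writing CMD_START initiates the program */
    int32_t  ret_code;  /* completion status, read back by the user */
};

#define CMD_START 1u

/* Stand-in for the per-device state kept by the mediated driver. */
struct mdev_state {
    struct ccw_io_region region;
    unsigned int started;   /* number of programs initiated so far */
};

/* Driver side: handler for writes that land in the region. */
long region_write(struct mdev_state *s, const void *buf, size_t count, size_t pos)
{
    const size_t cmd_off = offsetof(struct ccw_io_region, command);

    assert(s != NULL && buf != NULL);
    if (pos > sizeof(s->region) || count > sizeof(s->region) - pos)
        return -1;
    memcpy((uint8_t *)&s->region + pos, buf, count);

    /* A write that covers the command field with CMD_START is the trigger. */
    if (pos <= cmd_off && pos + count >= cmd_off + sizeof(uint32_t) &&
        s->region.command == CMD_START) {
        s->started++;
        /* A real driver would translate and issue the channel program here;
         * this sketch only echoes the first ORB byte into the IRB. */
        memset(s->region.irb, 0, sizeof(s->region.irb));
        s->region.irb[0] = s->region.orb[0];
        s->region.ret_code = 0;
        s->region.command = 0;
    }
    return (long)count;
}

/* Driver side: handler for reads from the region. */
long region_read(const struct mdev_state *s, void *buf, size_t count, size_t pos)
{
    assert(s != NULL && buf != NULL);
    if (pos > sizeof(s->region) || count > sizeof(s->region) - pos)
        return -1;
    memcpy(buf, (const uint8_t *)&s->region + pos, count);
    return (long)count;
}

/* User side: load the program state, initiate it, and fetch the result. */
int submit(struct mdev_state *s, const uint8_t orb[32], uint8_t irb[64])
{
    uint32_t cmd = CMD_START;
    int32_t ret = -1;

    if (region_write(s, orb, 32, offsetof(struct ccw_io_region, orb)) != 32)
        return -1;
    if (region_write(s, &cmd, sizeof(cmd),
                     offsetof(struct ccw_io_region, command)) != (long)sizeof(cmd))
        return -1;
    if (region_read(s, &ret, sizeof(ret),
                    offsetof(struct ccw_io_region, ret_code)) != (long)sizeof(ret))
        return -1;
    if (region_read(s, irb, 64, offsetof(struct ccw_io_region, irb)) != 64)
        return -1;
    return ret;
}
```

With a real vfio device fd, the user side would presumably replace the
direct calls with pwrite()/pread() at the region's file offset (as
reported by VFIO_DEVICE_GET_REGION_INFO) plus the field offset.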

So indeed, we need neither a new ioctl command nor the ioctl callback
on phy_device_ops.
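
For comparison, the other non-ioctl option mentioned above (a "kick"
through an eventfd) might look roughly like this from user space. This is
a sketch only: kick() and take_kicks() are hypothetical helper names, and
the consumer here is an ordinary read() standing in for the mediated
driver, which in the kernel would be notified through its own eventfd
handling instead.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* User side (e.g. QEMU): signal the device by adding 1 to the counter. */
int kick(int efd)
{
    uint64_t one = 1;

    assert(efd >= 0);
    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consumer side (stand-in for the mediated driver): fetch and reset the
 * counter. Returns the number of kicks accumulated since the last call,
 * or -1 if none are pending on a non-blocking eventfd.
 */
int64_t take_kicks(int efd)
{
    uint64_t n;

    assert(efd >= 0);
    if (read(efd, &n, sizeof(n)) != (ssize_t)sizeof(n))
        return -1;
    return (int64_t)n;
}
```

A caller would create the descriptor with eventfd(0, EFD_NONBLOCK), call
kick() once per submission, and close() it when done. Note that kicks
coalesce: several writes before one read are reported as a single count,
so the kick can only say "start processing", not carry the program
itself.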

Thanks!

> 
> Alex
> 



--------
Dong Jia



end of thread, other threads:[~2016-06-15  6:37 UTC | newest]

Thread overview: 92+ messages
2016-05-24 19:58 [RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support] Kirti Wankhede
2016-05-24 19:58 ` [Qemu-devel] " Kirti Wankhede
2016-05-24 19:58 ` [RFC PATCH v4 1/3] Mediated device Core driver Kirti Wankhede
2016-05-24 19:58   ` [Qemu-devel] " Kirti Wankhede
2016-05-25  7:55   ` Tian, Kevin
2016-05-25  7:55     ` [Qemu-devel] " Tian, Kevin
2016-05-25 14:47     ` Kirti Wankhede
2016-05-25 14:47       ` [Qemu-devel] " Kirti Wankhede
2016-05-27  9:00       ` Tian, Kevin
2016-05-27  9:00         ` [Qemu-devel] " Tian, Kevin
2016-05-25 22:39   ` Alex Williamson
2016-05-25 22:39     ` [Qemu-devel] " Alex Williamson
2016-05-26  9:03     ` Kirti Wankhede
2016-05-26  9:03       ` [Qemu-devel] " Kirti Wankhede
2016-05-26 14:06       ` Alex Williamson
2016-05-26 14:06         ` [Qemu-devel] " Alex Williamson
2016-06-03  8:57   ` Dong Jia
2016-06-03  8:57     ` [Qemu-devel] " Dong Jia
2016-06-03  9:40     ` Tian, Kevin
2016-06-03  9:40       ` [Qemu-devel] " Tian, Kevin
2016-06-06  2:24       ` Dong Jia
2016-06-06  2:24         ` [Qemu-devel] " Dong Jia
2016-06-06  5:27     ` Kirti Wankhede
2016-06-06  5:27       ` [Qemu-devel] " Kirti Wankhede
2016-06-06  6:01       ` Dong Jia
2016-06-06  6:01         ` [Qemu-devel] " Dong Jia
2016-06-06  6:27         ` Neo Jia
2016-06-06  6:27           ` [Qemu-devel] " Neo Jia
2016-06-06  8:29           ` Dong Jia
2016-06-06  8:29             ` [Qemu-devel] " Dong Jia
2016-06-06 17:44             ` Neo Jia
2016-06-06 17:44               ` [Qemu-devel] " Neo Jia
2016-06-06 19:31               ` Alex Williamson
2016-06-06 19:31                 ` [Qemu-devel] " Alex Williamson
2016-06-07  3:03                 ` Tian, Kevin
2016-06-07  3:03                   ` [Qemu-devel] " Tian, Kevin
2016-06-07 22:42                   ` Alex Williamson
2016-06-07 22:42                     ` [Qemu-devel] " Alex Williamson
2016-06-08  1:18                     ` Tian, Kevin
2016-06-08  1:18                       ` [Qemu-devel] " Tian, Kevin
2016-06-08  1:39                       ` Alex Williamson
2016-06-08  1:39                         ` [Qemu-devel] " Alex Williamson
2016-06-08  3:18                         ` Dong Jia
2016-06-08  3:18                           ` [Qemu-devel] " Dong Jia
2016-06-08  3:48                           ` Neo Jia
2016-06-08  3:48                             ` [Qemu-devel] " Neo Jia
2016-06-08  6:13                             ` Dong Jia
2016-06-08  6:13                               ` [Qemu-devel] " Dong Jia
2016-06-08  6:22                               ` Neo Jia
2016-06-08  6:22                                 ` [Qemu-devel] " Neo Jia
2016-06-08  4:29                           ` Alex Williamson
2016-06-08  4:29                             ` [Qemu-devel] " Alex Williamson
2016-06-15  6:37                             ` Dong Jia
2016-06-15  6:37                               ` [Qemu-devel] " Dong Jia
2016-05-24 19:58 ` [RFC PATCH v4 2/3] VFIO driver for mediated PCI device Kirti Wankhede
2016-05-24 19:58   ` [Qemu-devel] " Kirti Wankhede
2016-05-25  8:15   ` Tian, Kevin
2016-05-25  8:15     ` [Qemu-devel] " Tian, Kevin
2016-05-25 13:04     ` Kirti Wankhede
2016-05-25 13:04       ` [Qemu-devel] " Kirti Wankhede
2016-05-27 10:03       ` Tian, Kevin
2016-05-27 10:03         ` [Qemu-devel] " Tian, Kevin
2016-05-27 15:13         ` Alex Williamson
2016-05-27 15:13           ` [Qemu-devel] " Alex Williamson
2016-05-24 19:58 ` [RFC PATCH v4 3/3] VFIO Type1 IOMMU: Add support for mediated devices Kirti Wankhede
2016-05-24 19:58   ` [Qemu-devel] " Kirti Wankhede
2016-06-01  8:40   ` Dong Jia
2016-06-01  8:40     ` [Qemu-devel] " Dong Jia
2016-06-02  7:56     ` Neo Jia
2016-06-02  7:56       ` [Qemu-devel] " Neo Jia
2016-06-03  8:32       ` Dong Jia
2016-06-03  8:32         ` [Qemu-devel] " Dong Jia
2016-06-03  8:37         ` Tian, Kevin
2016-06-03  8:37           ` [Qemu-devel] " Tian, Kevin
2016-05-25  7:13 ` [RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support] Tian, Kevin
2016-05-25  7:13   ` [Qemu-devel] " Tian, Kevin
2016-05-25 13:43   ` Alex Williamson
2016-05-25 13:43     ` [Qemu-devel] " Alex Williamson
2016-05-27 11:02     ` Tian, Kevin
2016-05-27 11:02       ` [Qemu-devel] " Tian, Kevin
2016-05-27 14:54       ` Alex Williamson
2016-05-27 14:54         ` [Qemu-devel] " Alex Williamson
2016-05-27 22:43         ` Tian, Kevin
2016-05-27 22:43           ` [Qemu-devel] " Tian, Kevin
2016-05-28 14:56           ` Alex Williamson
2016-05-28 14:56             ` [Qemu-devel] " Alex Williamson
2016-05-31  2:29             ` Jike Song
2016-05-31  2:29               ` [Qemu-devel] " Jike Song
2016-05-31 14:29               ` Alex Williamson
2016-05-31 14:29                 ` [Qemu-devel] " Alex Williamson
2016-06-02  2:11                 ` Jike Song
2016-06-02  2:11                   ` [Qemu-devel] " Jike Song