All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v2 1/3] vGPU Core driver
@ 2016-02-23 16:24 ` Kirti Wankhede
  0 siblings, 0 replies; 38+ messages in thread
From: Kirti Wankhede @ 2016-02-23 16:24 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	Kirti Wankhede, Neo Jia

Design for vGPU Driver:
Main purpose of vGPU driver is to provide a common interface for vGPU
management that can be used by differnt GPU drivers.

This module would provide a generic interface to create the device, add
it to vGPU bus, add device to IOMMU group and then add it to vfio group.

High Level block diagram:

+--------------+    vgpu_register_driver()+---------------+
|     __init() +------------------------->+               |
|              |                          |               |
|              +<-------------------------+    vgpu.ko    |
| vgpu_vfio.ko |   probe()/remove()       |               |
|              |                +---------+               +---------+
+--------------+                |         +-------+-------+         |
                                |                 ^                 |
                                | callback        |                 |
                                |         +-------+--------+        |
                                |         |vgpu_register_device()   |
                                |         |                |        |
                                +---^-----+-----+    +-----+------+-+
                                    | nvidia.ko |    |  i915.ko   |
                                    |           |    |            |
                                    +-----------+    +------------+

vGPU driver provides two types of registration interfaces:
1. Registration interface for vGPU bus driver:

/**
  * struct vgpu_driver - vGPU device driver
  * @name: driver name
  * @probe: called when new device created
  * @remove: called when device removed
  * @driver: device driver structure
  *
  **/
struct vgpu_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         struct device_driver    driver;
};

int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
void vgpu_unregister_driver(struct vgpu_driver *drv);

VFIO bus driver for vgpu, should use this interface to register with
vGPU driver. With this, VFIO bus driver for vGPU devices is responsible
to add vGPU device to VFIO group.

2. GPU driver interface
GPU driver interface provides GPU driver the set APIs to manage GPU driver
related work in their own driver. APIs are to:
- vgpu_supported_config: provide supported configuration list by the GPU.
- vgpu_create: to allocate basic resouces in GPU driver for a vGPU device.
- vgpu_destroy: to free resources in GPU driver during vGPU device destroy.
- vgpu_start: to initiate vGPU initialization process from GPU driver when VM
  boots and before QEMU starts.
- vgpu_shutdown: to teardown vGPU resources during VM teardown.
- read : read emulation callback.
- write: write emulation callback.
- vgpu_set_irqs: send interrupt configuration information that QEMU sets.
- vgpu_bar_info: to provice BAR size and its flags for the vGPU device.
- validate_map_request: to validate remap pfn request.

This registration interface should be used by GPU drivers to register
each physical device to vGPU driver.

Updated this patch with couple of more functions in GPU driver interface
which were discussed during v1 version of this RFC.

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
---
 drivers/Kconfig             |    2 +
 drivers/Makefile            |    1 +
 drivers/vgpu/Kconfig        |   26 +++
 drivers/vgpu/Makefile       |    4 +
 drivers/vgpu/vgpu-core.c    |  422 +++++++++++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu-driver.c  |  137 ++++++++++++++
 drivers/vgpu/vgpu-sysfs.c   |  366 +++++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_private.h |   36 ++++
 include/linux/vgpu.h        |  217 ++++++++++++++++++++++
 9 files changed, 1211 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vgpu-core.c
 create mode 100644 drivers/vgpu/vgpu-driver.c
 create mode 100644 drivers/vgpu/vgpu-sysfs.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 include/linux/vgpu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339..5fd9eae 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 795d0ca..1c43250 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
 obj-$(CONFIG_VFIO)		+= vfio/
+obj-$(CONFIG_VFIO)		+= vgpu/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
new file mode 100644
index 0000000..698ddf9
--- /dev/null
+++ b/drivers/vgpu/Kconfig
@@ -0,0 +1,26 @@
+
+menuconfig VGPU
+    tristate "VGPU driver framework"
+    depends on VFIO
+    select VGPU_VFIO
+    select VFIO_IOMMU_TYPE1_VGPU
+    help
+        VGPU provides a framework to virtualize GPU without SR-IOV cap
+        See Documentation/vgpu.txt for more details.
+
+        If you don't know what do here, say N.
+
+config VGPU
+    tristate
+    depends on VFIO
+    default n
+
+config VGPU_VFIO
+    tristate
+    depends on VGPU 
+    default n
+
+config VFIO_IOMMU_TYPE1_VGPU
+    tristate
+    depends on VGPU_VFIO
+    default n
diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
new file mode 100644
index 0000000..f5be980
--- /dev/null
+++ b/drivers/vgpu/Makefile
@@ -0,0 +1,4 @@
+
+vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
+
+obj-$(CONFIG_VGPU)			+= vgpu.o
diff --git a/drivers/vgpu/vgpu-core.c b/drivers/vgpu/vgpu-core.c
new file mode 100644
index 0000000..7710021
--- /dev/null
+++ b/drivers/vgpu/vgpu-core.c
@@ -0,0 +1,422 @@
+/*
+ * VGPU Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC	"VGPU Core Driver"
+
+/*
+ * #defines
+ */
+
+#define VGPU_CLASS_NAME		"vgpu"
+
+/*
+ * Global Structures
+ */
+
+static struct vgpu {
+	struct list_head    vgpu_devices_list;
+	struct mutex        vgpu_devices_lock;
+	struct list_head    gpu_devices_list;
+	struct mutex        gpu_devices_lock;
+} vgpu;
+
+static struct class vgpu_class;
+
+/*
+ * Functions
+ */
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->group) {
+			if (iommu_group_id(vdev->group) == iommu_group_id(group)) {
+				mutex_unlock(&vgpu.vgpu_devices_lock);
+				return vdev;
+			}
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+EXPORT_SYMBOL_GPL(get_vgpu_device_from_group);
+
+static int vgpu_add_attribute_group(struct device *dev,
+			            const struct attribute_group **groups)
+{
+        return sysfs_create_groups(&dev->kobj, groups);
+}
+
+static void vgpu_remove_attribute_group(struct device *dev,
+			                const struct attribute_group **groups)
+{
+        sysfs_remove_groups(&dev->kobj, groups);
+}
+
+int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev, *tmp;
+
+	if (!dev)
+		return -EINVAL;
+
+        gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL);
+        if (!gpu_dev)
+                return -ENOMEM;
+
+	gpu_dev->dev = dev;
+        gpu_dev->ops = ops;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+
+        /* Check for duplicates */
+        list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) {
+                if (tmp->dev == dev) {
+			ret = -EINVAL;
+			goto add_error;
+                }
+        }
+
+	ret = vgpu_create_pci_device_files(dev);
+	if (ret)
+		goto add_error;
+
+	ret = vgpu_add_attribute_group(&dev->dev, ops->dev_attr_groups);
+	if (ret)
+		goto add_group_error;
+
+        list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list);
+
+	printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n",
+			 dev->vendor, dev->device, dev->class);
+        mutex_unlock(&vgpu.gpu_devices_lock);
+
+        return 0;
+
+add_group_error:
+	vgpu_remove_pci_device_files(dev);
+add_error:
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	kfree(gpu_dev);
+	return ret;
+
+}
+EXPORT_SYMBOL(vgpu_register_device);
+
+void vgpu_unregister_device(struct pci_dev *dev)
+{
+        struct gpu_device *gpu_dev;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+        list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		struct vgpu_device *vdev = NULL;
+
+                if (gpu_dev->dev != dev)
+			continue;
+
+		printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, class 0x%x\n",
+				dev->vendor, dev->device, dev->class);
+
+		list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+			if (vdev->gpu_dev != gpu_dev)
+				continue;
+			destroy_vgpu_device(vdev);
+		}
+		vgpu_remove_attribute_group(&dev->dev, gpu_dev->ops->dev_attr_groups);
+		vgpu_remove_pci_device_files(dev);
+		list_del(&gpu_dev->gpu_next);
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		kfree(gpu_dev);
+		return;
+        }
+        mutex_unlock(&vgpu.gpu_devices_lock);
+}
+EXPORT_SYMBOL(vgpu_unregister_device);
+
+/*
+ * Helper Functions
+ */
+
+static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char *name)
+{
+	struct vgpu_device *vgpu_dev = NULL;
+
+	vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL);
+	if (!vgpu_dev)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&vgpu_dev->kref);
+	memcpy(&vgpu_dev->uuid, &uuid, sizeof(uuid_le));
+	vgpu_dev->vgpu_instance = instance;
+	strcpy(vgpu_dev->dev_name, name);
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	return vgpu_dev;
+}
+
+static void vgpu_device_free(struct vgpu_device *vgpu_dev)
+{
+	if (vgpu_dev) {
+		mutex_lock(&vgpu.vgpu_devices_lock);
+		list_del(&vgpu_dev->list);
+		mutex_unlock(&vgpu.vgpu_devices_lock);
+		kfree(vgpu_dev);
+	}
+	return;
+}
+
+struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->uuid, uuid) == 0) &&
+		    (vdev->vgpu_instance == instance)) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+static void vgpu_device_release(struct device *dev)
+{
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	vgpu_device_free(vgpu_dev);
+}
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le uuid, uint32_t instance, char *vgpu_params)
+{
+	char name[64];
+	int numChar = 0;
+	int retval = 0;
+	struct vgpu_device *vgpu_dev = NULL;
+	struct gpu_device *gpu_dev;
+
+	printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__);
+
+	numChar = sprintf(name, "%pUb-%d", uuid.b, instance);
+	name[numChar] = '\0';
+
+	vgpu_dev = vgpu_device_alloc(uuid, instance, name);
+	if (IS_ERR(vgpu_dev)) {
+		return PTR_ERR(vgpu_dev);
+	}
+
+	vgpu_dev->dev.parent  = NULL;
+	vgpu_dev->dev.bus     = &vgpu_bus_type;
+	vgpu_dev->dev.release = vgpu_device_release;
+	dev_set_name(&vgpu_dev->dev, "%s", name);
+
+	retval = device_register(&vgpu_dev->dev);
+	if (retval)
+		goto create_failed1;
+
+	printk(KERN_INFO "UUID %pUb \n", vgpu_dev->uuid.b);
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev != pdev)
+			continue;
+
+		vgpu_dev->gpu_dev = gpu_dev;
+		if (gpu_dev->ops->vgpu_create) {
+			retval = gpu_dev->ops->vgpu_create(pdev, vgpu_dev->uuid,
+							   instance, vgpu_params);
+			if (retval) {
+				mutex_unlock(&vgpu.gpu_devices_lock);
+				goto create_failed2;
+			}
+		}
+		break;
+	}
+	if (!vgpu_dev->gpu_dev) {
+		retval = -EINVAL;
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		goto create_failed2;
+	}
+
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	retval = vgpu_add_attribute_group(&vgpu_dev->dev, gpu_dev->ops->vgpu_attr_groups);
+	if (retval)
+		goto create_attr_error;
+
+	return retval;
+
+create_attr_error:
+	if (gpu_dev->ops->vgpu_destroy) {
+		int ret = 0;
+		ret = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
+						 vgpu_dev->uuid,
+						 vgpu_dev->vgpu_instance);
+	}
+
+create_failed2:
+	device_unregister(&vgpu_dev->dev);
+
+create_failed1:
+	vgpu_device_free(vgpu_dev);
+
+	return retval;
+}
+
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev)
+{
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name);
+	if (gpu_dev->ops->vgpu_destroy) {
+		int retval = 0;
+		retval = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
+						    vgpu_dev->uuid,
+						    vgpu_dev->vgpu_instance);
+	/* if vendor driver doesn't return success that means vendor driver doesn't
+	 * support hot-unplug */
+		if (retval)
+			return;
+	}
+
+	vgpu_remove_attribute_group(&vgpu_dev->dev, gpu_dev->ops->vgpu_attr_groups);
+	device_unregister(&vgpu_dev->dev);
+}
+
+void get_vgpu_supported_types(struct device *dev, char *str)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (&gpu_dev->dev->dev == dev) {
+			if (gpu_dev->ops->vgpu_supported_config)
+				gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str);
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+
+int vgpu_start_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (gpu_dev->ops->vgpu_start)
+		ret = gpu_dev->ops->vgpu_start(vgpu_dev->uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (gpu_dev->ops->vgpu_shutdown)
+		ret = gpu_dev->ops->vgpu_shutdown(vgpu_dev->uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+char *vgpu_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev));
+}
+
+static void release_vgpubus_dev(struct device *dev)
+{
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	destroy_vgpu_device(vgpu_dev);
+}
+
+static struct class vgpu_class = {
+	.name		= VGPU_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= vgpu_class_attrs,
+	.dev_groups	= vgpu_dev_groups,
+	.devnode	= vgpu_devnode,
+	.dev_release    = release_vgpubus_dev,
+};
+
+static int __init vgpu_init(void)
+{
+	int rc = 0;
+
+	memset(&vgpu, 0 , sizeof(vgpu));
+
+	mutex_init(&vgpu.vgpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
+	mutex_init(&vgpu.gpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.gpu_devices_list);
+
+	rc = class_register(&vgpu_class);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu class\n");
+		goto failed1;
+	}
+
+	rc = vgpu_bus_register();
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu bus\n");
+		class_unregister(&vgpu_class);
+	}
+
+failed1:
+	return rc;
+}
+
+static void __exit vgpu_exit(void)
+{
+	vgpu_bus_unregister();
+	class_unregister(&vgpu_class);
+}
+
+module_init(vgpu_init)
+module_exit(vgpu_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vgpu/vgpu-driver.c b/drivers/vgpu/vgpu-driver.c
new file mode 100644
index 0000000..6b62f19
--- /dev/null
+++ b/drivers/vgpu/vgpu-driver.c
@@ -0,0 +1,137 @@
+/*
+ * VGPU driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+static int vgpu_device_attach_iommu(struct vgpu_device *vgpu_dev)
+{
+        int retval = 0;
+        struct iommu_group *group = NULL;
+
+        group = iommu_group_alloc();
+        if (IS_ERR(group)) {
+                printk(KERN_ERR "VGPU: failed to allocate group!\n");
+                return PTR_ERR(group);
+        }
+
+        retval = iommu_group_add_device(group, &vgpu_dev->dev);
+        if (retval) {
+                printk(KERN_ERR "VGPU: failed to add dev to group!\n");
+                iommu_group_put(group);
+                return retval;
+        }
+
+        vgpu_dev->group = group;
+
+        printk(KERN_INFO "VGPU: group_id = %d \n", iommu_group_id(group));
+        return retval;
+}
+
+static void vgpu_device_detach_iommu(struct vgpu_device *vgpu_dev)
+{
+        iommu_group_put(vgpu_dev->dev.iommu_group);
+        iommu_group_remove_device(&vgpu_dev->dev);
+        printk(KERN_INFO "VGPU: detaching iommu \n");
+}
+
+static int vgpu_device_probe(struct device *dev)
+{
+	struct vgpu_driver *drv = to_vgpu_driver(dev->driver);
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	int status = 0;
+
+	status = vgpu_device_attach_iommu(vgpu_dev);
+	if (status) {
+		printk(KERN_ERR "Failed to attach IOMMU\n");
+		return status;
+	}
+
+	if (drv && drv->probe) {
+		status = drv->probe(dev);
+	}
+
+	return status;
+}
+
+static int vgpu_device_remove(struct device *dev)
+{
+	struct vgpu_driver *drv = to_vgpu_driver(dev->driver);
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	int status = 0;
+
+	if (drv && drv->remove) {
+		drv->remove(dev);
+	}
+
+	vgpu_device_detach_iommu(vgpu_dev);
+
+	return status;
+}
+
+struct bus_type vgpu_bus_type = {
+	.name		= "vgpu",
+	.probe		= vgpu_device_probe,
+	.remove		= vgpu_device_remove,
+};
+EXPORT_SYMBOL_GPL(vgpu_bus_type);
+
+/**
+ * vgpu_register_driver - register a new vGPU driver
+ * @drv: the driver to register
+ * @owner: owner module of driver ro register
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int vgpu_register_driver(struct vgpu_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &vgpu_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(vgpu_register_driver);
+
+/**
+ * vgpu_unregister_driver - unregister vGPU driver
+ * @drv: the driver to unregister
+ *
+ */
+void vgpu_unregister_driver(struct vgpu_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(vgpu_unregister_driver);
+
+int vgpu_bus_register(void)
+{
+	return bus_register(&vgpu_bus_type);
+}
+
+void vgpu_bus_unregister(void)
+{
+	bus_unregister(&vgpu_bus_type);
+}
+
diff --git a/drivers/vgpu/vgpu-sysfs.c b/drivers/vgpu/vgpu-sysfs.c
new file mode 100644
index 0000000..a1b321b
--- /dev/null
+++ b/drivers/vgpu/vgpu-sysfs.c
@@ -0,0 +1,366 @@
+/*
+ * File attributes for vGPU devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+/* Prototypes */
+
+static ssize_t vgpu_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static DEVICE_ATTR_RO(vgpu_supported_types);
+
+static ssize_t vgpu_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_create);
+
+static ssize_t vgpu_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_destroy);
+
+
+/* Static functions */
+
+static bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < 36)
+		return -1;
+
+	for (i = 0; i < 16; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			printk(KERN_ERR "%s err", __FUNCTION__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+
+/* Functions */
+static ssize_t vgpu_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	char *str;
+	ssize_t n;
+
+        str = kzalloc(sizeof(*str) * 512, GFP_KERNEL);
+        if (!str)
+                return -ENOMEM;
+
+	get_vgpu_supported_types(dev, str);
+
+	n = sprintf(buf,"%s\n", str);
+	kfree(str);
+
+	return n;
+}
+
+static ssize_t vgpu_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	char *str, *pstr;
+	char *uuid_str, *instance_str, *vgpu_params = NULL;
+	uuid_le uuid;
+	uint32_t instance;
+	struct pci_dev *pdev;
+	int ret = 0;
+
+	pstr = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (!str) {
+		printk(KERN_ERR "%s vgpu instance not specified %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if ((instance_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty instance or string %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	instance = (unsigned int)simple_strtoul(instance_str, NULL, 0);
+
+	if (!str) {
+		printk(KERN_ERR "%s vgpu params not specified %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	vgpu_params = kstrdup(str, GFP_KERNEL);
+
+	if (!vgpu_params) {
+		printk(KERN_ERR "%s vgpu params allocation failed \n",
+				 __FUNCTION__);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (dev_is_pci(dev)) {
+		pdev = to_pci_dev(dev);
+
+		if (create_vgpu_device(pdev, uuid, instance, vgpu_params) < 0) {
+			printk(KERN_ERR "%s vgpu create error \n", __FUNCTION__);
+			ret = -EINVAL;
+			goto create_error;
+		}
+		ret = count;
+	}
+
+create_error:
+	if (vgpu_params)
+		kfree(vgpu_params);
+
+	if (pstr)
+		kfree(pstr);
+	return ret;
+}
+
+static ssize_t vgpu_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	char *uuid_str, *str;
+	uuid_le uuid;
+	unsigned int instance;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s instance not specified %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	instance = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	printk(KERN_INFO "%s UUID %pUb - %d \n", __FUNCTION__, uuid.b, instance);
+
+	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, instance);
+
+	if (vgpu_dev)
+		destroy_vgpu_device(vgpu_dev);
+
+	return count;
+}
+
+static ssize_t
+vgpu_uuid_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = to_vgpu_device(dev);
+
+	if (drv)
+		return sprintf(buf, "%pUb \n", drv->uuid.b);
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_uuid);
+
+static ssize_t
+vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = to_vgpu_device(dev);
+
+	if (drv && drv->group)
+		return sprintf(buf, "%d \n", iommu_group_id(drv->group));
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_group_id);
+
+
+static struct attribute *vgpu_dev_attrs[] = {
+	&dev_attr_vgpu_uuid.attr,
+	&dev_attr_vgpu_group_id.attr,
+	NULL,
+};
+
+static const struct attribute_group vgpu_dev_group = {
+	.attrs = vgpu_dev_attrs,
+};
+
+const struct attribute_group *vgpu_dev_groups[] = {
+	&vgpu_dev_group,
+	NULL,
+};
+
+
+ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr,
+			 const char *buf, size_t count)
+{
+	char *uuid_str;
+	uuid_le uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, 0);
+
+	if (vgpu_dev && dev_is_vgpu(&vgpu_dev->dev)) {
+		kobject_uevent(&vgpu_dev->dev.kobj, KOBJ_ONLINE);
+
+		ret = vgpu_start_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_start callback failed  %d \n",
+					 __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *uuid_str;
+	uuid_le uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, 0);
+
+	if (vgpu_dev && dev_is_vgpu(&vgpu_dev->dev)) {
+		kobject_uevent(&vgpu_dev->dev.kobj, KOBJ_OFFLINE);
+
+		ret = vgpu_shutdown_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_shutdown callback failed  %d \n",
+					 __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+struct class_attribute vgpu_class_attrs[] = {
+	__ATTR_WO(vgpu_start),
+	__ATTR_WO(vgpu_shutdown),
+	__ATTR_NULL
+};
+
+int vgpu_create_pci_device_files(struct pci_dev *dev)
+{
+	int retval;
+
+	retval = sysfs_create_file(&dev->dev.kobj,
+				   &dev_attr_vgpu_supported_types.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_supported_types sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs entry\n");
+		return retval;
+	}
+
+	return 0;
+}
+
+
+void vgpu_remove_pci_device_files(struct pci_dev *dev)
+{
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+}
+
diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
new file mode 100644
index 0000000..35158ef
--- /dev/null
+++ b/drivers/vgpu/vgpu_private.h
@@ -0,0 +1,36 @@
+/*
+ * VGPU interal definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_PRIVATE_H
+#define VGPU_PRIVATE_H
+
+struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance);
+
+int  create_vgpu_device(struct pci_dev *pdev, uuid_le uuid, uint32_t instance,
+		       char *vgpu_params);
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
+
+int  vgpu_bus_register(void);
+void vgpu_bus_unregister(void);
+
+/* Function prototypes for vgpu_sysfs */
+
+extern struct class_attribute vgpu_class_attrs[];
+extern const struct attribute_group *vgpu_dev_groups[];
+
+int  vgpu_create_pci_device_files(struct pci_dev *dev);
+void vgpu_remove_pci_device_files(struct pci_dev *dev);
+
+void get_vgpu_supported_types(struct device *dev, char *str);
+int  vgpu_start_callback(struct vgpu_device *vgpu_dev);
+int  vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
+
+#endif /* VGPU_PRIVATE_H */
diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
new file mode 100644
index 0000000..7e1cb4e
--- /dev/null
+++ b/include/linux/vgpu.h
@@ -0,0 +1,217 @@
+/*
+ * VGPU definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_H
+#define VGPU_H
+
+// Common Data structures
+
+struct pci_bar_info {
+	uint64_t start;
+	uint64_t size;
+	uint32_t flags;
+};
+
+enum vgpu_emul_space_e {
+	vgpu_emul_space_config = 0, /*!< PCI configuration space */
+	vgpu_emul_space_io = 1,     /*!< I/O register space */
+	vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
+};
+
+struct gpu_device;
+
+/*
+ * VGPU device
+ */
+struct vgpu_device {
+	struct kref		kref;
+	struct device		dev;
+	struct gpu_device	*gpu_dev;
+	struct iommu_group	*group;
+#define DEVICE_NAME_LEN		(64)
+	char			dev_name[DEVICE_NAME_LEN];
+	uuid_le			uuid;
+	uint32_t		vgpu_instance;
+	struct device_attribute	*dev_attr_vgpu_status;
+	int			vgpu_device_status;
+
+	void			*driver_data;
+
+	struct list_head	list;
+};
+
+
+/**
+ * struct gpu_device_ops - Structure to be registered for each physical GPU to
+ * register the device to vgpu module.
+ *
+ * @owner:			The module owner.
+ * @dev_attr_groups:		Default attributes of the physical device.
+ * @vgpu_attr_groups:		Default attributes of the vGPU device.
+ * @vgpu_supported_config:	Called to get information about supported vgpu types.
+ *				@dev : pci device structure of physical GPU.
+ *				@config: should return string listing supported config
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_create:		Called to allocate basic resouces in graphics
+ *				driver for a particular vgpu.
+ *				@dev: physical pci device structure on which vgpu
+ *				      should be created
+ *				@uuid: VM's uuid for which VM it is intended to
+ *				@instance: vgpu instance in that VM
+ *				@vgpu_params: extra parameters required by GPU driver.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_destroy:		Called to free resources in graphics driver for
+ *				a vgpu instance of that VM.
+ *				@dev: physical pci device structure to which
+ *				this vgpu points to.
+ *				@uuid: VM's uuid for which the vgpu belongs to.
+ *				@instance: vgpu instance in that VM
+ *				Returns integer: success (0) or error (< 0)
+ *				If VM is running and vgpu_destroy is called that
+ *				means the vGPU is being hotunpluged. Return error
+ *				if VM is running and graphics driver doesn't
+ *				support vgpu hotplug.
+ * @vgpu_start:			Called to do initiate vGPU initialization
+ *				process in graphics driver when VM boots before
+ *				qemu starts.
+ *				@uuid: VM's UUID which is booting.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_shutdown:		Called to teardown vGPU related resources for
+ *				the VM
+ *				@uuid: VM's UUID which is shutting down .
+ *				Returns integer: success (0) or error (< 0)
+ * @read:			Read emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: read buffer
+ *				@count: number bytes to read
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				@pos: offset from base address.
+ *				Retuns number on bytes read on success or error.
+ * @write:			Write emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: write buffer
+ *				@count: number bytes to be written
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				@pos: offset from base address.
+ *				Retuns number on bytes written on success or error.
+ * @vgpu_set_irqs:		Called to send about interrupts configuration
+ *				information that qemu set.
+ *				@vdev: vgpu device structure
+ *				@flags, index, start, count and *data : same as
+ *				that of struct vfio_irq_set of
+ *				VFIO_DEVICE_SET_IRQS API.
+ * @vgpu_bar_info:		Called to get BAR size and flags of vGPU device.
+ *				@vdev: vgpu device structure
+ *				@bar_index: BAR index
+ *				@bar_info: output, returns size and flags of
+ *				requested BAR
+ *				Returns integer: success (0) or error (< 0)
+ * @validate_map_request:	Validate remap pfn request
+ *				@vdev: vgpu device structure
+ *				@virtaddr: target user address to start at
+ *				@pfn: physical address of kernel memory, GPU
+ *				driver can change if required.
+ *				@size: size of map area, GPU driver can change
+ *				the size of map area if desired.
+ *				@prot: page protection flags for this mapping,
+ *				GPU driver can change, if required.
+ *				Returns integer: success (0) or error (< 0)
+ *
+ * Physical GPU that support vGPU should be register with vgpu module with
+ * gpu_device_ops structure.
+ */
+
+struct gpu_device_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **vgpu_attr_groups;
+
+	int	(*vgpu_supported_config)(struct pci_dev *dev, char *config);
+	int     (*vgpu_create)(struct pci_dev *dev, uuid_le uuid,
+			       uint32_t instance, char *vgpu_params);
+	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le uuid,
+			        uint32_t instance);
+
+	int     (*vgpu_start)(uuid_le uuid);
+	int     (*vgpu_shutdown)(uuid_le uuid);
+
+	ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
+				 unsigned index, unsigned start, unsigned count,
+				 void *data);
+	int	(*vgpu_bar_info)(struct vgpu_device *vdev, int bar_index,
+				 struct pci_bar_info *bar_info);
+	int	(*validate_map_request)(struct vgpu_device *vdev,
+					unsigned long virtaddr,
+					unsigned long *pfn, unsigned long *size,
+					pgprot_t *prot);
+};
+
+/*
+ * Physical GPU
+ */
+struct gpu_device {
+	struct pci_dev                  *dev;
+	const struct gpu_device_ops     *ops;
+	struct list_head                gpu_next;
+};
+
+/**
+ * struct vgpu_driver - vGPU device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @driver: device driver structure
+ *
+ **/
+struct vgpu_driver {
+	const char *name;
+	int  (*probe)  (struct device *dev);
+	void (*remove) (struct device *dev);
+	struct device_driver	driver;
+};
+
+static inline struct vgpu_driver *to_vgpu_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct vgpu_driver, driver) : NULL;
+}
+
+static inline struct vgpu_device *to_vgpu_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct vgpu_device, dev) : NULL;
+}
+
+extern struct bus_type vgpu_bus_type;
+
+#define dev_is_vgpu(d) ((d)->bus == &vgpu_bus_type)
+
+extern int  vgpu_register_device(struct pci_dev *dev,
+				 const struct gpu_device_ops *ops);
+extern void vgpu_unregister_device(struct pci_dev *dev);
+
+extern int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
+extern void vgpu_unregister_driver(struct vgpu_driver *drv);
+
+extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
+				uint32_t len, uint32_t flags);
+extern int vgpu_dma_do_translate(dma_addr_t * gfn_buffer, uint32_t count);
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
+
+#endif /* VGPU_H */
+
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [Qemu-devel] [RFC PATCH v2 1/3] vGPU Core driver
@ 2016-02-23 16:24 ` Kirti Wankhede
  0 siblings, 0 replies; 38+ messages in thread
From: Kirti Wankhede @ 2016-02-23 16:24 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel
  Cc: shuai.ruan, jike.song, Neo Jia, kvm, kevin.tian, qemu-devel,
	Kirti Wankhede, zhiyuan.lv

Design for vGPU Driver:
Main purpose of vGPU driver is to provide a common interface for vGPU
management that can be used by differnt GPU drivers.

This module would provide a generic interface to create the device, add
it to vGPU bus, add device to IOMMU group and then add it to vfio group.

High Level block diagram:

+--------------+    vgpu_register_driver()+---------------+
|     __init() +------------------------->+               |
|              |                          |               |
|              +<-------------------------+    vgpu.ko    |
| vgpu_vfio.ko |   probe()/remove()       |               |
|              |                +---------+               +---------+
+--------------+                |         +-------+-------+         |
                                |                 ^                 |
                                | callback        |                 |
                                |         +-------+--------+        |
                                |         |vgpu_register_device()   |
                                |         |                |        |
                                +---^-----+-----+    +-----+------+-+
                                    | nvidia.ko |    |  i915.ko   |
                                    |           |    |            |
                                    +-----------+    +------------+

vGPU driver provides two types of registration interfaces:
1. Registration interface for vGPU bus driver:

/**
  * struct vgpu_driver - vGPU device driver
  * @name: driver name
  * @probe: called when new device created
  * @remove: called when device removed
  * @driver: device driver structure
  *
  **/
struct vgpu_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         struct device_driver    driver;
};

int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
void vgpu_unregister_driver(struct vgpu_driver *drv);

VFIO bus driver for vgpu, should use this interface to register with
vGPU driver. With this, VFIO bus driver for vGPU devices is responsible
to add vGPU device to VFIO group.

2. GPU driver interface
GPU driver interface provides GPU driver the set APIs to manage GPU driver
related work in their own driver. APIs are to:
- vgpu_supported_config: provide supported configuration list by the GPU.
- vgpu_create: to allocate basic resouces in GPU driver for a vGPU device.
- vgpu_destroy: to free resources in GPU driver during vGPU device destroy.
- vgpu_start: to initiate vGPU initialization process from GPU driver when VM
  boots and before QEMU starts.
- vgpu_shutdown: to teardown vGPU resources during VM teardown.
- read : read emulation callback.
- write: write emulation callback.
- vgpu_set_irqs: send interrupt configuration information that QEMU sets.
- vgpu_bar_info: to provice BAR size and its flags for the vGPU device.
- validate_map_request: to validate remap pfn request.

This registration interface should be used by GPU drivers to register
each physical device to vGPU driver.

Updated this patch with couple of more functions in GPU driver interface
which were discussed during v1 version of this RFC.

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
---
 drivers/Kconfig             |    2 +
 drivers/Makefile            |    1 +
 drivers/vgpu/Kconfig        |   26 +++
 drivers/vgpu/Makefile       |    4 +
 drivers/vgpu/vgpu-core.c    |  422 +++++++++++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu-driver.c  |  137 ++++++++++++++
 drivers/vgpu/vgpu-sysfs.c   |  366 +++++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_private.h |   36 ++++
 include/linux/vgpu.h        |  217 ++++++++++++++++++++++
 9 files changed, 1211 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vgpu-core.c
 create mode 100644 drivers/vgpu/vgpu-driver.c
 create mode 100644 drivers/vgpu/vgpu-sysfs.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 include/linux/vgpu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339..5fd9eae 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 795d0ca..1c43250 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
 obj-$(CONFIG_VFIO)		+= vfio/
+obj-$(CONFIG_VFIO)		+= vgpu/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
new file mode 100644
index 0000000..698ddf9
--- /dev/null
+++ b/drivers/vgpu/Kconfig
@@ -0,0 +1,26 @@
+
+menuconfig VGPU
+    tristate "VGPU driver framework"
+    depends on VFIO
+    select VGPU_VFIO
+    select VFIO_IOMMU_TYPE1_VGPU
+    help
+        VGPU provides a framework to virtualize GPU without SR-IOV cap
+        See Documentation/vgpu.txt for more details.
+
+        If you don't know what do here, say N.
+
+config VGPU
+    tristate
+    depends on VFIO
+    default n
+
+config VGPU_VFIO
+    tristate
+    depends on VGPU 
+    default n
+
+config VFIO_IOMMU_TYPE1_VGPU
+    tristate
+    depends on VGPU_VFIO
+    default n
diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
new file mode 100644
index 0000000..f5be980
--- /dev/null
+++ b/drivers/vgpu/Makefile
@@ -0,0 +1,4 @@
+
+vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
+
+obj-$(CONFIG_VGPU)			+= vgpu.o
diff --git a/drivers/vgpu/vgpu-core.c b/drivers/vgpu/vgpu-core.c
new file mode 100644
index 0000000..7710021
--- /dev/null
+++ b/drivers/vgpu/vgpu-core.c
@@ -0,0 +1,422 @@
+/*
+ * VGPU Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC	"VGPU Core Driver"
+
+/*
+ * #defines
+ */
+
+#define VGPU_CLASS_NAME		"vgpu"
+
+/*
+ * Global Structures
+ */
+
+static struct vgpu {
+	struct list_head    vgpu_devices_list;
+	struct mutex        vgpu_devices_lock;
+	struct list_head    gpu_devices_list;
+	struct mutex        gpu_devices_lock;
+} vgpu;
+
+static struct class vgpu_class;
+
+/*
+ * Functions
+ */
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->group) {
+			if (iommu_group_id(vdev->group) == iommu_group_id(group)) {
+				mutex_unlock(&vgpu.vgpu_devices_lock);
+				return vdev;
+			}
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+EXPORT_SYMBOL_GPL(get_vgpu_device_from_group);
+
+static int vgpu_add_attribute_group(struct device *dev,
+			            const struct attribute_group **groups)
+{
+        return sysfs_create_groups(&dev->kobj, groups);
+}
+
+static void vgpu_remove_attribute_group(struct device *dev,
+			                const struct attribute_group **groups)
+{
+        sysfs_remove_groups(&dev->kobj, groups);
+}
+
+int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev, *tmp;
+
+	if (!dev)
+		return -EINVAL;
+
+        gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL);
+        if (!gpu_dev)
+                return -ENOMEM;
+
+	gpu_dev->dev = dev;
+        gpu_dev->ops = ops;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+
+        /* Check for duplicates */
+        list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) {
+                if (tmp->dev == dev) {
+			ret = -EINVAL;
+			goto add_error;
+                }
+        }
+
+	ret = vgpu_create_pci_device_files(dev);
+	if (ret)
+		goto add_error;
+
+	ret = vgpu_add_attribute_group(&dev->dev, ops->dev_attr_groups);
+	if (ret)
+		goto add_group_error;
+
+        list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list);
+
+	printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n",
+			 dev->vendor, dev->device, dev->class);
+        mutex_unlock(&vgpu.gpu_devices_lock);
+
+        return 0;
+
+add_group_error:
+	vgpu_remove_pci_device_files(dev);
+add_error:
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	kfree(gpu_dev);
+	return ret;
+
+}
+EXPORT_SYMBOL(vgpu_register_device);
+
+void vgpu_unregister_device(struct pci_dev *dev)
+{
+        struct gpu_device *gpu_dev;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+        list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		struct vgpu_device *vdev = NULL;
+
+                if (gpu_dev->dev != dev)
+			continue;
+
+		printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, class 0x%x\n",
+				dev->vendor, dev->device, dev->class);
+
+		list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+			if (vdev->gpu_dev != gpu_dev)
+				continue;
+			destroy_vgpu_device(vdev);
+		}
+		vgpu_remove_attribute_group(&dev->dev, gpu_dev->ops->dev_attr_groups);
+		vgpu_remove_pci_device_files(dev);
+		list_del(&gpu_dev->gpu_next);
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		kfree(gpu_dev);
+		return;
+        }
+        mutex_unlock(&vgpu.gpu_devices_lock);
+}
+EXPORT_SYMBOL(vgpu_unregister_device);
+
+/*
+ * Helper Functions
+ */
+
+static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char *name)
+{
+	struct vgpu_device *vgpu_dev = NULL;
+
+	vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL);
+	if (!vgpu_dev)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&vgpu_dev->kref);
+	memcpy(&vgpu_dev->uuid, &uuid, sizeof(uuid_le));
+	vgpu_dev->vgpu_instance = instance;
+	strcpy(vgpu_dev->dev_name, name);
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	return vgpu_dev;
+}
+
+static void vgpu_device_free(struct vgpu_device *vgpu_dev)
+{
+	if (vgpu_dev) {
+		mutex_lock(&vgpu.vgpu_devices_lock);
+		list_del(&vgpu_dev->list);
+		mutex_unlock(&vgpu.vgpu_devices_lock);
+		kfree(vgpu_dev);
+	}
+	return;
+}
+
+struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->uuid, uuid) == 0) &&
+		    (vdev->vgpu_instance == instance)) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+static void vgpu_device_release(struct device *dev)
+{
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	vgpu_device_free(vgpu_dev);
+}
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le uuid, uint32_t instance, char *vgpu_params)
+{
+	char name[64];
+	int numChar = 0;
+	int retval = 0;
+	struct vgpu_device *vgpu_dev = NULL;
+	struct gpu_device *gpu_dev;
+
+	printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__);
+
+	numChar = sprintf(name, "%pUb-%d", uuid.b, instance);
+	name[numChar] = '\0';
+
+	vgpu_dev = vgpu_device_alloc(uuid, instance, name);
+	if (IS_ERR(vgpu_dev)) {
+		return PTR_ERR(vgpu_dev);
+	}
+
+	vgpu_dev->dev.parent  = NULL;
+	vgpu_dev->dev.bus     = &vgpu_bus_type;
+	vgpu_dev->dev.release = vgpu_device_release;
+	dev_set_name(&vgpu_dev->dev, "%s", name);
+
+	retval = device_register(&vgpu_dev->dev);
+	if (retval)
+		goto create_failed1;
+
+	printk(KERN_INFO "UUID %pUb \n", vgpu_dev->uuid.b);
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev != pdev)
+			continue;
+
+		vgpu_dev->gpu_dev = gpu_dev;
+		if (gpu_dev->ops->vgpu_create) {
+			retval = gpu_dev->ops->vgpu_create(pdev, vgpu_dev->uuid,
+							   instance, vgpu_params);
+			if (retval) {
+				mutex_unlock(&vgpu.gpu_devices_lock);
+				goto create_failed2;
+			}
+		}
+		break;
+	}
+	if (!vgpu_dev->gpu_dev) {
+		retval = -EINVAL;
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		goto create_failed2;
+	}
+
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	retval = vgpu_add_attribute_group(&vgpu_dev->dev, gpu_dev->ops->vgpu_attr_groups);
+	if (retval)
+		goto create_attr_error;
+
+	return retval;
+
+create_attr_error:
+	if (gpu_dev->ops->vgpu_destroy) {
+		int ret = 0;
+		ret = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
+						 vgpu_dev->uuid,
+						 vgpu_dev->vgpu_instance);
+	}
+
+create_failed2:
+	device_unregister(&vgpu_dev->dev);
+
+create_failed1:
+	vgpu_device_free(vgpu_dev);
+
+	return retval;
+}
+
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev)
+{
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name);
+	if (gpu_dev->ops->vgpu_destroy) {
+		int retval = 0;
+		retval = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
+						    vgpu_dev->uuid,
+						    vgpu_dev->vgpu_instance);
+	/* if vendor driver doesn't return success that means vendor driver doesn't
+	 * support hot-unplug */
+		if (retval)
+			return;
+	}
+
+	vgpu_remove_attribute_group(&vgpu_dev->dev, gpu_dev->ops->vgpu_attr_groups);
+	device_unregister(&vgpu_dev->dev);
+}
+
+void get_vgpu_supported_types(struct device *dev, char *str)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (&gpu_dev->dev->dev == dev) {
+			if (gpu_dev->ops->vgpu_supported_config)
+				gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str);
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+
+int vgpu_start_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (gpu_dev->ops->vgpu_start)
+		ret = gpu_dev->ops->vgpu_start(vgpu_dev->uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (gpu_dev->ops->vgpu_shutdown)
+		ret = gpu_dev->ops->vgpu_shutdown(vgpu_dev->uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+char *vgpu_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev));
+}
+
+static void release_vgpubus_dev(struct device *dev)
+{
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	destroy_vgpu_device(vgpu_dev);
+}
+
+static struct class vgpu_class = {
+	.name		= VGPU_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= vgpu_class_attrs,
+	.dev_groups	= vgpu_dev_groups,
+	.devnode	= vgpu_devnode,
+	.dev_release    = release_vgpubus_dev,
+};
+
+static int __init vgpu_init(void)
+{
+	int rc = 0;
+
+	memset(&vgpu, 0 , sizeof(vgpu));
+
+	mutex_init(&vgpu.vgpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
+	mutex_init(&vgpu.gpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.gpu_devices_list);
+
+	rc = class_register(&vgpu_class);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu class\n");
+		goto failed1;
+	}
+
+	rc = vgpu_bus_register();
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu bus\n");
+		class_unregister(&vgpu_class);
+	}
+
+failed1:
+	return rc;
+}
+
+static void __exit vgpu_exit(void)
+{
+	vgpu_bus_unregister();
+	class_unregister(&vgpu_class);
+}
+
+module_init(vgpu_init)
+module_exit(vgpu_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vgpu/vgpu-driver.c b/drivers/vgpu/vgpu-driver.c
new file mode 100644
index 0000000..6b62f19
--- /dev/null
+++ b/drivers/vgpu/vgpu-driver.c
@@ -0,0 +1,137 @@
+/*
+ * VGPU driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+static int vgpu_device_attach_iommu(struct vgpu_device *vgpu_dev)
+{
+        int retval = 0;
+        struct iommu_group *group = NULL;
+
+        group = iommu_group_alloc();
+        if (IS_ERR(group)) {
+                printk(KERN_ERR "VGPU: failed to allocate group!\n");
+                return PTR_ERR(group);
+        }
+
+        retval = iommu_group_add_device(group, &vgpu_dev->dev);
+        if (retval) {
+                printk(KERN_ERR "VGPU: failed to add dev to group!\n");
+                iommu_group_put(group);
+                return retval;
+        }
+
+        vgpu_dev->group = group;
+
+        printk(KERN_INFO "VGPU: group_id = %d \n", iommu_group_id(group));
+        return retval;
+}
+
+static void vgpu_device_detach_iommu(struct vgpu_device *vgpu_dev)
+{
+        iommu_group_put(vgpu_dev->dev.iommu_group);
+        iommu_group_remove_device(&vgpu_dev->dev);
+        printk(KERN_INFO "VGPU: detaching iommu \n");
+}
+
+static int vgpu_device_probe(struct device *dev)
+{
+	struct vgpu_driver *drv = to_vgpu_driver(dev->driver);
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	int status = 0;
+
+	status = vgpu_device_attach_iommu(vgpu_dev);
+	if (status) {
+		printk(KERN_ERR "Failed to attach IOMMU\n");
+		return status;
+	}
+
+	if (drv && drv->probe) {
+		status = drv->probe(dev);
+	}
+
+	return status;
+}
+
+static int vgpu_device_remove(struct device *dev)
+{
+	struct vgpu_driver *drv = to_vgpu_driver(dev->driver);
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	int status = 0;
+
+	if (drv && drv->remove) {
+		drv->remove(dev);
+	}
+
+	vgpu_device_detach_iommu(vgpu_dev);
+
+	return status;
+}
+
+struct bus_type vgpu_bus_type = {
+	.name		= "vgpu",
+	.probe		= vgpu_device_probe,
+	.remove		= vgpu_device_remove,
+};
+EXPORT_SYMBOL_GPL(vgpu_bus_type);
+
+/**
+ * vgpu_register_driver - register a new vGPU driver
+ * @drv: the driver to register
+ * @owner: owner module of driver ro register
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int vgpu_register_driver(struct vgpu_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &vgpu_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(vgpu_register_driver);
+
+/**
+ * vgpu_unregister_driver - unregister vGPU driver
+ * @drv: the driver to unregister
+ *
+ */
+void vgpu_unregister_driver(struct vgpu_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(vgpu_unregister_driver);
+
+int vgpu_bus_register(void)
+{
+	return bus_register(&vgpu_bus_type);
+}
+
+void vgpu_bus_unregister(void)
+{
+	bus_unregister(&vgpu_bus_type);
+}
+
diff --git a/drivers/vgpu/vgpu-sysfs.c b/drivers/vgpu/vgpu-sysfs.c
new file mode 100644
index 0000000..a1b321b
--- /dev/null
+++ b/drivers/vgpu/vgpu-sysfs.c
@@ -0,0 +1,366 @@
+/*
+ * File attributes for vGPU devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+/* Prototypes */
+
+static ssize_t vgpu_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static DEVICE_ATTR_RO(vgpu_supported_types);
+
+static ssize_t vgpu_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_create);
+
+static ssize_t vgpu_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_destroy);
+
+
+/* Static functions */
+
+static bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < 36)
+		return -1;
+
+	for (i = 0; i < 16; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			printk(KERN_ERR "%s err", __FUNCTION__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+
+/* Functions */
+static ssize_t vgpu_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	char *str;
+	ssize_t n;
+
+        str = kzalloc(sizeof(*str) * 512, GFP_KERNEL);
+        if (!str)
+                return -ENOMEM;
+
+	get_vgpu_supported_types(dev, str);
+
+	n = sprintf(buf,"%s\n", str);
+	kfree(str);
+
+	return n;
+}
+
+static ssize_t vgpu_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	char *str, *pstr;
+	char *uuid_str, *instance_str, *vgpu_params = NULL;
+	uuid_le uuid;
+	uint32_t instance;
+	struct pci_dev *pdev;
+	int ret = 0;
+
+	pstr = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (!str) {
+		printk(KERN_ERR "%s vgpu instance not specified %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if ((instance_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty instance or string %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	instance = (unsigned int)simple_strtoul(instance_str, NULL, 0);
+
+	if (!str) {
+		printk(KERN_ERR "%s vgpu params not specified %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	vgpu_params = kstrdup(str, GFP_KERNEL);
+
+	if (!vgpu_params) {
+		printk(KERN_ERR "%s vgpu params allocation failed \n",
+				 __FUNCTION__);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (dev_is_pci(dev)) {
+		pdev = to_pci_dev(dev);
+
+		if (create_vgpu_device(pdev, uuid, instance, vgpu_params) < 0) {
+			printk(KERN_ERR "%s vgpu create error \n", __FUNCTION__);
+			ret = -EINVAL;
+			goto create_error;
+		}
+		ret = count;
+	}
+
+create_error:
+	if (vgpu_params)
+		kfree(vgpu_params);
+
+	if (pstr)
+		kfree(pstr);
+	return ret;
+}
+
+static ssize_t vgpu_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	char *uuid_str, *str;
+	uuid_le uuid;
+	unsigned int instance;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s instance not specified %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	instance = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	printk(KERN_INFO "%s UUID %pUb - %d \n", __FUNCTION__, uuid.b, instance);
+
+	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, instance);
+
+	if (vgpu_dev)
+		destroy_vgpu_device(vgpu_dev);
+
+	return count;
+}
+
+static ssize_t
+vgpu_uuid_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = to_vgpu_device(dev);
+
+	if (drv)
+		return sprintf(buf, "%pUb \n", drv->uuid.b);
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_uuid);
+
+static ssize_t
+vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = to_vgpu_device(dev);
+
+	if (drv && drv->group)
+		return sprintf(buf, "%d \n", iommu_group_id(drv->group));
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_group_id);
+
+
+static struct attribute *vgpu_dev_attrs[] = {
+	&dev_attr_vgpu_uuid.attr,
+	&dev_attr_vgpu_group_id.attr,
+	NULL,
+};
+
+static const struct attribute_group vgpu_dev_group = {
+	.attrs = vgpu_dev_attrs,
+};
+
+const struct attribute_group *vgpu_dev_groups[] = {
+	&vgpu_dev_group,
+	NULL,
+};
+
+
+ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr,
+			 const char *buf, size_t count)
+{
+	char *uuid_str;
+	uuid_le uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, 0);
+
+	if (vgpu_dev && dev_is_vgpu(&vgpu_dev->dev)) {
+		kobject_uevent(&vgpu_dev->dev.kobj, KOBJ_ONLINE);
+
+		ret = vgpu_start_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_start callback failed  %d \n",
+					 __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *uuid_str;
+	uuid_le uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, 0);
+
+	if (vgpu_dev && dev_is_vgpu(&vgpu_dev->dev)) {
+		kobject_uevent(&vgpu_dev->dev.kobj, KOBJ_OFFLINE);
+
+		ret = vgpu_shutdown_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_shutdown callback failed  %d \n",
+					 __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+struct class_attribute vgpu_class_attrs[] = {
+	__ATTR_WO(vgpu_start),
+	__ATTR_WO(vgpu_shutdown),
+	__ATTR_NULL
+};
+
+int vgpu_create_pci_device_files(struct pci_dev *dev)
+{
+	int retval;
+
+	retval = sysfs_create_file(&dev->dev.kobj,
+				   &dev_attr_vgpu_supported_types.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_supported_types sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs entry\n");
+		return retval;
+	}
+
+	return 0;
+}
+
+
+void vgpu_remove_pci_device_files(struct pci_dev *dev)
+{
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+}
+
diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
new file mode 100644
index 0000000..35158ef
--- /dev/null
+++ b/drivers/vgpu/vgpu_private.h
@@ -0,0 +1,36 @@
+/*
+ * VGPU interal definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_PRIVATE_H
+#define VGPU_PRIVATE_H
+
+struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance);
+
+int  create_vgpu_device(struct pci_dev *pdev, uuid_le uuid, uint32_t instance,
+		       char *vgpu_params);
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
+
+int  vgpu_bus_register(void);
+void vgpu_bus_unregister(void);
+
+/* Function prototypes for vgpu_sysfs */
+
+extern struct class_attribute vgpu_class_attrs[];
+extern const struct attribute_group *vgpu_dev_groups[];
+
+int  vgpu_create_pci_device_files(struct pci_dev *dev);
+void vgpu_remove_pci_device_files(struct pci_dev *dev);
+
+void get_vgpu_supported_types(struct device *dev, char *str);
+int  vgpu_start_callback(struct vgpu_device *vgpu_dev);
+int  vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
+
+#endif /* VGPU_PRIVATE_H */
diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
new file mode 100644
index 0000000..7e1cb4e
--- /dev/null
+++ b/include/linux/vgpu.h
@@ -0,0 +1,217 @@
+/*
+ * VGPU definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_H
+#define VGPU_H
+
+// Common Data structures
+
+struct pci_bar_info {
+	uint64_t start;
+	uint64_t size;
+	uint32_t flags;
+};
+
+enum vgpu_emul_space_e {
+	vgpu_emul_space_config = 0, /*!< PCI configuration space */
+	vgpu_emul_space_io = 1,     /*!< I/O register space */
+	vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
+};
+
+struct gpu_device;
+
+/*
+ * VGPU device
+ */
+struct vgpu_device {
+	struct kref		kref;
+	struct device		dev;
+	struct gpu_device	*gpu_dev;
+	struct iommu_group	*group;
+#define DEVICE_NAME_LEN		(64)
+	char			dev_name[DEVICE_NAME_LEN];
+	uuid_le			uuid;
+	uint32_t		vgpu_instance;
+	struct device_attribute	*dev_attr_vgpu_status;
+	int			vgpu_device_status;
+
+	void			*driver_data;
+
+	struct list_head	list;
+};
+
+
+/**
+ * struct gpu_device_ops - Structure to be registered for each physical GPU to
+ * register the device to vgpu module.
+ *
+ * @owner:			The module owner.
+ * @dev_attr_groups:		Default attributes of the physical device.
+ * @vgpu_attr_groups:		Default attributes of the vGPU device.
+ * @vgpu_supported_config:	Called to get information about supported vgpu types.
+ *				@dev : pci device structure of physical GPU.
+ *				@config: should return string listing supported config
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_create:		Called to allocate basic resouces in graphics
+ *				driver for a particular vgpu.
+ *				@dev: physical pci device structure on which vgpu
+ *				      should be created
+ *				@uuid: VM's uuid for which VM it is intended to
+ *				@instance: vgpu instance in that VM
+ *				@vgpu_params: extra parameters required by GPU driver.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_destroy:		Called to free resources in graphics driver for
+ *				a vgpu instance of that VM.
+ *				@dev: physical pci device structure to which
+ *				this vgpu points to.
+ *				@uuid: VM's uuid for which the vgpu belongs to.
+ *				@instance: vgpu instance in that VM
+ *				Returns integer: success (0) or error (< 0)
+ *				If VM is running and vgpu_destroy is called that
+ *				means the vGPU is being hotunpluged. Return error
+ *				if VM is running and graphics driver doesn't
+ *				support vgpu hotplug.
+ * @vgpu_start:			Called to do initiate vGPU initialization
+ *				process in graphics driver when VM boots before
+ *				qemu starts.
+ *				@uuid: VM's UUID which is booting.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_shutdown:		Called to teardown vGPU related resources for
+ *				the VM
+ *				@uuid: VM's UUID which is shutting down .
+ *				Returns integer: success (0) or error (< 0)
+ * @read:			Read emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: read buffer
+ *				@count: number bytes to read
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				@pos: offset from base address.
+ *				Retuns number on bytes read on success or error.
+ * @write:			Write emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: write buffer
+ *				@count: number bytes to be written
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				@pos: offset from base address.
+ *				Retuns number on bytes written on success or error.
+ * @vgpu_set_irqs:		Called to send about interrupts configuration
+ *				information that qemu set.
+ *				@vdev: vgpu device structure
+ *				@flags, index, start, count and *data : same as
+ *				that of struct vfio_irq_set of
+ *				VFIO_DEVICE_SET_IRQS API.
+ * @vgpu_bar_info:		Called to get BAR size and flags of vGPU device.
+ *				@vdev: vgpu device structure
+ *				@bar_index: BAR index
+ *				@bar_info: output, returns size and flags of
+ *				requested BAR
+ *				Returns integer: success (0) or error (< 0)
+ * @validate_map_request:	Validate remap pfn request
+ *				@vdev: vgpu device structure
+ *				@virtaddr: target user address to start at
+ *				@pfn: physical address of kernel memory, GPU
+ *				driver can change if required.
+ *				@size: size of map area, GPU driver can change
+ *				the size of map area if desired.
+ *				@prot: page protection flags for this mapping,
+ *				GPU driver can change, if required.
+ *				Returns integer: success (0) or error (< 0)
+ *
+ * Physical GPU that support vGPU should be register with vgpu module with
+ * gpu_device_ops structure.
+ */
+
+struct gpu_device_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **vgpu_attr_groups;
+
+	int	(*vgpu_supported_config)(struct pci_dev *dev, char *config);
+	int     (*vgpu_create)(struct pci_dev *dev, uuid_le uuid,
+			       uint32_t instance, char *vgpu_params);
+	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le uuid,
+			        uint32_t instance);
+
+	int     (*vgpu_start)(uuid_le uuid);
+	int     (*vgpu_shutdown)(uuid_le uuid);
+
+	ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
+				 unsigned index, unsigned start, unsigned count,
+				 void *data);
+	int	(*vgpu_bar_info)(struct vgpu_device *vdev, int bar_index,
+				 struct pci_bar_info *bar_info);
+	int	(*validate_map_request)(struct vgpu_device *vdev,
+					unsigned long virtaddr,
+					unsigned long *pfn, unsigned long *size,
+					pgprot_t *prot);
+};
+
+/*
+ * Physical GPU
+ */
+struct gpu_device {
+	struct pci_dev                  *dev;
+	const struct gpu_device_ops     *ops;
+	struct list_head                gpu_next;
+};
+
+/**
+ * struct vgpu_driver - vGPU device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @driver: device driver structure
+ *
+ **/
+struct vgpu_driver {
+	const char *name;
+	int  (*probe)  (struct device *dev);
+	void (*remove) (struct device *dev);
+	struct device_driver	driver;
+};
+
+static inline struct vgpu_driver *to_vgpu_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct vgpu_driver, driver) : NULL;
+}
+
+static inline struct vgpu_device *to_vgpu_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct vgpu_device, dev) : NULL;
+}
+
+extern struct bus_type vgpu_bus_type;
+
+#define dev_is_vgpu(d) ((d)->bus == &vgpu_bus_type)
+
+extern int  vgpu_register_device(struct pci_dev *dev,
+				 const struct gpu_device_ops *ops);
+extern void vgpu_unregister_device(struct pci_dev *dev);
+
+extern int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
+extern void vgpu_unregister_driver(struct vgpu_driver *drv);
+
+extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
+				uint32_t len, uint32_t flags);
+extern int vgpu_dma_do_translate(dma_addr_t * gfn_buffer, uint32_t count);
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
+
+#endif /* VGPU_H */
+
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 2/3] VFIO driver for vGPU device
  2016-02-23 16:24 ` [Qemu-devel] " Kirti Wankhede
@ 2016-02-23 16:24   ` Kirti Wankhede
  -1 siblings, 0 replies; 38+ messages in thread
From: Kirti Wankhede @ 2016-02-23 16:24 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	Kirti Wankhede, Neo Jia

VFIO driver registers with vGPU core driver. vGPU core driver creates vGPU
device and calls probe routine of vGPU VFIO driver. This vGPU VFIO driver adds
vGPU device to VFIO core module.
Main aim of this module is to manage all VFIO APIs for each vGPU device.
Those are:
- get region information from GPU driver.
- trap and emulate PCI config space and BAR region.
- Send interrupt configuration information to GPU driver.
- mmap mappable region with invalidate mapping and fault on access to remap pfn.

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vgpu/Makefile    |    1 +
 drivers/vgpu/vgpu_vfio.c |  664 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 665 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/vgpu_vfio.c

diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
index f5be980..a0a2655 100644
--- a/drivers/vgpu/Makefile
+++ b/drivers/vgpu/Makefile
@@ -2,3 +2,4 @@
 vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
 
 obj-$(CONFIG_VGPU)			+= vgpu.o
+obj-$(CONFIG_VGPU_VFIO)                 += vgpu_vfio.o
diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c
new file mode 100644
index 0000000..dc19630
--- /dev/null
+++ b/drivers/vgpu/vgpu_vfio.c
@@ -0,0 +1,664 @@
+/*
+ * VGPU VFIO device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC     "VGPU VFIO Driver"
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+struct vfio_vgpu_device {
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+	int		    refcnt;
+	struct pci_bar_info bar_info[VFIO_PCI_NUM_REGIONS];
+	u8		    *vconfig;
+};
+
+static DEFINE_MUTEX(vfio_vgpu_lock);
+
+static int get_virtual_bar_info(struct vgpu_device *vgpu_dev,
+				struct pci_bar_info *bar_info,
+				int index)
+{
+	int ret = -1;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	if (gpu_dev->ops->vgpu_bar_info)
+		ret = gpu_dev->ops->vgpu_bar_info(vgpu_dev, index, bar_info);
+	return ret;
+}
+
+static int vdev_read_base(struct vfio_vgpu_device *vdev)
+{
+	int index, pos;
+	u32 start_lo, start_hi;
+	u32 mem_type;
+
+	pos = PCI_BASE_ADDRESS_0;
+
+	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+		if (!vdev->bar_info[index].size)
+			continue;
+
+		start_lo = (*(u32 *)(vdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_MASK;
+		mem_type = (*(u32 *)(vdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+		switch (mem_type) {
+		case PCI_BASE_ADDRESS_MEM_TYPE_64:
+			start_hi = (*(u32 *)(vdev->vconfig + pos + 4));
+			pos += 4;
+			break;
+		case PCI_BASE_ADDRESS_MEM_TYPE_32:
+		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+			/* 1M mem BAR treated as 32-bit BAR */
+		default:
+			/* mem unknown type treated as 32-bit BAR */
+			start_hi = 0;
+			break;
+		}
+		pos += 4;
+		vdev->bar_info[index].start = ((u64)start_hi << 32) | start_lo;
+	}
+	return 0;
+}
+
+static int vgpu_dev_open(void *device_data)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = device_data;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vfio_vgpu_lock);
+
+	if (!vdev->refcnt) {
+		u8 *vconfig;
+		int vconfig_size, index;
+
+		for (index = 0; index < VFIO_PCI_NUM_REGIONS; index++) {
+			ret = get_virtual_bar_info(vdev->vgpu_dev,
+						   &vdev->bar_info[index],
+						   index);
+			if (ret)
+				goto open_error;
+		}
+		vconfig_size = vdev->bar_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
+		if (!vconfig_size)
+			goto open_error;
+
+		vconfig = kzalloc(vconfig_size, GFP_KERNEL);
+		if (!vconfig) {
+			ret = -ENOMEM;
+			goto open_error;
+		}
+
+		vdev->vconfig = vconfig;
+	}
+
+	vdev->refcnt++;
+open_error:
+
+	mutex_unlock(&vfio_vgpu_lock);
+
+	if (ret)
+		module_put(THIS_MODULE);
+
+	return ret;
+}
+
+static void vgpu_dev_close(void *device_data)
+{
+	struct vfio_vgpu_device *vdev = device_data;
+
+	mutex_lock(&vfio_vgpu_lock);
+
+	vdev->refcnt--;
+	if (!vdev->refcnt) {
+		memset(&vdev->bar_info, 0, sizeof(vdev->bar_info));
+		if (vdev->vconfig)
+			kfree(vdev->vconfig);
+	}
+
+	mutex_unlock(&vfio_vgpu_lock);
+
+	module_put(THIS_MODULE);
+}
+
+static int vgpu_get_irq_count(struct vfio_vgpu_device *vdev, int irq_type)
+{
+       return 1;
+}
+
+static long vgpu_dev_unlocked_ioctl(void *device_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd)
+	{
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+		printk(KERN_INFO "%s VFIO_DEVICE_GET_INFO cmd index ", __FUNCTION__);
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		printk(KERN_INFO "%s VFIO_DEVICE_GET_REGION_INFO cmd for region_index %d", __FUNCTION__, info.index);
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vdev->bar_info[info.index].size;
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = vdev->bar_info[info.index].flags;
+			break;
+		case VFIO_PCI_VGA_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = 0xc0000;
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+				break;
+
+		case VFIO_PCI_ROM_REGION_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+
+	}
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	{
+		struct vfio_irq_info info;
+
+		printk(KERN_INFO "%s VFIO_DEVICE_GET_IRQ_INFO cmd", __FUNCTION__);
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
+		case VFIO_PCI_REQ_IRQ_INDEX:
+			break;
+			/* pass thru to return error */
+		default:
+			return -EINVAL;
+		}
+
+		info.count = VFIO_PCI_NUM_IRQS;
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+		info.count = vgpu_get_irq_count(vdev, info.index);
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+					VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+
+	case VFIO_DEVICE_SET_IRQS:
+	{
+		struct vfio_irq_set hdr;
+		struct gpu_device *gpu_dev = vdev->vgpu_dev->gpu_dev;
+		u8 *data = NULL;
+		int ret = 0;
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+		    VFIO_IRQ_SET_ACTION_TYPE_MASK))
+			return -EINVAL;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			size_t size;
+			int max = vgpu_get_irq_count(vdev, hdr.index);
+
+			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+				size = sizeof(uint8_t);
+			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+				size = sizeof(int32_t);
+			else
+				return -EINVAL;
+
+			if (hdr.argsz - minsz < hdr.count * size ||
+			    hdr.start >= max || hdr.start + hdr.count > max)
+				return -EINVAL;
+
+			data = memdup_user((void __user *)(arg + minsz),
+						hdr.count * size);
+				if (IS_ERR(data))
+					return PTR_ERR(data);
+
+			}
+
+			if (gpu_dev->ops->vgpu_set_irqs) {
+				ret = gpu_dev->ops->vgpu_set_irqs(vdev->vgpu_dev,
+								  hdr.flags,
+								  hdr.index, hdr.start,
+								  hdr.count, data);
+			}
+			kfree(data);
+			return ret;
+		}
+
+		default:
+			return -EINVAL;
+	}
+	return ret;
+}
+
+ssize_t vgpu_dev_config_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+	int cfg_size = vdev->bar_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
+	int ret = 0;
+	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos < 0 || pos >= cfg_size ||
+	    pos + count > cfg_size) {
+		printk(KERN_ERR "%s pos 0x%llx out of range\n", __FUNCTION__, pos);
+		ret = -EFAULT;
+		goto config_rw_exit;
+	}
+
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto config_rw_exit;
+		}
+
+		if (gpu_dev->ops->write) {
+			ret = gpu_dev->ops->write(vgpu_dev,
+						  user_data,
+						  count,
+						  vgpu_emul_space_config,
+						  pos);
+		}
+
+		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kzalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (gpu_dev->ops->read) {
+			ret = gpu_dev->ops->read(vgpu_dev,
+						 ret_data,
+						 count,
+						 vgpu_emul_space_config,
+						 pos);
+		}
+
+		if (ret > 0 ) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+				kfree(ret_data);
+				goto config_rw_exit;
+			}
+
+			memcpy((void *)(vdev->vconfig + pos), (void *)ret_data, count);
+		}
+		kfree(ret_data);
+	}
+config_rw_exit:
+	return ret;
+}
+
+ssize_t vgpu_dev_bar_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t pos;
+	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	int ret = 0;
+
+	if (!vdev->bar_info[bar_index].start) {
+		ret = vdev_read_base(vdev);
+		if (ret)
+			goto bar_rw_exit;
+	}
+
+	if (offset >= vdev->bar_info[bar_index].size) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	pos = vdev->bar_info[bar_index].start + offset;
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto bar_rw_exit;
+		}
+
+		if (gpu_dev->ops->write) {
+			ret = gpu_dev->ops->write(vgpu_dev,
+						  user_data,
+						  count,
+						  vgpu_emul_space_mmio,
+						  pos);
+		}
+
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (gpu_dev->ops->read) {
+			ret = gpu_dev->ops->read(vgpu_dev,
+						 ret_data,
+						 count,
+						 vgpu_emul_space_mmio,
+						 pos);
+		}
+
+		if (ret > 0 ) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+			}
+		}
+		kfree(ret_data);
+	}
+
+bar_rw_exit:
+	return ret;
+}
+
+
+static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_vgpu_device *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+		return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_ROM_REGION_INDEX:
+	case VFIO_PCI_VGA_REGION_INDEX:
+		break;
+	}
+
+	return -EINVAL;
+}
+
+
+static ssize_t vgpu_dev_read(void *device_data, char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, buf, count, ppos, false);
+
+	return ret;
+}
+
+static ssize_t vgpu_dev_write(void *device_data, const char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, (char *)buf, count, ppos, true);
+
+	return ret;
+}
+
+static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = vma->vm_private_data;
+	struct vgpu_device *vgpu_dev;
+	struct gpu_device *gpu_dev;
+	u64 virtaddr = (u64)vmf->virtual_address;
+	u64 offset, phyaddr;
+	unsigned long req_size, pgoff;
+	pgprot_t pg_prot;
+
+	if (!vdev && !vdev->vgpu_dev)
+		return -EINVAL;
+
+	vgpu_dev = vdev->vgpu_dev;
+	gpu_dev  = vgpu_dev->gpu_dev;
+
+	offset   = vma->vm_pgoff << PAGE_SHIFT;
+	phyaddr  = virtaddr - vma->vm_start + offset;
+	pgoff    = phyaddr >> PAGE_SHIFT;
+	req_size = vma->vm_end - virtaddr;
+	pg_prot  = vma->vm_page_prot;
+
+	if (gpu_dev->ops->validate_map_request) {
+		ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr, &pgoff,
+							 &req_size, &pg_prot);
+		if (ret)
+			return ret;
+
+		if (!req_size)
+			return -EINVAL;
+	}
+
+	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
+
+	return ret | VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vgpu_dev_mmio_ops = {
+	.fault = vgpu_dev_mmio_fault,
+};
+
+
+static int vgpu_dev_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	unsigned int index;
+	struct vfio_vgpu_device *vdev = device_data;
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	struct pci_dev *pdev = vgpu_dev->gpu_dev->dev;
+	unsigned long pgoff;
+
+	loff_t offset = vma->vm_pgoff << PAGE_SHIFT;
+
+	index = VFIO_PCI_OFFSET_TO_INDEX(offset);
+
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+	vma->vm_private_data = vdev;
+	vma->vm_ops = &vgpu_dev_mmio_ops;
+
+	return 0;
+}
+
+static const struct vfio_device_ops vgpu_vfio_dev_ops = {
+	.name		= "vfio-vgpu",
+	.open		= vgpu_dev_open,
+	.release	= vgpu_dev_close,
+	.ioctl		= vgpu_dev_unlocked_ioctl,
+	.read		= vgpu_dev_read,
+	.write		= vgpu_dev_write,
+	.mmap		= vgpu_dev_mmap,
+};
+
+int vgpu_vfio_probe(struct device *dev)
+{
+	struct vfio_vgpu_device *vdev;
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	int ret = 0;
+
+	if (vgpu_dev == NULL)
+		return -EINVAL;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		return -ENOMEM;
+	}
+
+	vdev->vgpu_dev = vgpu_dev;
+	vdev->group = vgpu_dev->group;
+
+	ret = vfio_add_group_dev(dev, &vgpu_vfio_dev_ops, vdev);
+	if (ret)
+		kfree(vdev);
+
+	printk(KERN_INFO "%s ret = %d\n", __FUNCTION__, ret);
+	return ret;
+}
+
+void vgpu_vfio_remove(struct device *dev)
+{
+	struct vfio_vgpu_device *vdev;
+
+	printk(KERN_INFO "%s \n", __FUNCTION__);
+	vdev = vfio_del_group_dev(dev);
+	if (vdev) {
+		printk(KERN_INFO "%s vdev being freed\n", __FUNCTION__);
+		kfree(vdev);
+	}
+}
+
+struct vgpu_driver vgpu_vfio_driver = {
+        .name	= "vgpu-vfio",
+        .probe	= vgpu_vfio_probe,
+        .remove	= vgpu_vfio_remove,
+};
+
+static int __init vgpu_vfio_init(void)
+{
+	printk(KERN_INFO "%s \n", __FUNCTION__);
+	return vgpu_register_driver(&vgpu_vfio_driver, THIS_MODULE);
+}
+
+static void __exit vgpu_vfio_exit(void)
+{
+	printk(KERN_INFO "%s \n", __FUNCTION__);
+	vgpu_unregister_driver(&vgpu_vfio_driver);
+}
+
+module_init(vgpu_vfio_init)
+module_exit(vgpu_vfio_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [Qemu-devel] [RFC PATCH v2 2/3] VFIO driver for vGPU device
@ 2016-02-23 16:24   ` Kirti Wankhede
  0 siblings, 0 replies; 38+ messages in thread
From: Kirti Wankhede @ 2016-02-23 16:24 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel
  Cc: shuai.ruan, jike.song, Neo Jia, kvm, kevin.tian, qemu-devel,
	Kirti Wankhede, zhiyuan.lv

VFIO driver registers with vGPU core driver. vGPU core driver creates vGPU
device and calls probe routine of vGPU VFIO driver. This vGPU VFIO driver adds
vGPU device to VFIO core module.
Main aim of this module is to manage all VFIO APIs for each vGPU device.
Those are:
- get region information from GPU driver.
- trap and emulate PCI config space and BAR region.
- Send interrupt configuration information to GPU driver.
- mmap mappable region with invalidate mapping and fault on access to remap pfn.

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vgpu/Makefile    |    1 +
 drivers/vgpu/vgpu_vfio.c |  664 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 665 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/vgpu_vfio.c

diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
index f5be980..a0a2655 100644
--- a/drivers/vgpu/Makefile
+++ b/drivers/vgpu/Makefile
@@ -2,3 +2,4 @@
 vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
 
 obj-$(CONFIG_VGPU)			+= vgpu.o
+obj-$(CONFIG_VGPU_VFIO)                 += vgpu_vfio.o
diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c
new file mode 100644
index 0000000..dc19630
--- /dev/null
+++ b/drivers/vgpu/vgpu_vfio.c
@@ -0,0 +1,664 @@
+/*
+ * VGPU VFIO device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC     "VGPU VFIO Driver"
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+struct vfio_vgpu_device {
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+	int		    refcnt;
+	struct pci_bar_info bar_info[VFIO_PCI_NUM_REGIONS];
+	u8		    *vconfig;
+};
+
+static DEFINE_MUTEX(vfio_vgpu_lock);
+
+static int get_virtual_bar_info(struct vgpu_device *vgpu_dev,
+				struct pci_bar_info *bar_info,
+				int index)
+{
+	int ret = -1;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	if (gpu_dev->ops->vgpu_bar_info)
+		ret = gpu_dev->ops->vgpu_bar_info(vgpu_dev, index, bar_info);
+	return ret;
+}
+
+static int vdev_read_base(struct vfio_vgpu_device *vdev)
+{
+	int index, pos;
+	u32 start_lo, start_hi;
+	u32 mem_type;
+
+	pos = PCI_BASE_ADDRESS_0;
+
+	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+		if (!vdev->bar_info[index].size)
+			continue;
+
+		start_lo = (*(u32 *)(vdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_MASK;
+		mem_type = (*(u32 *)(vdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+		switch (mem_type) {
+		case PCI_BASE_ADDRESS_MEM_TYPE_64:
+			start_hi = (*(u32 *)(vdev->vconfig + pos + 4));
+			pos += 4;
+			break;
+		case PCI_BASE_ADDRESS_MEM_TYPE_32:
+		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+			/* 1M mem BAR treated as 32-bit BAR */
+		default:
+			/* mem unknown type treated as 32-bit BAR */
+			start_hi = 0;
+			break;
+		}
+		pos += 4;
+		vdev->bar_info[index].start = ((u64)start_hi << 32) | start_lo;
+	}
+	return 0;
+}
+
+static int vgpu_dev_open(void *device_data)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = device_data;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vfio_vgpu_lock);
+
+	if (!vdev->refcnt) {
+		u8 *vconfig;
+		int vconfig_size, index;
+
+		for (index = 0; index < VFIO_PCI_NUM_REGIONS; index++) {
+			ret = get_virtual_bar_info(vdev->vgpu_dev,
+						   &vdev->bar_info[index],
+						   index);
+			if (ret)
+				goto open_error;
+		}
+		vconfig_size = vdev->bar_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
+		if (!vconfig_size)
+			goto open_error;
+
+		vconfig = kzalloc(vconfig_size, GFP_KERNEL);
+		if (!vconfig) {
+			ret = -ENOMEM;
+			goto open_error;
+		}
+
+		vdev->vconfig = vconfig;
+	}
+
+	vdev->refcnt++;
+open_error:
+
+	mutex_unlock(&vfio_vgpu_lock);
+
+	if (ret)
+		module_put(THIS_MODULE);
+
+	return ret;
+}
+
+static void vgpu_dev_close(void *device_data)
+{
+	struct vfio_vgpu_device *vdev = device_data;
+
+	mutex_lock(&vfio_vgpu_lock);
+
+	vdev->refcnt--;
+	if (!vdev->refcnt) {
+		memset(&vdev->bar_info, 0, sizeof(vdev->bar_info));
+		if (vdev->vconfig)
+			kfree(vdev->vconfig);
+	}
+
+	mutex_unlock(&vfio_vgpu_lock);
+
+	module_put(THIS_MODULE);
+}
+
+static int vgpu_get_irq_count(struct vfio_vgpu_device *vdev, int irq_type)
+{
+       return 1;
+}
+
+static long vgpu_dev_unlocked_ioctl(void *device_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd)
+	{
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+		printk(KERN_INFO "%s VFIO_DEVICE_GET_INFO cmd index ", __FUNCTION__);
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		printk(KERN_INFO "%s VFIO_DEVICE_GET_REGION_INFO cmd for region_index %d", __FUNCTION__, info.index);
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vdev->bar_info[info.index].size;
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = vdev->bar_info[info.index].flags;
+			break;
+		case VFIO_PCI_VGA_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = 0xc0000;
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+				break;
+
+		case VFIO_PCI_ROM_REGION_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+
+	}
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	{
+		struct vfio_irq_info info;
+
+		printk(KERN_INFO "%s VFIO_DEVICE_GET_IRQ_INFO cmd", __FUNCTION__);
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
+		case VFIO_PCI_REQ_IRQ_INDEX:
+			break;
+			/* pass thru to return error */
+		default:
+			return -EINVAL;
+		}
+
+		info.count = VFIO_PCI_NUM_IRQS;
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+		info.count = vgpu_get_irq_count(vdev, info.index);
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+					VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+
+	case VFIO_DEVICE_SET_IRQS:
+	{
+		struct vfio_irq_set hdr;
+		struct gpu_device *gpu_dev = vdev->vgpu_dev->gpu_dev;
+		u8 *data = NULL;
+		int ret = 0;
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+		    VFIO_IRQ_SET_ACTION_TYPE_MASK))
+			return -EINVAL;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			size_t size;
+			int max = vgpu_get_irq_count(vdev, hdr.index);
+
+			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+				size = sizeof(uint8_t);
+			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+				size = sizeof(int32_t);
+			else
+				return -EINVAL;
+
+			if (hdr.argsz - minsz < hdr.count * size ||
+			    hdr.start >= max || hdr.start + hdr.count > max)
+				return -EINVAL;
+
+			data = memdup_user((void __user *)(arg + minsz),
+						hdr.count * size);
+				if (IS_ERR(data))
+					return PTR_ERR(data);
+
+			}
+
+			if (gpu_dev->ops->vgpu_set_irqs) {
+				ret = gpu_dev->ops->vgpu_set_irqs(vdev->vgpu_dev,
+								  hdr.flags,
+								  hdr.index, hdr.start,
+								  hdr.count, data);
+			}
+			kfree(data);
+			return ret;
+		}
+
+		default:
+			return -EINVAL;
+	}
+	return ret;
+}
+
+ssize_t vgpu_dev_config_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+	int cfg_size = vdev->bar_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
+	int ret = 0;
+	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos < 0 || pos >= cfg_size ||
+	    pos + count > cfg_size) {
+		printk(KERN_ERR "%s pos 0x%llx out of range\n", __FUNCTION__, pos);
+		ret = -EFAULT;
+		goto config_rw_exit;
+	}
+
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto config_rw_exit;
+		}
+
+		if (gpu_dev->ops->write) {
+			ret = gpu_dev->ops->write(vgpu_dev,
+						  user_data,
+						  count,
+						  vgpu_emul_space_config,
+						  pos);
+		}
+
+		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kzalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (gpu_dev->ops->read) {
+			ret = gpu_dev->ops->read(vgpu_dev,
+						 ret_data,
+						 count,
+						 vgpu_emul_space_config,
+						 pos);
+		}
+
+		if (ret > 0 ) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+				kfree(ret_data);
+				goto config_rw_exit;
+			}
+
+			memcpy((void *)(vdev->vconfig + pos), (void *)ret_data, count);
+		}
+		kfree(ret_data);
+	}
+config_rw_exit:
+	return ret;
+}
+
+ssize_t vgpu_dev_bar_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t pos;
+	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	int ret = 0;
+
+	if (!vdev->bar_info[bar_index].start) {
+		ret = vdev_read_base(vdev);
+		if (ret)
+			goto bar_rw_exit;
+	}
+
+	if (offset >= vdev->bar_info[bar_index].size) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	pos = vdev->bar_info[bar_index].start + offset;
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto bar_rw_exit;
+		}
+
+		if (gpu_dev->ops->write) {
+			ret = gpu_dev->ops->write(vgpu_dev,
+						  user_data,
+						  count,
+						  vgpu_emul_space_mmio,
+						  pos);
+		}
+
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (gpu_dev->ops->read) {
+			ret = gpu_dev->ops->read(vgpu_dev,
+						 ret_data,
+						 count,
+						 vgpu_emul_space_mmio,
+						 pos);
+		}
+
+		if (ret > 0 ) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+			}
+		}
+		kfree(ret_data);
+	}
+
+bar_rw_exit:
+	return ret;
+}
+
+
+static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_vgpu_device *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+		return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_ROM_REGION_INDEX:
+	case VFIO_PCI_VGA_REGION_INDEX:
+		break;
+	}
+
+	return -EINVAL;
+}
+
+
+static ssize_t vgpu_dev_read(void *device_data, char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, buf, count, ppos, false);
+
+	return ret;
+}
+
+static ssize_t vgpu_dev_write(void *device_data, const char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, (char *)buf, count, ppos, true);
+
+	return ret;
+}
+
+static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = vma->vm_private_data;
+	struct vgpu_device *vgpu_dev;
+	struct gpu_device *gpu_dev;
+	u64 virtaddr = (u64)vmf->virtual_address;
+	u64 offset, phyaddr;
+	unsigned long req_size, pgoff;
+	pgprot_t pg_prot;
+
+	if (!vdev && !vdev->vgpu_dev)
+		return -EINVAL;
+
+	vgpu_dev = vdev->vgpu_dev;
+	gpu_dev  = vgpu_dev->gpu_dev;
+
+	offset   = vma->vm_pgoff << PAGE_SHIFT;
+	phyaddr  = virtaddr - vma->vm_start + offset;
+	pgoff    = phyaddr >> PAGE_SHIFT;
+	req_size = vma->vm_end - virtaddr;
+	pg_prot  = vma->vm_page_prot;
+
+	if (gpu_dev->ops->validate_map_request) {
+		ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr, &pgoff,
+							 &req_size, &pg_prot);
+		if (ret)
+			return ret;
+
+		if (!req_size)
+			return -EINVAL;
+	}
+
+	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
+
+	return ret | VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vgpu_dev_mmio_ops = {
+	.fault = vgpu_dev_mmio_fault,
+};
+
+
+static int vgpu_dev_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	unsigned int index;
+	struct vfio_vgpu_device *vdev = device_data;
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	struct pci_dev *pdev = vgpu_dev->gpu_dev->dev;
+	unsigned long pgoff;
+
+	loff_t offset = vma->vm_pgoff << PAGE_SHIFT;
+
+	index = VFIO_PCI_OFFSET_TO_INDEX(offset);
+
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+	vma->vm_private_data = vdev;
+	vma->vm_ops = &vgpu_dev_mmio_ops;
+
+	return 0;
+}
+
+static const struct vfio_device_ops vgpu_vfio_dev_ops = {
+	.name		= "vfio-vgpu",
+	.open		= vgpu_dev_open,
+	.release	= vgpu_dev_close,
+	.ioctl		= vgpu_dev_unlocked_ioctl,
+	.read		= vgpu_dev_read,
+	.write		= vgpu_dev_write,
+	.mmap		= vgpu_dev_mmap,
+};
+
+int vgpu_vfio_probe(struct device *dev)
+{
+	struct vfio_vgpu_device *vdev;
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	int ret = 0;
+
+	if (vgpu_dev == NULL)
+		return -EINVAL;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		return -ENOMEM;
+	}
+
+	vdev->vgpu_dev = vgpu_dev;
+	vdev->group = vgpu_dev->group;
+
+	ret = vfio_add_group_dev(dev, &vgpu_vfio_dev_ops, vdev);
+	if (ret)
+		kfree(vdev);
+
+	printk(KERN_INFO "%s ret = %d\n", __FUNCTION__, ret);
+	return ret;
+}
+
+void vgpu_vfio_remove(struct device *dev)
+{
+	struct vfio_vgpu_device *vdev;
+
+	printk(KERN_INFO "%s \n", __FUNCTION__);
+	vdev = vfio_del_group_dev(dev);
+	if (vdev) {
+		printk(KERN_INFO "%s vdev being freed\n", __FUNCTION__);
+		kfree(vdev);
+	}
+}
+
+struct vgpu_driver vgpu_vfio_driver = {
+        .name	= "vgpu-vfio",
+        .probe	= vgpu_vfio_probe,
+        .remove	= vgpu_vfio_remove,
+};
+
+static int __init vgpu_vfio_init(void)
+{
+	printk(KERN_INFO "%s \n", __FUNCTION__);
+	return vgpu_register_driver(&vgpu_vfio_driver, THIS_MODULE);
+}
+
+static void __exit vgpu_vfio_exit(void)
+{
+	printk(KERN_INFO "%s \n", __FUNCTION__);
+	vgpu_unregister_driver(&vgpu_vfio_driver);
+}
+
+module_init(vgpu_vfio_init)
+module_exit(vgpu_vfio_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-02-23 16:24 ` [Qemu-devel] " Kirti Wankhede
@ 2016-02-23 16:24   ` Kirti Wankhede
  -1 siblings, 0 replies; 38+ messages in thread
From: Kirti Wankhede @ 2016-02-23 16:24 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	Kirti Wankhede, Neo Jia

Aim of this module is to pin and unpin guest memory.
This module provides interface to GPU driver that can be used to map guest
physical memory into its kernel space driver.
Currently this module has duplicate code from vfio_iommu_type1.c
Working on refining functions to reuse existing code in vfio_iommu_type1.c and
with that will add API to unpin pages.
This is for the reference to review the overall design of vGPU.

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vgpu/Makefile                |    1 +
 drivers/vgpu/vfio_iommu_type1_vgpu.c |  423 ++++++++++++++++++++++++++++++++++
 2 files changed, 424 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/vfio_iommu_type1_vgpu.c

diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
index a0a2655..8ace18d 100644
--- a/drivers/vgpu/Makefile
+++ b/drivers/vgpu/Makefile
@@ -3,3 +3,4 @@ vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
 
 obj-$(CONFIG_VGPU)			+= vgpu.o
 obj-$(CONFIG_VGPU_VFIO)                 += vgpu_vfio.o
+obj-$(CONFIG_VFIO_IOMMU_TYPE1_VGPU)     += vfio_iommu_type1_vgpu.o
diff --git a/drivers/vgpu/vfio_iommu_type1_vgpu.c b/drivers/vgpu/vfio_iommu_type1_vgpu.c
new file mode 100644
index 0000000..0b36ae5
--- /dev/null
+++ b/drivers/vgpu/vfio_iommu_type1_vgpu.c
@@ -0,0 +1,423 @@
+/*
+ * VGPU : IOMMU DMA mapping support for VGPU
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC     "VGPU Type1 IOMMU driver for VFIO"
+
+// VFIO structures
+
+struct vfio_iommu_vgpu {
+	struct mutex lock;
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+	struct rb_root dma_list;
+	struct mm_struct * vm_mm;
+};
+
+struct vgpu_vfio_dma {
+	struct rb_node node;
+	dma_addr_t iova;
+	unsigned long vaddr;
+	size_t size;
+	int prot;
+};
+
+/*
+ * VGPU VFIO FOPs definition
+ *
+ */
+
+/*
+ * Duplicated from vfio_link_dma, just quick hack ... should
+ * reuse code later
+ */
+
+static void vgpu_link_dma(struct vfio_iommu_vgpu *iommu,
+			  struct vgpu_vfio_dma *new)
+{
+	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
+	struct vgpu_vfio_dma *dma;
+
+	while (*link) {
+		parent = *link;
+		dma = rb_entry(parent, struct vgpu_vfio_dma, node);
+
+		if (new->iova + new->size <= dma->iova)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &iommu->dma_list);
+}
+
+static struct vgpu_vfio_dma *vgpu_find_dma(struct vfio_iommu_vgpu *iommu,
+					   dma_addr_t start, size_t size)
+{
+	struct rb_node *node = iommu->dma_list.rb_node;
+
+	while (node) {
+		struct vgpu_vfio_dma *dma = rb_entry(node, struct vgpu_vfio_dma, node);
+
+		if (start + size <= dma->iova)
+			node = node->rb_left;
+		else if (start >= dma->iova + dma->size)
+			node = node->rb_right;
+		else
+			return dma;
+	}
+
+	return NULL;
+}
+
+static void vgpu_unlink_dma(struct vfio_iommu_vgpu *iommu, struct vgpu_vfio_dma *old)
+{
+	rb_erase(&old->node, &iommu->dma_list);
+}
+
+static void vgpu_dump_dma(struct vfio_iommu_vgpu *iommu)
+{
+	struct vgpu_vfio_dma *c, *n;
+	uint32_t i = 0;
+
+	rbtree_postorder_for_each_entry_safe(c, n, &iommu->dma_list, node)
+		printk(KERN_INFO "%s: dma[%d] iova:0x%llx, vaddr:0x%lx, size:0x%lx\n",
+		       __FUNCTION__, i++, c->iova, c->vaddr, c->size);
+}
+
+static int vgpu_dma_do_track(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_map *map)
+{
+	dma_addr_t iova = map->iova;
+	unsigned long vaddr = map->vaddr;
+	int ret = 0, prot = 0;
+	struct vgpu_vfio_dma *vgpu_dma;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	if (vgpu_find_dma(vgpu_iommu, map->iova, map->size)) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -EEXIST;
+	}
+
+	vgpu_dma = kzalloc(sizeof(*vgpu_dma), GFP_KERNEL);
+
+	if (!vgpu_dma) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -ENOMEM;
+	}
+
+	vgpu_dma->iova = iova;
+	vgpu_dma->vaddr = vaddr;
+	vgpu_dma->prot = prot;
+	vgpu_dma->size = map->size;
+
+	vgpu_link_dma(vgpu_iommu, vgpu_dma);
+
+	mutex_unlock(&vgpu_iommu->lock);
+	return ret;
+}
+
+static int vgpu_dma_do_untrack(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_unmap *unmap)
+{
+	struct vgpu_vfio_dma *vgpu_dma;
+	size_t unmapped = 0;
+	int ret = 0;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, 0);
+	if (vgpu_dma && vgpu_dma->iova != unmap->iova) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova + unmap->size - 1, 0);
+	if (vgpu_dma && vgpu_dma->iova + vgpu_dma->size != unmap->iova + unmap->size) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	while (( vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, unmap->size))) {
+		unmapped += vgpu_dma->size;
+		vgpu_unlink_dma(vgpu_iommu, vgpu_dma);
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	unmap->size = unmapped;
+
+	return ret;
+}
+
+/* Ugly hack to quickly test single deivce ... */
+
+static struct vfio_iommu_vgpu *_local_iommu = NULL;
+
+int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
+{
+	int i = 0, ret = 0, prot = 0;
+	unsigned long remote_vaddr = 0, pfn = 0;
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	struct vgpu_vfio_dma *vgpu_dma;
+	struct page *page[1];
+	// unsigned long * addr = NULL;
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+
+	prot = IOMMU_READ | IOMMU_WRITE;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	for (i = 0; i < count; i++) {
+		dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
+		vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
+
+		if (!vgpu_dma) {
+			printk(KERN_INFO "%s: fail locate iova[%d]:0x%llx\n", __FUNCTION__, i, iova);
+			ret = -EINVAL;
+			goto unlock;
+		}
+
+		remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
+		printk(KERN_INFO "%s: find dma iova[%d]:0x%llx, vaddr:0x%lx, size:0x%lx, remote_vaddr:0x%lx\n",
+			__FUNCTION__, i, vgpu_dma->iova,
+			vgpu_dma->vaddr, vgpu_dma->size, remote_vaddr);
+
+		if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
+			pfn = page_to_pfn(page[0]);
+			printk(KERN_INFO "%s: pfn[%d]:0x%lx\n", __FUNCTION__, i, pfn);
+			// addr = vmap(page, 1, VM_MAP, PAGE_KERNEL);
+		}
+		else {
+			printk(KERN_INFO "%s: fail to pin pfn[%d]\n", __FUNCTION__, i);
+			ret = -ENOMEM;
+			goto unlock;
+		}
+
+		gfn_buffer[i] = pfn;
+		// vunmap(addr);
+
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+	return ret;
+}
+EXPORT_SYMBOL(vgpu_dma_do_translate);
+
+static void *vfio_iommu_vgpu_open(unsigned long arg)
+{
+	struct vfio_iommu_vgpu *iommu;
+
+	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+
+	if (!iommu)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&iommu->lock);
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	/* TODO: Keep track the v2 vs. v1, for now only assume
+	 * we are v2 due to QEMU code */
+	_local_iommu = iommu;
+	return iommu;
+}
+
+static void vfio_iommu_vgpu_release(void *iommu_data)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	kfree(iommu);
+	printk(KERN_INFO "%s", __FUNCTION__);
+}
+
+static long vfio_iommu_vgpu_ioctl(void *iommu_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	unsigned long minsz;
+	struct vfio_iommu_vgpu *vgpu_iommu = iommu_data;
+
+	switch (cmd) {
+	case VFIO_CHECK_EXTENSION:
+	{
+		if ((arg == VFIO_TYPE1_IOMMU) || (arg == VFIO_TYPE1v2_IOMMU))
+			return 1;
+		else
+			return 0;
+	}
+
+	case VFIO_IOMMU_GET_INFO:
+	{
+		struct vfio_iommu_type1_info info;
+		minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_IOMMU_MAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_map map;
+		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+
+		if (copy_from_user(&map, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (map.argsz < minsz)
+			return -EINVAL;
+
+		printk(KERN_INFO "VGPU-IOMMU:MAP_DMA flags:%d, vaddr:0x%llx, iova:0x%llx, size:0x%llx\n",
+			map.flags, map.vaddr, map.iova, map.size);
+
+		/*
+		 * TODO: Tracking code is mostly duplicated from TYPE1 IOMMU, ideally,
+		 * this should be merged into one single file and reuse data
+		 * structure
+		 *
+		 */
+		ret = vgpu_dma_do_track(vgpu_iommu, &map);
+		break;
+	}
+	case VFIO_IOMMU_UNMAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_unmap unmap;
+
+		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+
+		if (copy_from_user(&unmap, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (unmap.argsz < minsz)
+			return -EINVAL;
+
+		ret = vgpu_dma_do_untrack(vgpu_iommu, &unmap);
+		break;
+	}
+	default:
+	{
+		printk(KERN_INFO "%s cmd default ", __FUNCTION__);
+		ret = -ENOTTY;
+		break;
+	}
+	}
+
+	return ret;
+}
+
+
+static int vfio_iommu_vgpu_attach_group(void *iommu_data,
+		                        struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	vgpu_dev = get_vgpu_device_from_group(iommu_group);
+	if (vgpu_dev) {
+		iommu->vgpu_dev = vgpu_dev;
+		iommu->group = iommu_group;
+
+		/* IOMMU shares the same life cylce as VM MM */
+		iommu->vm_mm = current->mm;
+
+		return 0;
+	}
+	iommu->group = iommu_group;
+	return 1;
+}
+
+static void vfio_iommu_vgpu_detach_group(void *iommu_data,
+		struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+	iommu->vm_mm = NULL;
+	iommu->group = NULL;
+
+	return;
+}
+
+
+static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_driver_ops = {
+	.name           = "vgpu_vfio",
+	.owner          = THIS_MODULE,
+	.open           = vfio_iommu_vgpu_open,
+	.release        = vfio_iommu_vgpu_release,
+	.ioctl          = vfio_iommu_vgpu_ioctl,
+	.attach_group   = vfio_iommu_vgpu_attach_group,
+	.detach_group   = vfio_iommu_vgpu_detach_group,
+};
+
+
+int vgpu_vfio_iommu_init(void)
+{
+	int rc = vfio_register_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vfio iommu, err:%d\n", rc);
+	}
+
+	return rc;
+}
+
+void vgpu_vfio_iommu_exit(void)
+{
+	// unregister vgpu_vfio driver
+	vfio_unregister_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+}
+
+
+module_init(vgpu_vfio_iommu_init);
+module_exit(vgpu_vfio_iommu_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-02-23 16:24   ` Kirti Wankhede
  0 siblings, 0 replies; 38+ messages in thread
From: Kirti Wankhede @ 2016-02-23 16:24 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel
  Cc: shuai.ruan, jike.song, Neo Jia, kvm, kevin.tian, qemu-devel,
	Kirti Wankhede, zhiyuan.lv

Aim of this module is to pin and unpin guest memory.
This module provides interface to GPU driver that can be used to map guest
physical memory into its kernel space driver.
Currently this module has duplicate code from vfio_iommu_type1.c
Working on refining functions to reuse existing code in vfio_iommu_type1.c and
with that will add API to unpin pages.
This is for the reference to review the overall design of vGPU.

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
---
 drivers/vgpu/Makefile                |    1 +
 drivers/vgpu/vfio_iommu_type1_vgpu.c |  423 ++++++++++++++++++++++++++++++++++
 2 files changed, 424 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/vfio_iommu_type1_vgpu.c

diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
index a0a2655..8ace18d 100644
--- a/drivers/vgpu/Makefile
+++ b/drivers/vgpu/Makefile
@@ -3,3 +3,4 @@ vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
 
 obj-$(CONFIG_VGPU)			+= vgpu.o
 obj-$(CONFIG_VGPU_VFIO)                 += vgpu_vfio.o
+obj-$(CONFIG_VFIO_IOMMU_TYPE1_VGPU)     += vfio_iommu_type1_vgpu.o
diff --git a/drivers/vgpu/vfio_iommu_type1_vgpu.c b/drivers/vgpu/vfio_iommu_type1_vgpu.c
new file mode 100644
index 0000000..0b36ae5
--- /dev/null
+++ b/drivers/vgpu/vfio_iommu_type1_vgpu.c
@@ -0,0 +1,423 @@
+/*
+ * VGPU : IOMMU DMA mapping support for VGPU
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/miscdevice.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC     "VGPU Type1 IOMMU driver for VFIO"
+
+// VFIO structures
+
+struct vfio_iommu_vgpu {
+	struct mutex lock;
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+	struct rb_root dma_list;
+	struct mm_struct * vm_mm;
+};
+
+struct vgpu_vfio_dma {
+	struct rb_node node;
+	dma_addr_t iova;
+	unsigned long vaddr;
+	size_t size;
+	int prot;
+};
+
+/*
+ * VGPU VFIO FOPs definition
+ *
+ */
+
+/*
+ * Duplicated from vfio_link_dma, just quick hack ... should
+ * reuse code later
+ */
+
+static void vgpu_link_dma(struct vfio_iommu_vgpu *iommu,
+			  struct vgpu_vfio_dma *new)
+{
+	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
+	struct vgpu_vfio_dma *dma;
+
+	while (*link) {
+		parent = *link;
+		dma = rb_entry(parent, struct vgpu_vfio_dma, node);
+
+		if (new->iova + new->size <= dma->iova)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &iommu->dma_list);
+}
+
+static struct vgpu_vfio_dma *vgpu_find_dma(struct vfio_iommu_vgpu *iommu,
+					   dma_addr_t start, size_t size)
+{
+	struct rb_node *node = iommu->dma_list.rb_node;
+
+	while (node) {
+		struct vgpu_vfio_dma *dma = rb_entry(node, struct vgpu_vfio_dma, node);
+
+		if (start + size <= dma->iova)
+			node = node->rb_left;
+		else if (start >= dma->iova + dma->size)
+			node = node->rb_right;
+		else
+			return dma;
+	}
+
+	return NULL;
+}
+
+static void vgpu_unlink_dma(struct vfio_iommu_vgpu *iommu, struct vgpu_vfio_dma *old)
+{
+	rb_erase(&old->node, &iommu->dma_list);
+}
+
+static void vgpu_dump_dma(struct vfio_iommu_vgpu *iommu)
+{
+	struct vgpu_vfio_dma *c, *n;
+	uint32_t i = 0;
+
+	rbtree_postorder_for_each_entry_safe(c, n, &iommu->dma_list, node)
+		printk(KERN_INFO "%s: dma[%d] iova:0x%llx, vaddr:0x%lx, size:0x%lx\n",
+		       __FUNCTION__, i++, c->iova, c->vaddr, c->size);
+}
+
+static int vgpu_dma_do_track(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_map *map)
+{
+	dma_addr_t iova = map->iova;
+	unsigned long vaddr = map->vaddr;
+	int ret = 0, prot = 0;
+	struct vgpu_vfio_dma *vgpu_dma;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	if (vgpu_find_dma(vgpu_iommu, map->iova, map->size)) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -EEXIST;
+	}
+
+	vgpu_dma = kzalloc(sizeof(*vgpu_dma), GFP_KERNEL);
+
+	if (!vgpu_dma) {
+		mutex_unlock(&vgpu_iommu->lock);
+		return -ENOMEM;
+	}
+
+	vgpu_dma->iova = iova;
+	vgpu_dma->vaddr = vaddr;
+	vgpu_dma->prot = prot;
+	vgpu_dma->size = map->size;
+
+	vgpu_link_dma(vgpu_iommu, vgpu_dma);
+
+	mutex_unlock(&vgpu_iommu->lock);
+	return ret;
+}
+
+static int vgpu_dma_do_untrack(struct vfio_iommu_vgpu * vgpu_iommu,
+	struct vfio_iommu_type1_dma_unmap *unmap)
+{
+	struct vgpu_vfio_dma *vgpu_dma;
+	size_t unmapped = 0;
+	int ret = 0;
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, 0);
+	if (vgpu_dma && vgpu_dma->iova != unmap->iova) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova + unmap->size - 1, 0);
+	if (vgpu_dma && vgpu_dma->iova + vgpu_dma->size != unmap->iova + unmap->size) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	while (( vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, unmap->size))) {
+		unmapped += vgpu_dma->size;
+		vgpu_unlink_dma(vgpu_iommu, vgpu_dma);
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	unmap->size = unmapped;
+
+	return ret;
+}
+
+/* Ugly hack to quickly test single deivce ... */
+
+static struct vfio_iommu_vgpu *_local_iommu = NULL;
+
+int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
+{
+	int i = 0, ret = 0, prot = 0;
+	unsigned long remote_vaddr = 0, pfn = 0;
+	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
+	struct vgpu_vfio_dma *vgpu_dma;
+	struct page *page[1];
+	// unsigned long * addr = NULL;
+	struct mm_struct *mm = vgpu_iommu->vm_mm;
+
+	prot = IOMMU_READ | IOMMU_WRITE;
+
+	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
+
+	mutex_lock(&vgpu_iommu->lock);
+
+	vgpu_dump_dma(vgpu_iommu);
+
+	for (i = 0; i < count; i++) {
+		dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
+		vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
+
+		if (!vgpu_dma) {
+			printk(KERN_INFO "%s: fail locate iova[%d]:0x%llx\n", __FUNCTION__, i, iova);
+			ret = -EINVAL;
+			goto unlock;
+		}
+
+		remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
+		printk(KERN_INFO "%s: find dma iova[%d]:0x%llx, vaddr:0x%lx, size:0x%lx, remote_vaddr:0x%lx\n",
+			__FUNCTION__, i, vgpu_dma->iova,
+			vgpu_dma->vaddr, vgpu_dma->size, remote_vaddr);
+
+		if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
+			pfn = page_to_pfn(page[0]);
+			printk(KERN_INFO "%s: pfn[%d]:0x%lx\n", __FUNCTION__, i, pfn);
+			// addr = vmap(page, 1, VM_MAP, PAGE_KERNEL);
+		}
+		else {
+			printk(KERN_INFO "%s: fail to pin pfn[%d]\n", __FUNCTION__, i);
+			ret = -ENOMEM;
+			goto unlock;
+		}
+
+		gfn_buffer[i] = pfn;
+		// vunmap(addr);
+
+	}
+
+unlock:
+	mutex_unlock(&vgpu_iommu->lock);
+	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
+	return ret;
+}
+EXPORT_SYMBOL(vgpu_dma_do_translate);
+
+static void *vfio_iommu_vgpu_open(unsigned long arg)
+{
+	struct vfio_iommu_vgpu *iommu;
+
+	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+
+	if (!iommu)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&iommu->lock);
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	/* TODO: Keep track the v2 vs. v1, for now only assume
+	 * we are v2 due to QEMU code */
+	_local_iommu = iommu;
+	return iommu;
+}
+
+static void vfio_iommu_vgpu_release(void *iommu_data)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	kfree(iommu);
+	printk(KERN_INFO "%s", __FUNCTION__);
+}
+
+static long vfio_iommu_vgpu_ioctl(void *iommu_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	unsigned long minsz;
+	struct vfio_iommu_vgpu *vgpu_iommu = iommu_data;
+
+	switch (cmd) {
+	case VFIO_CHECK_EXTENSION:
+	{
+		if ((arg == VFIO_TYPE1_IOMMU) || (arg == VFIO_TYPE1v2_IOMMU))
+			return 1;
+		else
+			return 0;
+	}
+
+	case VFIO_IOMMU_GET_INFO:
+	{
+		struct vfio_iommu_type1_info info;
+		minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = 0;
+
+		return copy_to_user((void __user *)arg, &info, minsz);
+	}
+	case VFIO_IOMMU_MAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_map map;
+		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+
+		if (copy_from_user(&map, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (map.argsz < minsz)
+			return -EINVAL;
+
+		printk(KERN_INFO "VGPU-IOMMU:MAP_DMA flags:%d, vaddr:0x%llx, iova:0x%llx, size:0x%llx\n",
+			map.flags, map.vaddr, map.iova, map.size);
+
+		/*
+		 * TODO: Tracking code is mostly duplicated from TYPE1 IOMMU, ideally,
+		 * this should be merged into one single file and reuse data
+		 * structure
+		 *
+		 */
+		ret = vgpu_dma_do_track(vgpu_iommu, &map);
+		break;
+	}
+	case VFIO_IOMMU_UNMAP_DMA:
+	{
+		// TODO
+		struct vfio_iommu_type1_dma_unmap unmap;
+
+		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+
+		if (copy_from_user(&unmap, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (unmap.argsz < minsz)
+			return -EINVAL;
+
+		ret = vgpu_dma_do_untrack(vgpu_iommu, &unmap);
+		break;
+	}
+	default:
+	{
+		printk(KERN_INFO "%s cmd default ", __FUNCTION__);
+		ret = -ENOTTY;
+		break;
+	}
+	}
+
+	return ret;
+}
+
+
+static int vfio_iommu_vgpu_attach_group(void *iommu_data,
+		                        struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+
+	vgpu_dev = get_vgpu_device_from_group(iommu_group);
+	if (vgpu_dev) {
+		iommu->vgpu_dev = vgpu_dev;
+		iommu->group = iommu_group;
+
+		/* IOMMU shares the same life cylce as VM MM */
+		iommu->vm_mm = current->mm;
+
+		return 0;
+	}
+	iommu->group = iommu_group;
+	return 1;
+}
+
+static void vfio_iommu_vgpu_detach_group(void *iommu_data,
+		struct iommu_group *iommu_group)
+{
+	struct vfio_iommu_vgpu *iommu = iommu_data;
+
+	printk(KERN_INFO "%s", __FUNCTION__);
+	iommu->vm_mm = NULL;
+	iommu->group = NULL;
+
+	return;
+}
+
+
+static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_driver_ops = {
+	.name           = "vgpu_vfio",
+	.owner          = THIS_MODULE,
+	.open           = vfio_iommu_vgpu_open,
+	.release        = vfio_iommu_vgpu_release,
+	.ioctl          = vfio_iommu_vgpu_ioctl,
+	.attach_group   = vfio_iommu_vgpu_attach_group,
+	.detach_group   = vfio_iommu_vgpu_detach_group,
+};
+
+
+int vgpu_vfio_iommu_init(void)
+{
+	int rc = vfio_register_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vfio iommu, err:%d\n", rc);
+	}
+
+	return rc;
+}
+
+void vgpu_vfio_iommu_exit(void)
+{
+	// unregister vgpu_vfio driver
+	vfio_unregister_iommu_driver(&vfio_iommu_vgpu_driver_ops);
+	printk(KERN_INFO "%s\n", __FUNCTION__);
+}
+
+
+module_init(vgpu_vfio_iommu_init);
+module_exit(vgpu_vfio_iommu_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* RE: [RFC PATCH v2 1/3] vGPU Core driver
  2016-02-23 16:24 ` [Qemu-devel] " Kirti Wankhede
@ 2016-02-29  5:39   ` Tian, Kevin
  -1 siblings, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2016-02-29  5:39 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel
  Cc: qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan, Neo Jia

> From: Kirti Wankhede
> Sent: Wednesday, February 24, 2016 12:24 AM
> 
> Design for vGPU Driver:
> Main purpose of vGPU driver is to provide a common interface for vGPU
> management that can be used by differnt GPU drivers.
> 
> This module would provide a generic interface to create the device, add
> it to vGPU bus, add device to IOMMU group and then add it to vfio group.
> 
> High Level block diagram:
> 
> +--------------+    vgpu_register_driver()+---------------+
> |     __init() +------------------------->+               |
> |              |                          |               |
> |              +<-------------------------+    vgpu.ko    |
> | vgpu_vfio.ko |   probe()/remove()       |               |
> |              |                +---------+               +---------+
> +--------------+                |         +-------+-------+         |
>                                 |                 ^                 |
>                                 | callback        |                 |
>                                 |         +-------+--------+        |
>                                 |         |vgpu_register_device()   |
>                                 |         |                |        |
>                                 +---^-----+-----+    +-----+------+-+
>                                     | nvidia.ko |    |  i915.ko   |
>                                     |           |    |            |
>                                     +-----------+    +------------+
> 
> vGPU driver provides two types of registration interfaces:
> 1. Registration interface for vGPU bus driver:
> 
> /**
>   * struct vgpu_driver - vGPU device driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove: called when device removed
>   * @driver: device driver structure
>   *
>   **/
> struct vgpu_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
> void vgpu_unregister_driver(struct vgpu_driver *drv);
> 
> VFIO bus driver for vgpu, should use this interface to register with
> vGPU driver. With this, VFIO bus driver for vGPU devices is responsible
> to add vGPU device to VFIO group.
> 
> 2. GPU driver interface
> GPU driver interface provides GPU driver the set APIs to manage GPU driver
> related work in their own driver. APIs are to:
> - vgpu_supported_config: provide supported configuration list by the GPU.
> - vgpu_create: to allocate basic resouces in GPU driver for a vGPU device.
> - vgpu_destroy: to free resources in GPU driver during vGPU device destroy.
> - vgpu_start: to initiate vGPU initialization process from GPU driver when VM
>   boots and before QEMU starts.
> - vgpu_shutdown: to teardown vGPU resources during VM teardown.
> - read : read emulation callback.
> - write: write emulation callback.
> - vgpu_set_irqs: send interrupt configuration information that QEMU sets.
> - vgpu_bar_info: to provice BAR size and its flags for the vGPU device.
> - validate_map_request: to validate remap pfn request.
> 
> This registration interface should be used by GPU drivers to register
> each physical device to vGPU driver.
> 
> Updated this patch with couple of more functions in GPU driver interface
> which were discussed during v1 version of this RFC.
> 
> Thanks,
> Kirti.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>

Hi, Kirti/Neo,

Thanks a lot for you updated version. Having not looked into detail
code, first come with some high level comments.

First, in a glimpse the majority of the code (possibly >95%) is device
agnostic, though we call it vgpu today. Just thinking about the
extensibility and usability of this framework, would it be better to 
name it in a way that any other type of I/O device can be fit into 
this framework? I don't have a good idea of the name now, but 
a simple idea is to replace vgpu with vdev (vdev-core, vfio-vdev,
vfio-iommu-type1-vdev, etc.), and then underlying GPU drivers are
just one category of users of this general vdev framework. In the
future it's easily extended to support other I/O virtualization based 
on similar vgpu concept;

Second, are these 3 patches already working with nvidia device,
or are they just conceptual implementation w/o completing actual
test yet? We'll start moving our implementation toward this direction
too, so would be good to know the current status and how we can
further cooperate to move forward. Based on that we can start 
giving more comments on next level detail.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/3] vGPU Core driver
@ 2016-02-29  5:39   ` Tian, Kevin
  0 siblings, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2016-02-29  5:39 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel
  Cc: Ruan, Shuai, Song, Jike, Neo Jia, kvm, qemu-devel, Lv, Zhiyuan

> From: Kirti Wankhede
> Sent: Wednesday, February 24, 2016 12:24 AM
> 
> Design for vGPU Driver:
> Main purpose of vGPU driver is to provide a common interface for vGPU
> management that can be used by differnt GPU drivers.
> 
> This module would provide a generic interface to create the device, add
> it to vGPU bus, add device to IOMMU group and then add it to vfio group.
> 
> High Level block diagram:
> 
> +--------------+    vgpu_register_driver()+---------------+
> |     __init() +------------------------->+               |
> |              |                          |               |
> |              +<-------------------------+    vgpu.ko    |
> | vgpu_vfio.ko |   probe()/remove()       |               |
> |              |                +---------+               +---------+
> +--------------+                |         +-------+-------+         |
>                                 |                 ^                 |
>                                 | callback        |                 |
>                                 |         +-------+--------+        |
>                                 |         |vgpu_register_device()   |
>                                 |         |                |        |
>                                 +---^-----+-----+    +-----+------+-+
>                                     | nvidia.ko |    |  i915.ko   |
>                                     |           |    |            |
>                                     +-----------+    +------------+
> 
> vGPU driver provides two types of registration interfaces:
> 1. Registration interface for vGPU bus driver:
> 
> /**
>   * struct vgpu_driver - vGPU device driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove: called when device removed
>   * @driver: device driver structure
>   *
>   **/
> struct vgpu_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
> void vgpu_unregister_driver(struct vgpu_driver *drv);
> 
> VFIO bus driver for vgpu, should use this interface to register with
> vGPU driver. With this, VFIO bus driver for vGPU devices is responsible
> to add vGPU device to VFIO group.
> 
> 2. GPU driver interface
> GPU driver interface provides GPU driver the set APIs to manage GPU driver
> related work in their own driver. APIs are to:
> - vgpu_supported_config: provide supported configuration list by the GPU.
> - vgpu_create: to allocate basic resouces in GPU driver for a vGPU device.
> - vgpu_destroy: to free resources in GPU driver during vGPU device destroy.
> - vgpu_start: to initiate vGPU initialization process from GPU driver when VM
>   boots and before QEMU starts.
> - vgpu_shutdown: to teardown vGPU resources during VM teardown.
> - read : read emulation callback.
> - write: write emulation callback.
> - vgpu_set_irqs: send interrupt configuration information that QEMU sets.
> - vgpu_bar_info: to provice BAR size and its flags for the vGPU device.
> - validate_map_request: to validate remap pfn request.
> 
> This registration interface should be used by GPU drivers to register
> each physical device to vGPU driver.
> 
> Updated this patch with couple of more functions in GPU driver interface
> which were discussed during v1 version of this RFC.
> 
> Thanks,
> Kirti.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>

Hi, Kirti/Neo,

Thanks a lot for you updated version. Having not looked into detail
code, first come with some high level comments.

First, in a glimpse the majority of the code (possibly >95%) is device
agnostic, though we call it vgpu today. Just thinking about the
extensibility and usability of this framework, would it be better to 
name it in a way that any other type of I/O device can be fit into 
this framework? I don't have a good idea of the name now, but 
a simple idea is to replace vgpu with vdev (vdev-core, vfio-vdev,
vfio-iommu-type1-vdev, etc.), and then underlying GPU drivers are
just one category of users of this general vdev framework. In the
future it's easily extended to support other I/O virtualization based 
on similar vgpu concept;

Second, are these 3 patches already working with nvidia device,
or are they just conceptual implementation w/o completing actual
test yet? We'll start moving our implementation toward this direction
too, so would be good to know the current status and how we can
further cooperate to move forward. Based on that we can start 
giving more comments on next level detail.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 1/3] vGPU Core driver
  2016-02-29  5:39   ` [Qemu-devel] " Tian, Kevin
@ 2016-02-29 23:17     ` Neo Jia
  -1 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-02-29 23:17 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Ruan, Shuai, Song, Jike, alex.williamson, kvm, qemu-devel,
	Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan

On Mon, Feb 29, 2016 at 05:39:02AM +0000, Tian, Kevin wrote:
> > From: Kirti Wankhede
> > Sent: Wednesday, February 24, 2016 12:24 AM
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Signed-off-by: Neo Jia <cjia@nvidia.com>
> 
> Hi, Kirti/Neo,
> 
> Thanks a lot for you updated version. Having not looked into detail
> code, first come with some high level comments.
> 
> First, in a glimpse the majority of the code (possibly >95%) is device
> agnostic, though we call it vgpu today. Just thinking about the
> extensibility and usability of this framework, would it be better to 
> name it in a way that any other type of I/O device can be fit into 
> this framework? I don't have a good idea of the name now, but 
> a simple idea is to replace vgpu with vdev (vdev-core, vfio-vdev,
> vfio-iommu-type1-vdev, etc.), and then underlying GPU drivers are
> just one category of users of this general vdev framework. In the
> future it's easily extended to support other I/O virtualization based 
> on similar vgpu concept;
> 
> Second, are these 3 patches already working with nvidia device,
> or are they just conceptual implementation w/o completing actual
> test yet? We'll start moving our implementation toward this direction
> too, so would be good to know the current status and how we can
> further cooperate to move forward. Based on that we can start 
> giving more comments on next level detail.
> 

Hi Kevin,

Yes, we do have an engineering prototype up and running with this set of kernel
patches we have posted.

Please let us know if you have any questions while integrating your vgpu solution
within this framework.

Thanks,
Neo

> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/3] vGPU Core driver
@ 2016-02-29 23:17     ` Neo Jia
  0 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-02-29 23:17 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Ruan, Shuai, Song, Jike, alex.williamson, kvm, qemu-devel,
	Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan

On Mon, Feb 29, 2016 at 05:39:02AM +0000, Tian, Kevin wrote:
> > From: Kirti Wankhede
> > Sent: Wednesday, February 24, 2016 12:24 AM
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Signed-off-by: Neo Jia <cjia@nvidia.com>
> 
> Hi, Kirti/Neo,
> 
> Thanks a lot for you updated version. Having not looked into detail
> code, first come with some high level comments.
> 
> First, in a glimpse the majority of the code (possibly >95%) is device
> agnostic, though we call it vgpu today. Just thinking about the
> extensibility and usability of this framework, would it be better to 
> name it in a way that any other type of I/O device can be fit into 
> this framework? I don't have a good idea of the name now, but 
> a simple idea is to replace vgpu with vdev (vdev-core, vfio-vdev,
> vfio-iommu-type1-vdev, etc.), and then underlying GPU drivers are
> just one category of users of this general vdev framework. In the
> future it's easily extended to support other I/O virtualization based 
> on similar vgpu concept;
> 
> Second, are these 3 patches already working with nvidia device,
> or are they just conceptual implementation w/o completing actual
> test yet? We'll start moving our implementation toward this direction
> too, so would be good to know the current status and how we can
> further cooperate to move forward. Based on that we can start 
> giving more comments on next level detail.
> 

Hi Kevin,

Yes, we do have an engineering prototype up and running with this set of kernel
patches we have posted.

Please let us know if you have any questions while integrating your vgpu solution
within this framework.

Thanks,
Neo

> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 1/3] vGPU Core driver
  2016-02-29 23:17     ` [Qemu-devel] " Neo Jia
@ 2016-03-01  3:10       ` Jike Song
  -1 siblings, 0 replies; 38+ messages in thread
From: Jike Song @ 2016-03-01  3:10 UTC (permalink / raw)
  To: Neo Jia
  Cc: Tian, Kevin, Kirti Wankhede, alex.williamson, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On 03/01/2016 07:17 AM, Neo Jia wrote:
> On Mon, Feb 29, 2016 at 05:39:02AM +0000, Tian, Kevin wrote:
>>> From: Kirti Wankhede
>>> Sent: Wednesday, February 24, 2016 12:24 AM
>>>
>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>> Signed-off-by: Neo Jia <cjia@nvidia.com>
>>
>> Hi, Kirti/Neo,
>>
>> Thanks a lot for you updated version. Having not looked into detail
>> code, first come with some high level comments.
>>
>> First, in a glimpse the majority of the code (possibly >95%) is device
>> agnostic, though we call it vgpu today. Just thinking about the
>> extensibility and usability of this framework, would it be better to 
>> name it in a way that any other type of I/O device can be fit into 
>> this framework? I don't have a good idea of the name now, but 
>> a simple idea is to replace vgpu with vdev (vdev-core, vfio-vdev,
>> vfio-iommu-type1-vdev, etc.), and then underlying GPU drivers are
>> just one category of users of this general vdev framework. In the
>> future it's easily extended to support other I/O virtualization based 
>> on similar vgpu concept;
>>
>> Second, are these 3 patches already working with nvidia device,
>> or are they just conceptual implementation w/o completing actual
>> test yet? We'll start moving our implementation toward this direction
>> too, so would be good to know the current status and how we can
>> further cooperate to move forward. Based on that we can start 
>> giving more comments on next level detail.
>>
> 
> Hi Kevin,
> 
> Yes, we do have an engineering prototype up and running with this set of kernel
> patches we have posted.
> 

Good to know that :)

> Please let us know if you have any questions while integrating your vgpu solution
> within this framework.

Thanks for your work, we are evaluating the integrate of the framework
with our vgpu implementation, will make/propose changes to this.

> 
> Thanks,
> Neo
> 
--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 1/3] vGPU Core driver
@ 2016-03-01  3:10       ` Jike Song
  0 siblings, 0 replies; 38+ messages in thread
From: Jike Song @ 2016-03-01  3:10 UTC (permalink / raw)
  To: Neo Jia
  Cc: Ruan, Shuai, Tian, Kevin, kvm, qemu-devel, Kirti Wankhede,
	alex.williamson, kraxel, pbonzini, Lv, Zhiyuan

On 03/01/2016 07:17 AM, Neo Jia wrote:
> On Mon, Feb 29, 2016 at 05:39:02AM +0000, Tian, Kevin wrote:
>>> From: Kirti Wankhede
>>> Sent: Wednesday, February 24, 2016 12:24 AM
>>>
>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>> Signed-off-by: Neo Jia <cjia@nvidia.com>
>>
>> Hi, Kirti/Neo,
>>
>> Thanks a lot for you updated version. Having not looked into detail
>> code, first come with some high level comments.
>>
>> First, in a glimpse the majority of the code (possibly >95%) is device
>> agnostic, though we call it vgpu today. Just thinking about the
>> extensibility and usability of this framework, would it be better to 
>> name it in a way that any other type of I/O device can be fit into 
>> this framework? I don't have a good idea of the name now, but 
>> a simple idea is to replace vgpu with vdev (vdev-core, vfio-vdev,
>> vfio-iommu-type1-vdev, etc.), and then underlying GPU drivers are
>> just one category of users of this general vdev framework. In the
>> future it's easily extended to support other I/O virtualization based 
>> on similar vgpu concept;
>>
>> Second, are these 3 patches already working with nvidia device,
>> or are they just conceptual implementation w/o completing actual
>> test yet? We'll start moving our implementation toward this direction
>> too, so would be good to know the current status and how we can
>> further cooperate to move forward. Based on that we can start 
>> giving more comments on next level detail.
>>
> 
> Hi Kevin,
> 
> Yes, we do have an engineering prototype up and running with this set of kernel
> patches we have posted.
> 

Good to know that :)

> Please let us know if you have any questions while integrating your vgpu solution
> within this framework.

Thanks for your work, we are evaluating the integrate of the framework
with our vgpu implementation, will make/propose changes to this.

> 
> Thanks,
> Neo
> 
--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-02-23 16:24   ` [Qemu-devel] " Kirti Wankhede
@ 2016-03-02  8:38     ` Jike Song
  -1 siblings, 0 replies; 38+ messages in thread
From: Jike Song @ 2016-03-02  8:38 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, qemu-devel, kvm, kevin.tian,
	shuai.ruan, zhiyuan.lv, Neo Jia

On 02/24/2016 12:24 AM, Kirti Wankhede wrote:
> Aim of this module is to pin and unpin guest memory.
> This module provides interface to GPU driver that can be used to map guest
> physical memory into its kernel space driver.
> Currently this module has duplicate code from vfio_iommu_type1.c
> Working on refining functions to reuse existing code in vfio_iommu_type1.c and
> with that will add API to unpin pages.
> This is for the reference to review the overall design of vGPU.
> 
> Thanks,
> Kirti.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vgpu/Makefile                |    1 +
>  drivers/vgpu/vfio_iommu_type1_vgpu.c |  423 ++++++++++++++++++++++++++++++++++
>  2 files changed, 424 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vgpu/vfio_iommu_type1_vgpu.c
> 
> diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
> index a0a2655..8ace18d 100644
> --- a/drivers/vgpu/Makefile
> +++ b/drivers/vgpu/Makefile
> @@ -3,3 +3,4 @@ vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
>  
>  obj-$(CONFIG_VGPU)			+= vgpu.o
>  obj-$(CONFIG_VGPU_VFIO)                 += vgpu_vfio.o
> +obj-$(CONFIG_VFIO_IOMMU_TYPE1_VGPU)     += vfio_iommu_type1_vgpu.o
> diff --git a/drivers/vgpu/vfio_iommu_type1_vgpu.c b/drivers/vgpu/vfio_iommu_type1_vgpu.c
> new file mode 100644
> index 0000000..0b36ae5
> --- /dev/null
> +++ b/drivers/vgpu/vfio_iommu_type1_vgpu.c
> @@ -0,0 +1,423 @@
> +/*
> + * VGPU : IOMMU DMA mapping support for VGPU
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/miscdevice.h>
> +#include <linux/sched.h>
> +#include <linux/wait.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/vgpu.h>
> +
> +#include "vgpu_private.h"
> +
> +#define DRIVER_VERSION	"0.1"
> +#define DRIVER_AUTHOR	"NVIDIA Corporation"
> +#define DRIVER_DESC     "VGPU Type1 IOMMU driver for VFIO"
> +
> +// VFIO structures
> +
> +struct vfio_iommu_vgpu {
> +	struct mutex lock;
> +	struct iommu_group *group;
> +	struct vgpu_device *vgpu_dev;
> +	struct rb_root dma_list;
> +	struct mm_struct * vm_mm;
> +};
> +
> +struct vgpu_vfio_dma {
> +	struct rb_node node;
> +	dma_addr_t iova;
> +	unsigned long vaddr;
> +	size_t size;
> +	int prot;
> +};
> +
> +/*
> + * VGPU VFIO FOPs definition
> + *
> + */
> +
> +/*
> + * Duplicated from vfio_link_dma, just quick hack ... should
> + * reuse code later
> + */
> +
> +static void vgpu_link_dma(struct vfio_iommu_vgpu *iommu,
> +			  struct vgpu_vfio_dma *new)
> +{
> +	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
> +	struct vgpu_vfio_dma *dma;
> +
> +	while (*link) {
> +		parent = *link;
> +		dma = rb_entry(parent, struct vgpu_vfio_dma, node);
> +
> +		if (new->iova + new->size <= dma->iova)
> +			link = &(*link)->rb_left;
> +		else
> +			link = &(*link)->rb_right;
> +	}
> +
> +	rb_link_node(&new->node, parent, link);
> +	rb_insert_color(&new->node, &iommu->dma_list);
> +}
> +
> +static struct vgpu_vfio_dma *vgpu_find_dma(struct vfio_iommu_vgpu *iommu,
> +					   dma_addr_t start, size_t size)
> +{
> +	struct rb_node *node = iommu->dma_list.rb_node;
> +
> +	while (node) {
> +		struct vgpu_vfio_dma *dma = rb_entry(node, struct vgpu_vfio_dma, node);
> +
> +		if (start + size <= dma->iova)
> +			node = node->rb_left;
> +		else if (start >= dma->iova + dma->size)
> +			node = node->rb_right;
> +		else
> +			return dma;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void vgpu_unlink_dma(struct vfio_iommu_vgpu *iommu, struct vgpu_vfio_dma *old)
> +{
> +	rb_erase(&old->node, &iommu->dma_list);
> +}
> +
> +static void vgpu_dump_dma(struct vfio_iommu_vgpu *iommu)
> +{
> +	struct vgpu_vfio_dma *c, *n;
> +	uint32_t i = 0;
> +
> +	rbtree_postorder_for_each_entry_safe(c, n, &iommu->dma_list, node)
> +		printk(KERN_INFO "%s: dma[%d] iova:0x%llx, vaddr:0x%lx, size:0x%lx\n",
> +		       __FUNCTION__, i++, c->iova, c->vaddr, c->size);
> +}
> +
> +static int vgpu_dma_do_track(struct vfio_iommu_vgpu * vgpu_iommu,
> +	struct vfio_iommu_type1_dma_map *map)
> +{
> +	dma_addr_t iova = map->iova;
> +	unsigned long vaddr = map->vaddr;
> +	int ret = 0, prot = 0;
> +	struct vgpu_vfio_dma *vgpu_dma;
> +
> +	mutex_lock(&vgpu_iommu->lock);
> +
> +	if (vgpu_find_dma(vgpu_iommu, map->iova, map->size)) {
> +		mutex_unlock(&vgpu_iommu->lock);
> +		return -EEXIST;
> +	}
> +
> +	vgpu_dma = kzalloc(sizeof(*vgpu_dma), GFP_KERNEL);
> +
> +	if (!vgpu_dma) {
> +		mutex_unlock(&vgpu_iommu->lock);
> +		return -ENOMEM;
> +	}
> +
> +	vgpu_dma->iova = iova;
> +	vgpu_dma->vaddr = vaddr;
> +	vgpu_dma->prot = prot;
> +	vgpu_dma->size = map->size;
> +
> +	vgpu_link_dma(vgpu_iommu, vgpu_dma);

Hi Kirti & Neo,

seems that no one actually setup mappings for IOMMU here?

> +
> +	mutex_unlock(&vgpu_iommu->lock);
> +	return ret;
> +}
> +
> +static int vgpu_dma_do_untrack(struct vfio_iommu_vgpu * vgpu_iommu,
> +	struct vfio_iommu_type1_dma_unmap *unmap)
> +{
> +	struct vgpu_vfio_dma *vgpu_dma;
> +	size_t unmapped = 0;
> +	int ret = 0;
> +
> +	mutex_lock(&vgpu_iommu->lock);
> +
> +	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, 0);
> +	if (vgpu_dma && vgpu_dma->iova != unmap->iova) {
> +		ret = -EINVAL;
> +		goto unlock;
> +	}
> +
> +	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova + unmap->size - 1, 0);
> +	if (vgpu_dma && vgpu_dma->iova + vgpu_dma->size != unmap->iova + unmap->size) {
> +		ret = -EINVAL;
> +		goto unlock;
> +	}
> +
> +	while (( vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, unmap->size))) {
> +		unmapped += vgpu_dma->size;
> +		vgpu_unlink_dma(vgpu_iommu, vgpu_dma);
> +	}
> +
> +unlock:
> +	mutex_unlock(&vgpu_iommu->lock);
> +	unmap->size = unmapped;
> +
> +	return ret;
> +}
> +
> +/* Ugly hack to quickly test single deivce ... */
> +
> +static struct vfio_iommu_vgpu *_local_iommu = NULL;
> +
> +int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> +{
> +	int i = 0, ret = 0, prot = 0;
> +	unsigned long remote_vaddr = 0, pfn = 0;
> +	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
> +	struct vgpu_vfio_dma *vgpu_dma;
> +	struct page *page[1];
> +	// unsigned long * addr = NULL;
> +	struct mm_struct *mm = vgpu_iommu->vm_mm;
> +
> +	prot = IOMMU_READ | IOMMU_WRITE;
> +
> +	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
> +
> +	mutex_lock(&vgpu_iommu->lock);
> +
> +	vgpu_dump_dma(vgpu_iommu);
> +
> +	for (i = 0; i < count; i++) {
> +		dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
> +		vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
> +
> +		if (!vgpu_dma) {
> +			printk(KERN_INFO "%s: fail locate iova[%d]:0x%llx\n", __FUNCTION__, i, iova);
> +			ret = -EINVAL;
> +			goto unlock;
> +		}
> +
> +		remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
> +		printk(KERN_INFO "%s: find dma iova[%d]:0x%llx, vaddr:0x%lx, size:0x%lx, remote_vaddr:0x%lx\n",
> +			__FUNCTION__, i, vgpu_dma->iova,
> +			vgpu_dma->vaddr, vgpu_dma->size, remote_vaddr);
> +
> +		if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
> +			pfn = page_to_pfn(page[0]);
> +			printk(KERN_INFO "%s: pfn[%d]:0x%lx\n", __FUNCTION__, i, pfn);
> +			// addr = vmap(page, 1, VM_MAP, PAGE_KERNEL);
> +		}
> +		else {
> +			printk(KERN_INFO "%s: fail to pin pfn[%d]\n", __FUNCTION__, i);
> +			ret = -ENOMEM;
> +			goto unlock;
> +		}
> +
> +		gfn_buffer[i] = pfn;
> +		// vunmap(addr);
> +
> +	}
> +
> +unlock:
> +	mutex_unlock(&vgpu_iommu->lock);
> +	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
> +	return ret;
> +}
> +EXPORT_SYMBOL(vgpu_dma_do_translate);
> +
> +static void *vfio_iommu_vgpu_open(unsigned long arg)
> +{
> +	struct vfio_iommu_vgpu *iommu;
> +
> +	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> +
> +	if (!iommu)
> +		return ERR_PTR(-ENOMEM);
> +
> +	mutex_init(&iommu->lock);
> +
> +	printk(KERN_INFO "%s", __FUNCTION__);
> +
> +	/* TODO: Keep track the v2 vs. v1, for now only assume
> +	 * we are v2 due to QEMU code */
> +	_local_iommu = iommu;
> +	return iommu;
> +}
> +
> +static void vfio_iommu_vgpu_release(void *iommu_data)
> +{
> +	struct vfio_iommu_vgpu *iommu = iommu_data;
> +	kfree(iommu);
> +	printk(KERN_INFO "%s", __FUNCTION__);
> +}
> +
> +static long vfio_iommu_vgpu_ioctl(void *iommu_data,
> +		unsigned int cmd, unsigned long arg)
> +{
> +	int ret = 0;
> +	unsigned long minsz;
> +	struct vfio_iommu_vgpu *vgpu_iommu = iommu_data;
> +
> +	switch (cmd) {
> +	case VFIO_CHECK_EXTENSION:
> +	{
> +		if ((arg == VFIO_TYPE1_IOMMU) || (arg == VFIO_TYPE1v2_IOMMU))
> +			return 1;
> +		else
> +			return 0;
> +	}
> +
> +	case VFIO_IOMMU_GET_INFO:
> +	{
> +		struct vfio_iommu_type1_info info;
> +		minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		info.flags = 0;
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_IOMMU_MAP_DMA:
> +	{
> +		// TODO
> +		struct vfio_iommu_type1_dma_map map;
> +		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
> +
> +		if (copy_from_user(&map, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (map.argsz < minsz)
> +			return -EINVAL;
> +
> +		printk(KERN_INFO "VGPU-IOMMU:MAP_DMA flags:%d, vaddr:0x%llx, iova:0x%llx, size:0x%llx\n",
> +			map.flags, map.vaddr, map.iova, map.size);
> +
> +		/*
> +		 * TODO: Tracking code is mostly duplicated from TYPE1 IOMMU, ideally,
> +		 * this should be merged into one single file and reuse data
> +		 * structure
> +		 *
> +		 */
> +		ret = vgpu_dma_do_track(vgpu_iommu, &map);
> +		break;
> +	}
> +	case VFIO_IOMMU_UNMAP_DMA:
> +	{
> +		// TODO
> +		struct vfio_iommu_type1_dma_unmap unmap;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
> +
> +		if (copy_from_user(&unmap, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (unmap.argsz < minsz)
> +			return -EINVAL;
> +
> +		ret = vgpu_dma_do_untrack(vgpu_iommu, &unmap);
> +		break;
> +	}
> +	default:
> +	{
> +		printk(KERN_INFO "%s cmd default ", __FUNCTION__);
> +		ret = -ENOTTY;
> +		break;
> +	}
> +	}
> +
> +	return ret;
> +}
> +
> +
> +static int vfio_iommu_vgpu_attach_group(void *iommu_data,
> +		                        struct iommu_group *iommu_group)
> +{
> +	struct vfio_iommu_vgpu *iommu = iommu_data;
> +	struct vgpu_device *vgpu_dev = NULL;
> +
> +	printk(KERN_INFO "%s", __FUNCTION__);
> +
> +	vgpu_dev = get_vgpu_device_from_group(iommu_group);
> +	if (vgpu_dev) {
> +		iommu->vgpu_dev = vgpu_dev;
> +		iommu->group = iommu_group;
> +
> +		/* IOMMU shares the same life cylce as VM MM */
> +		iommu->vm_mm = current->mm;
> +
> +		return 0;
> +	}
> +	iommu->group = iommu_group;
> +	return 1;
> +}
> +
> +static void vfio_iommu_vgpu_detach_group(void *iommu_data,
> +		struct iommu_group *iommu_group)
> +{
> +	struct vfio_iommu_vgpu *iommu = iommu_data;
> +
> +	printk(KERN_INFO "%s", __FUNCTION__);
> +	iommu->vm_mm = NULL;
> +	iommu->group = NULL;
> +
> +	return;
> +}
> +
> +
> +static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_driver_ops = {
> +	.name           = "vgpu_vfio",
> +	.owner          = THIS_MODULE,
> +	.open           = vfio_iommu_vgpu_open,
> +	.release        = vfio_iommu_vgpu_release,
> +	.ioctl          = vfio_iommu_vgpu_ioctl,
> +	.attach_group   = vfio_iommu_vgpu_attach_group,
> +	.detach_group   = vfio_iommu_vgpu_detach_group,
> +};
> +
> +
> +int vgpu_vfio_iommu_init(void)
> +{
> +	int rc = vfio_register_iommu_driver(&vfio_iommu_vgpu_driver_ops);
> +
> +	printk(KERN_INFO "%s\n", __FUNCTION__);
> +	if (rc < 0) {
> +		printk(KERN_ERR "Error: failed to register vfio iommu, err:%d\n", rc);
> +	}
> +
> +	return rc;
> +}
> +
> +void vgpu_vfio_iommu_exit(void)
> +{
> +	// unregister vgpu_vfio driver
> +	vfio_unregister_iommu_driver(&vfio_iommu_vgpu_driver_ops);
> +	printk(KERN_INFO "%s\n", __FUNCTION__);
> +}
> +
> +
> +module_init(vgpu_vfio_iommu_init);
> +module_exit(vgpu_vfio_iommu_exit);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> +
> 

--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-03-02  8:38     ` Jike Song
  0 siblings, 0 replies; 38+ messages in thread
From: Jike Song @ 2016-03-02  8:38 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: shuai.ruan, kevin.tian, Neo Jia, kvm, qemu-devel,
	alex.williamson, kraxel, pbonzini, zhiyuan.lv

On 02/24/2016 12:24 AM, Kirti Wankhede wrote:
> Aim of this module is to pin and unpin guest memory.
> This module provides interface to GPU driver that can be used to map guest
> physical memory into its kernel space driver.
> Currently this module has duplicate code from vfio_iommu_type1.c
> Working on refining functions to reuse existing code in vfio_iommu_type1.c and
> with that will add API to unpin pages.
> This is for the reference to review the overall design of vGPU.
> 
> Thanks,
> Kirti.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> ---
>  drivers/vgpu/Makefile                |    1 +
>  drivers/vgpu/vfio_iommu_type1_vgpu.c |  423 ++++++++++++++++++++++++++++++++++
>  2 files changed, 424 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vgpu/vfio_iommu_type1_vgpu.c
> 
> diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
> index a0a2655..8ace18d 100644
> --- a/drivers/vgpu/Makefile
> +++ b/drivers/vgpu/Makefile
> @@ -3,3 +3,4 @@ vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
>  
>  obj-$(CONFIG_VGPU)			+= vgpu.o
>  obj-$(CONFIG_VGPU_VFIO)                 += vgpu_vfio.o
> +obj-$(CONFIG_VFIO_IOMMU_TYPE1_VGPU)     += vfio_iommu_type1_vgpu.o
> diff --git a/drivers/vgpu/vfio_iommu_type1_vgpu.c b/drivers/vgpu/vfio_iommu_type1_vgpu.c
> new file mode 100644
> index 0000000..0b36ae5
> --- /dev/null
> +++ b/drivers/vgpu/vfio_iommu_type1_vgpu.c
> @@ -0,0 +1,423 @@
> +/*
> + * VGPU : IOMMU DMA mapping support for VGPU
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/miscdevice.h>
> +#include <linux/sched.h>
> +#include <linux/wait.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/vgpu.h>
> +
> +#include "vgpu_private.h"
> +
> +#define DRIVER_VERSION	"0.1"
> +#define DRIVER_AUTHOR	"NVIDIA Corporation"
> +#define DRIVER_DESC     "VGPU Type1 IOMMU driver for VFIO"
> +
> +// VFIO structures
> +
> +struct vfio_iommu_vgpu {
> +	struct mutex lock;
> +	struct iommu_group *group;
> +	struct vgpu_device *vgpu_dev;
> +	struct rb_root dma_list;
> +	struct mm_struct * vm_mm;
> +};
> +
> +struct vgpu_vfio_dma {
> +	struct rb_node node;
> +	dma_addr_t iova;
> +	unsigned long vaddr;
> +	size_t size;
> +	int prot;
> +};
> +
> +/*
> + * VGPU VFIO FOPs definition
> + *
> + */
> +
> +/*
> + * Duplicated from vfio_link_dma, just quick hack ... should
> + * reuse code later
> + */
> +
> +static void vgpu_link_dma(struct vfio_iommu_vgpu *iommu,
> +			  struct vgpu_vfio_dma *new)
> +{
> +	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
> +	struct vgpu_vfio_dma *dma;
> +
> +	while (*link) {
> +		parent = *link;
> +		dma = rb_entry(parent, struct vgpu_vfio_dma, node);
> +
> +		if (new->iova + new->size <= dma->iova)
> +			link = &(*link)->rb_left;
> +		else
> +			link = &(*link)->rb_right;
> +	}
> +
> +	rb_link_node(&new->node, parent, link);
> +	rb_insert_color(&new->node, &iommu->dma_list);
> +}
> +
> +static struct vgpu_vfio_dma *vgpu_find_dma(struct vfio_iommu_vgpu *iommu,
> +					   dma_addr_t start, size_t size)
> +{
> +	struct rb_node *node = iommu->dma_list.rb_node;
> +
> +	while (node) {
> +		struct vgpu_vfio_dma *dma = rb_entry(node, struct vgpu_vfio_dma, node);
> +
> +		if (start + size <= dma->iova)
> +			node = node->rb_left;
> +		else if (start >= dma->iova + dma->size)
> +			node = node->rb_right;
> +		else
> +			return dma;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void vgpu_unlink_dma(struct vfio_iommu_vgpu *iommu, struct vgpu_vfio_dma *old)
> +{
> +	rb_erase(&old->node, &iommu->dma_list);
> +}
> +
> +static void vgpu_dump_dma(struct vfio_iommu_vgpu *iommu)
> +{
> +	struct vgpu_vfio_dma *c, *n;
> +	uint32_t i = 0;
> +
> +	rbtree_postorder_for_each_entry_safe(c, n, &iommu->dma_list, node)
> +		printk(KERN_INFO "%s: dma[%d] iova:0x%llx, vaddr:0x%lx, size:0x%lx\n",
> +		       __FUNCTION__, i++, c->iova, c->vaddr, c->size);
> +}
> +
> +static int vgpu_dma_do_track(struct vfio_iommu_vgpu * vgpu_iommu,
> +	struct vfio_iommu_type1_dma_map *map)
> +{
> +	dma_addr_t iova = map->iova;
> +	unsigned long vaddr = map->vaddr;
> +	int ret = 0, prot = 0;
> +	struct vgpu_vfio_dma *vgpu_dma;
> +
> +	mutex_lock(&vgpu_iommu->lock);
> +
> +	if (vgpu_find_dma(vgpu_iommu, map->iova, map->size)) {
> +		mutex_unlock(&vgpu_iommu->lock);
> +		return -EEXIST;
> +	}
> +
> +	vgpu_dma = kzalloc(sizeof(*vgpu_dma), GFP_KERNEL);
> +
> +	if (!vgpu_dma) {
> +		mutex_unlock(&vgpu_iommu->lock);
> +		return -ENOMEM;
> +	}
> +
> +	vgpu_dma->iova = iova;
> +	vgpu_dma->vaddr = vaddr;
> +	vgpu_dma->prot = prot;
> +	vgpu_dma->size = map->size;
> +
> +	vgpu_link_dma(vgpu_iommu, vgpu_dma);

Hi Kirti & Neo,

seems that no one actually setup mappings for IOMMU here?

> +
> +	mutex_unlock(&vgpu_iommu->lock);
> +	return ret;
> +}
> +
> +static int vgpu_dma_do_untrack(struct vfio_iommu_vgpu * vgpu_iommu,
> +	struct vfio_iommu_type1_dma_unmap *unmap)
> +{
> +	struct vgpu_vfio_dma *vgpu_dma;
> +	size_t unmapped = 0;
> +	int ret = 0;
> +
> +	mutex_lock(&vgpu_iommu->lock);
> +
> +	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, 0);
> +	if (vgpu_dma && vgpu_dma->iova != unmap->iova) {
> +		ret = -EINVAL;
> +		goto unlock;
> +	}
> +
> +	vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova + unmap->size - 1, 0);
> +	if (vgpu_dma && vgpu_dma->iova + vgpu_dma->size != unmap->iova + unmap->size) {
> +		ret = -EINVAL;
> +		goto unlock;
> +	}
> +
> +	while (( vgpu_dma = vgpu_find_dma(vgpu_iommu, unmap->iova, unmap->size))) {
> +		unmapped += vgpu_dma->size;
> +		vgpu_unlink_dma(vgpu_iommu, vgpu_dma);
> +	}
> +
> +unlock:
> +	mutex_unlock(&vgpu_iommu->lock);
> +	unmap->size = unmapped;
> +
> +	return ret;
> +}
> +
> +/* Ugly hack to quickly test single deivce ... */
> +
> +static struct vfio_iommu_vgpu *_local_iommu = NULL;
> +
> +int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count)
> +{
> +	int i = 0, ret = 0, prot = 0;
> +	unsigned long remote_vaddr = 0, pfn = 0;
> +	struct vfio_iommu_vgpu *vgpu_iommu = _local_iommu;
> +	struct vgpu_vfio_dma *vgpu_dma;
> +	struct page *page[1];
> +	// unsigned long * addr = NULL;
> +	struct mm_struct *mm = vgpu_iommu->vm_mm;
> +
> +	prot = IOMMU_READ | IOMMU_WRITE;
> +
> +	printk(KERN_INFO "%s: >>>>\n", __FUNCTION__);
> +
> +	mutex_lock(&vgpu_iommu->lock);
> +
> +	vgpu_dump_dma(vgpu_iommu);
> +
> +	for (i = 0; i < count; i++) {
> +		dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
> +		vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
> +
> +		if (!vgpu_dma) {
> +			printk(KERN_INFO "%s: fail locate iova[%d]:0x%llx\n", __FUNCTION__, i, iova);
> +			ret = -EINVAL;
> +			goto unlock;
> +		}
> +
> +		remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
> +		printk(KERN_INFO "%s: find dma iova[%d]:0x%llx, vaddr:0x%lx, size:0x%lx, remote_vaddr:0x%lx\n",
> +			__FUNCTION__, i, vgpu_dma->iova,
> +			vgpu_dma->vaddr, vgpu_dma->size, remote_vaddr);
> +
> +		if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
> +			pfn = page_to_pfn(page[0]);
> +			printk(KERN_INFO "%s: pfn[%d]:0x%lx\n", __FUNCTION__, i, pfn);
> +			// addr = vmap(page, 1, VM_MAP, PAGE_KERNEL);
> +		}
> +		else {
> +			printk(KERN_INFO "%s: fail to pin pfn[%d]\n", __FUNCTION__, i);
> +			ret = -ENOMEM;
> +			goto unlock;
> +		}
> +
> +		gfn_buffer[i] = pfn;
> +		// vunmap(addr);
> +
> +	}
> +
> +unlock:
> +	mutex_unlock(&vgpu_iommu->lock);
> +	printk(KERN_INFO "%s: <<<<\n", __FUNCTION__);
> +	return ret;
> +}
> +EXPORT_SYMBOL(vgpu_dma_do_translate);
> +
> +static void *vfio_iommu_vgpu_open(unsigned long arg)
> +{
> +	struct vfio_iommu_vgpu *iommu;
> +
> +	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> +
> +	if (!iommu)
> +		return ERR_PTR(-ENOMEM);
> +
> +	mutex_init(&iommu->lock);
> +
> +	printk(KERN_INFO "%s", __FUNCTION__);
> +
> +	/* TODO: Keep track the v2 vs. v1, for now only assume
> +	 * we are v2 due to QEMU code */
> +	_local_iommu = iommu;
> +	return iommu;
> +}
> +
> +static void vfio_iommu_vgpu_release(void *iommu_data)
> +{
> +	struct vfio_iommu_vgpu *iommu = iommu_data;
> +	kfree(iommu);
> +	printk(KERN_INFO "%s", __FUNCTION__);
> +}
> +
> +static long vfio_iommu_vgpu_ioctl(void *iommu_data,
> +		unsigned int cmd, unsigned long arg)
> +{
> +	int ret = 0;
> +	unsigned long minsz;
> +	struct vfio_iommu_vgpu *vgpu_iommu = iommu_data;
> +
> +	switch (cmd) {
> +	case VFIO_CHECK_EXTENSION:
> +	{
> +		if ((arg == VFIO_TYPE1_IOMMU) || (arg == VFIO_TYPE1v2_IOMMU))
> +			return 1;
> +		else
> +			return 0;
> +	}
> +
> +	case VFIO_IOMMU_GET_INFO:
> +	{
> +		struct vfio_iommu_type1_info info;
> +		minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		info.flags = 0;
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_IOMMU_MAP_DMA:
> +	{
> +		// TODO
> +		struct vfio_iommu_type1_dma_map map;
> +		minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
> +
> +		if (copy_from_user(&map, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (map.argsz < minsz)
> +			return -EINVAL;
> +
> +		printk(KERN_INFO "VGPU-IOMMU:MAP_DMA flags:%d, vaddr:0x%llx, iova:0x%llx, size:0x%llx\n",
> +			map.flags, map.vaddr, map.iova, map.size);
> +
> +		/*
> +		 * TODO: Tracking code is mostly duplicated from TYPE1 IOMMU, ideally,
> +		 * this should be merged into one single file and reuse data
> +		 * structure
> +		 *
> +		 */
> +		ret = vgpu_dma_do_track(vgpu_iommu, &map);
> +		break;
> +	}
> +	case VFIO_IOMMU_UNMAP_DMA:
> +	{
> +		// TODO
> +		struct vfio_iommu_type1_dma_unmap unmap;
> +
> +		minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
> +
> +		if (copy_from_user(&unmap, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (unmap.argsz < minsz)
> +			return -EINVAL;
> +
> +		ret = vgpu_dma_do_untrack(vgpu_iommu, &unmap);
> +		break;
> +	}
> +	default:
> +	{
> +		printk(KERN_INFO "%s cmd default ", __FUNCTION__);
> +		ret = -ENOTTY;
> +		break;
> +	}
> +	}
> +
> +	return ret;
> +}
> +
> +
> +static int vfio_iommu_vgpu_attach_group(void *iommu_data,
> +		                        struct iommu_group *iommu_group)
> +{
> +	struct vfio_iommu_vgpu *iommu = iommu_data;
> +	struct vgpu_device *vgpu_dev = NULL;
> +
> +	printk(KERN_INFO "%s", __FUNCTION__);
> +
> +	vgpu_dev = get_vgpu_device_from_group(iommu_group);
> +	if (vgpu_dev) {
> +		iommu->vgpu_dev = vgpu_dev;
> +		iommu->group = iommu_group;
> +
> +		/* IOMMU shares the same life cylce as VM MM */
> +		iommu->vm_mm = current->mm;
> +
> +		return 0;
> +	}
> +	iommu->group = iommu_group;
> +	return 1;
> +}
> +
> +static void vfio_iommu_vgpu_detach_group(void *iommu_data,
> +		struct iommu_group *iommu_group)
> +{
> +	struct vfio_iommu_vgpu *iommu = iommu_data;
> +
> +	printk(KERN_INFO "%s", __FUNCTION__);
> +	iommu->vm_mm = NULL;
> +	iommu->group = NULL;
> +
> +	return;
> +}
> +
> +
> +static const struct vfio_iommu_driver_ops vfio_iommu_vgpu_driver_ops = {
> +	.name           = "vgpu_vfio",
> +	.owner          = THIS_MODULE,
> +	.open           = vfio_iommu_vgpu_open,
> +	.release        = vfio_iommu_vgpu_release,
> +	.ioctl          = vfio_iommu_vgpu_ioctl,
> +	.attach_group   = vfio_iommu_vgpu_attach_group,
> +	.detach_group   = vfio_iommu_vgpu_detach_group,
> +};
> +
> +
> +int vgpu_vfio_iommu_init(void)
> +{
> +	int rc = vfio_register_iommu_driver(&vfio_iommu_vgpu_driver_ops);
> +
> +	printk(KERN_INFO "%s\n", __FUNCTION__);
> +	if (rc < 0) {
> +		printk(KERN_ERR "Error: failed to register vfio iommu, err:%d\n", rc);
> +	}
> +
> +	return rc;
> +}
> +
> +void vgpu_vfio_iommu_exit(void)
> +{
> +	// unregister vgpu_vfio driver
> +	vfio_unregister_iommu_driver(&vfio_iommu_vgpu_driver_ops);
> +	printk(KERN_INFO "%s\n", __FUNCTION__);
> +}
> +
> +
> +module_init(vgpu_vfio_iommu_init);
> +module_exit(vgpu_vfio_iommu_exit);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> +
> 

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-03-02  8:38     ` [Qemu-devel] " Jike Song
@ 2016-03-04  7:00       ` Neo Jia
  -1 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-03-04  7:00 UTC (permalink / raw)
  To: Jike Song
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, qemu-devel,
	kvm, kevin.tian, shuai.ruan, zhiyuan.lv

On Wed, Mar 02, 2016 at 04:38:34PM +0800, Jike Song wrote:
> On 02/24/2016 12:24 AM, Kirti Wankhede wrote:
> > +	vgpu_dma->size = map->size;
> > +
> > +	vgpu_link_dma(vgpu_iommu, vgpu_dma);
> 
> Hi Kirti & Neo,
> 
> seems that no one actually setup mappings for IOMMU here?
> 

Hi Jike,

Yes.

The actual mapping should be done by the host kernel driver after calling the
translation/pinning API vgpu_dma_do_translate.

Thanks,
Neo

> > 
> 
> --
> Thanks,
> Jike
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-03-04  7:00       ` Neo Jia
  0 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-03-04  7:00 UTC (permalink / raw)
  To: Jike Song
  Cc: shuai.ruan, kevin.tian, alex.williamson, kvm, qemu-devel,
	Kirti Wankhede, kraxel, pbonzini, zhiyuan.lv

On Wed, Mar 02, 2016 at 04:38:34PM +0800, Jike Song wrote:
> On 02/24/2016 12:24 AM, Kirti Wankhede wrote:
> > +	vgpu_dma->size = map->size;
> > +
> > +	vgpu_link_dma(vgpu_iommu, vgpu_dma);
> 
> Hi Kirti & Neo,
> 
> seems that no one actually setup mappings for IOMMU here?
> 

Hi Jike,

Yes.

The actual mapping should be done by the host kernel driver after calling the
translation/pinning API vgpu_dma_do_translate.

Thanks,
Neo

> > 
> 
> --
> Thanks,
> Jike
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-03-04  7:00       ` [Qemu-devel] " Neo Jia
@ 2016-03-07  6:07         ` Jike Song
  -1 siblings, 0 replies; 38+ messages in thread
From: Jike Song @ 2016-03-07  6:07 UTC (permalink / raw)
  To: Neo Jia
  Cc: Jike Song, Kirti Wankhede, Alex Williamson, pbonzini, kraxel,
	qemu-devel, kvm, kevin.tian, shuai.ruan, zhiyuan.lv

Hi Neo,

On Fri, Mar 4, 2016 at 3:00 PM, Neo Jia <cjia@nvidia.com> wrote:
> On Wed, Mar 02, 2016 at 04:38:34PM +0800, Jike Song wrote:
>> On 02/24/2016 12:24 AM, Kirti Wankhede wrote:
>> > +   vgpu_dma->size = map->size;
>> > +
>> > +   vgpu_link_dma(vgpu_iommu, vgpu_dma);
>>
>> Hi Kirti & Neo,
>>
>> seems that no one actually setup mappings for IOMMU here?
>>
>
> Hi Jike,
>
> Yes.
>
> The actual mapping should be done by the host kernel driver after calling the
> translation/pinning API vgpu_dma_do_translate.

Thanks for the reply. I mis-deleted the mail in my intel account, so
reply with private mail account, sorry for that.


In vgpu_dma_do_translate():

for (i = 0; i < count; i++) {
   {snip}
   dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
   vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);

    remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
    if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
        pfn = page_to_pfn(page[0]);
    }
    gfn_buffer[i] = pfn;
}

If I understand correctly, the purpose of above code, is given an
array of gfns, try to pin & return associated pfns. There is still no
IOMMU mappings here.  Is it supposed to be the caller who should set
up IOMMU by DMA api such as dma_map_page(), after calling
vgpu_dma_do_translate()?


-- 
Thanks,
Jike

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-03-07  6:07         ` Jike Song
  0 siblings, 0 replies; 38+ messages in thread
From: Jike Song @ 2016-03-07  6:07 UTC (permalink / raw)
  To: Neo Jia
  Cc: shuai.ruan, Jike Song, kvm, qemu-devel, Kirti Wankhede,
	kevin.tian, Alex Williamson, kraxel, pbonzini, zhiyuan.lv

Hi Neo,

On Fri, Mar 4, 2016 at 3:00 PM, Neo Jia <cjia@nvidia.com> wrote:
> On Wed, Mar 02, 2016 at 04:38:34PM +0800, Jike Song wrote:
>> On 02/24/2016 12:24 AM, Kirti Wankhede wrote:
>> > +   vgpu_dma->size = map->size;
>> > +
>> > +   vgpu_link_dma(vgpu_iommu, vgpu_dma);
>>
>> Hi Kirti & Neo,
>>
>> seems that no one actually setup mappings for IOMMU here?
>>
>
> Hi Jike,
>
> Yes.
>
> The actual mapping should be done by the host kernel driver after calling the
> translation/pinning API vgpu_dma_do_translate.

Thanks for the reply. I mis-deleted the mail in my intel account, so
reply with private mail account, sorry for that.


In vgpu_dma_do_translate():

for (i = 0; i < count; i++) {
   {snip}
   dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
   vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);

    remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
    if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
        pfn = page_to_pfn(page[0]);
    }
    gfn_buffer[i] = pfn;
}

If I understand correctly, the purpose of above code, is given an
array of gfns, try to pin & return associated pfns. There is still no
IOMMU mappings here.  Is it supposed to be the caller who should set
up IOMMU by DMA api such as dma_map_page(), after calling
vgpu_dma_do_translate()?


-- 
Thanks,
Jike

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-03-07  6:07         ` [Qemu-devel] " Jike Song
@ 2016-03-08  0:31           ` Neo Jia
  -1 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-03-08  0:31 UTC (permalink / raw)
  To: jike.song
  Cc: shuai.ruan, Jike Song, kvm, qemu-devel, Kirti Wankhede,
	kevin.tian, Alex Williamson, kraxel, pbonzini, zhiyuan.lv

On Mon, Mar 07, 2016 at 02:07:15PM +0800, Jike Song wrote:
> Hi Neo,
> 
> On Fri, Mar 4, 2016 at 3:00 PM, Neo Jia <cjia@nvidia.com> wrote:
> > On Wed, Mar 02, 2016 at 04:38:34PM +0800, Jike Song wrote:
> >> On 02/24/2016 12:24 AM, Kirti Wankhede wrote:
> >> > +   vgpu_dma->size = map->size;
> >> > +
> >> > +   vgpu_link_dma(vgpu_iommu, vgpu_dma);
> >>
> >> Hi Kirti & Neo,
> >>
> >> seems that no one actually setup mappings for IOMMU here?
> >>
> >
> > Hi Jike,
> >
> > Yes.
> >
> > The actual mapping should be done by the host kernel driver after calling the
> > translation/pinning API vgpu_dma_do_translate.
> 
> Thanks for the reply. I mis-deleted the mail in my intel account, so
> reply with private mail account, sorry for that.
> 
> 
> In vgpu_dma_do_translate():
> 
> for (i = 0; i < count; i++) {
>    {snip}
>    dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
>    vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
> 
>     remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
>     if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
>         pfn = page_to_pfn(page[0]);
>     }
>     gfn_buffer[i] = pfn;
> }
> 
> If I understand correctly, the purpose of above code, is given an
> array of gfns, try to pin & return associated pfns. There is still no
> IOMMU mappings here.  

Yes.

> Is it supposed to be the caller who should set
> up IOMMU by DMA api such as dma_map_page(), after calling
> vgpu_dma_do_translate()?
> 

Don't think you need to call dma_map_page here. Once you have the pfn available
to your GPU kernel driver, you can just go ahead to setup the mapping as you
normally do such as calling pci_map_sg and its friends.

Thanks,
Neo

> 
> -- 
> Thanks,
> Jike

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-03-08  0:31           ` Neo Jia
  0 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-03-08  0:31 UTC (permalink / raw)
  To: jike.song
  Cc: shuai.ruan, kvm, qemu-devel, Kirti Wankhede, kevin.tian,
	Alex Williamson, kraxel, pbonzini, zhiyuan.lv

On Mon, Mar 07, 2016 at 02:07:15PM +0800, Jike Song wrote:
> Hi Neo,
> 
> On Fri, Mar 4, 2016 at 3:00 PM, Neo Jia <cjia@nvidia.com> wrote:
> > On Wed, Mar 02, 2016 at 04:38:34PM +0800, Jike Song wrote:
> >> On 02/24/2016 12:24 AM, Kirti Wankhede wrote:
> >> > +   vgpu_dma->size = map->size;
> >> > +
> >> > +   vgpu_link_dma(vgpu_iommu, vgpu_dma);
> >>
> >> Hi Kirti & Neo,
> >>
> >> seems that no one actually setup mappings for IOMMU here?
> >>
> >
> > Hi Jike,
> >
> > Yes.
> >
> > The actual mapping should be done by the host kernel driver after calling the
> > translation/pinning API vgpu_dma_do_translate.
> 
> Thanks for the reply. I mis-deleted the mail in my intel account, so
> reply with private mail account, sorry for that.
> 
> 
> In vgpu_dma_do_translate():
> 
> for (i = 0; i < count; i++) {
>    {snip}
>    dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
>    vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
> 
>     remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
>     if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
>         pfn = page_to_pfn(page[0]);
>     }
>     gfn_buffer[i] = pfn;
> }
> 
> If I understand correctly, the purpose of above code, is given an
> array of gfns, try to pin & return associated pfns. There is still no
> IOMMU mappings here.  

Yes.

> Is it supposed to be the caller who should set
> up IOMMU by DMA api such as dma_map_page(), after calling
> vgpu_dma_do_translate()?
> 

Don't think you need to call dma_map_page here. Once you have the pfn available
to your GPU kernel driver, you can just go ahead to setup the mapping as you
normally do such as calling pci_map_sg and its friends.

Thanks,
Neo

> 
> -- 
> Thanks,
> Jike

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-03-08  0:31           ` [Qemu-devel] " Neo Jia
@ 2016-03-10  3:10             ` Jike Song
  -1 siblings, 0 replies; 38+ messages in thread
From: Jike Song @ 2016-03-10  3:10 UTC (permalink / raw)
  To: Neo Jia
  Cc: Kirti Wankhede, Alex Williamson, pbonzini, kraxel, qemu-devel,
	kvm, kevin.tian, shuai.ruan, zhiyuan.lv

On 03/08/2016 08:31 AM, Neo Jia wrote:
> On Mon, Mar 07, 2016 at 02:07:15PM +0800, Jike Song wrote:
>> Hi Neo,
>>
>> On Fri, Mar 4, 2016 at 3:00 PM, Neo Jia <cjia@nvidia.com> wrote:
>>> On Wed, Mar 02, 2016 at 04:38:34PM +0800, Jike Song wrote:
>>>> On 02/24/2016 12:24 AM, Kirti Wankhede wrote:
>>>>> +   vgpu_dma->size = map->size;
>>>>> +
>>>>> +   vgpu_link_dma(vgpu_iommu, vgpu_dma);
>>>>
>>>> Hi Kirti & Neo,
>>>>
>>>> seems that no one actually setup mappings for IOMMU here?
>>>>
>>>
>>> Hi Jike,
>>>
>>> Yes.
>>>
>>> The actual mapping should be done by the host kernel driver after calling the
>>> translation/pinning API vgpu_dma_do_translate.
>>
>> Thanks for the reply. I mis-deleted the mail in my intel account, so
>> reply with private mail account, sorry for that.
>>
>>
>> In vgpu_dma_do_translate():
>>
>> for (i = 0; i < count; i++) {
>>    {snip}
>>    dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
>>    vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
>>
>>     remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
>>     if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
>>         pfn = page_to_pfn(page[0]);
>>     }
>>     gfn_buffer[i] = pfn;
>> }
>>
>> If I understand correctly, the purpose of above code, is given an
>> array of gfns, try to pin & return associated pfns. There is still no
>> IOMMU mappings here.  
> 
> Yes.
> 

Thanks for the conformation.

>> Is it supposed to be the caller who should set
>> up IOMMU by DMA api such as dma_map_page(), after calling
>> vgpu_dma_do_translate()?
>>
> 
> Don't think you need to call dma_map_page here. Once you have the pfn available
> to your GPU kernel driver, you can just go ahead to setup the mapping as you
> normally do such as calling pci_map_sg and its friends.
> 

Technically it's definitely OK to call DMA API from the caller rather than here,
however personally I think it is a bit counter-intuitive: IOMMU page tables
should be constructed within the VFIO IOMMU driver.


> Thanks,
> Neo

--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-03-10  3:10             ` Jike Song
  0 siblings, 0 replies; 38+ messages in thread
From: Jike Song @ 2016-03-10  3:10 UTC (permalink / raw)
  To: Neo Jia
  Cc: shuai.ruan, kevin.tian, Alex Williamson, kvm, qemu-devel,
	Kirti Wankhede, kraxel, pbonzini, zhiyuan.lv

On 03/08/2016 08:31 AM, Neo Jia wrote:
> On Mon, Mar 07, 2016 at 02:07:15PM +0800, Jike Song wrote:
>> Hi Neo,
>>
>> On Fri, Mar 4, 2016 at 3:00 PM, Neo Jia <cjia@nvidia.com> wrote:
>>> On Wed, Mar 02, 2016 at 04:38:34PM +0800, Jike Song wrote:
>>>> On 02/24/2016 12:24 AM, Kirti Wankhede wrote:
>>>>> +   vgpu_dma->size = map->size;
>>>>> +
>>>>> +   vgpu_link_dma(vgpu_iommu, vgpu_dma);
>>>>
>>>> Hi Kirti & Neo,
>>>>
>>>> seems that no one actually setup mappings for IOMMU here?
>>>>
>>>
>>> Hi Jike,
>>>
>>> Yes.
>>>
>>> The actual mapping should be done by the host kernel driver after calling the
>>> translation/pinning API vgpu_dma_do_translate.
>>
>> Thanks for the reply. I mis-deleted the mail in my intel account, so
>> reply with private mail account, sorry for that.
>>
>>
>> In vgpu_dma_do_translate():
>>
>> for (i = 0; i < count; i++) {
>>    {snip}
>>    dma_addr_t iova = gfn_buffer[i] << PAGE_SHIFT;
>>    vgpu_dma = vgpu_find_dma(vgpu_iommu, iova, 0 /*  size */);
>>
>>     remote_vaddr = vgpu_dma->vaddr + iova - vgpu_dma->iova;
>>     if (get_user_pages_unlocked(NULL, mm, remote_vaddr, 1, 1, 0, page) == 1) {
>>         pfn = page_to_pfn(page[0]);
>>     }
>>     gfn_buffer[i] = pfn;
>> }
>>
>> If I understand correctly, the purpose of above code, is given an
>> array of gfns, try to pin & return associated pfns. There is still no
>> IOMMU mappings here.  
> 
> Yes.
> 

Thanks for the conformation.

>> Is it supposed to be the caller who should set
>> up IOMMU by DMA api such as dma_map_page(), after calling
>> vgpu_dma_do_translate()?
>>
> 
> Don't think you need to call dma_map_page here. Once you have the pfn available
> to your GPU kernel driver, you can just go ahead to setup the mapping as you
> normally do such as calling pci_map_sg and its friends.
> 

Technically it's definitely OK to call DMA API from the caller rather than here,
however personally I think it is a bit counter-intuitive: IOMMU page tables
should be constructed within the VFIO IOMMU driver.


> Thanks,
> Neo

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-03-10  3:10             ` [Qemu-devel] " Jike Song
@ 2016-03-11  4:19               ` Neo Jia
  -1 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-03-11  4:19 UTC (permalink / raw)
  To: Jike Song
  Cc: Kirti Wankhede, Alex Williamson, pbonzini, kraxel, qemu-devel,
	kvm, kevin.tian, shuai.ruan, zhiyuan.lv

On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:
> 
> >> Is it supposed to be the caller who should set
> >> up IOMMU by DMA api such as dma_map_page(), after calling
> >> vgpu_dma_do_translate()?
> >>
> > 
> > Don't think you need to call dma_map_page here. Once you have the pfn available
> > to your GPU kernel driver, you can just go ahead to setup the mapping as you
> > normally do such as calling pci_map_sg and its friends.
> > 
> 
> Technically it's definitely OK to call DMA API from the caller rather than here,
> however personally I think it is a bit counter-intuitive: IOMMU page tables
> should be constructed within the VFIO IOMMU driver.
> 

Hi Jike,

For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore 
the actual interaction with the real GPU should be managed by the GPU vendor driver.

With the default TYPE1 IOMMU, it works with the vfio-pci as it owns the device.

Thanks,
Neo

> --
> Thanks,
> Jike
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-03-11  4:19               ` Neo Jia
  0 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-03-11  4:19 UTC (permalink / raw)
  To: Jike Song
  Cc: shuai.ruan, kevin.tian, Alex Williamson, kvm, qemu-devel,
	Kirti Wankhede, kraxel, pbonzini, zhiyuan.lv

On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:
> 
> >> Is it supposed to be the caller who should set
> >> up IOMMU by DMA api such as dma_map_page(), after calling
> >> vgpu_dma_do_translate()?
> >>
> > 
> > Don't think you need to call dma_map_page here. Once you have the pfn available
> > to your GPU kernel driver, you can just go ahead to setup the mapping as you
> > normally do such as calling pci_map_sg and its friends.
> > 
> 
> Technically it's definitely OK to call DMA API from the caller rather than here,
> however personally I think it is a bit counter-intuitive: IOMMU page tables
> should be constructed within the VFIO IOMMU driver.
> 

Hi Jike,

For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore 
the actual interaction with the real GPU should be managed by the GPU vendor driver.

With the default TYPE1 IOMMU, it works with the vfio-pci as it owns the device.

Thanks,
Neo

> --
> Thanks,
> Jike
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-03-11  4:19               ` [Qemu-devel] " Neo Jia
@ 2016-03-11  4:46                 ` Tian, Kevin
  -1 siblings, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2016-03-11  4:46 UTC (permalink / raw)
  To: Neo Jia, Song, Jike
  Cc: Kirti Wankhede, Alex Williamson, pbonzini, kraxel, qemu-devel,
	kvm, Ruan, Shuai, Lv, Zhiyuan

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Friday, March 11, 2016 12:20 PM
> 
> On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:
> >
> > >> Is it supposed to be the caller who should set
> > >> up IOMMU by DMA api such as dma_map_page(), after calling
> > >> vgpu_dma_do_translate()?
> > >>
> > >
> > > Don't think you need to call dma_map_page here. Once you have the pfn available
> > > to your GPU kernel driver, you can just go ahead to setup the mapping as you
> > > normally do such as calling pci_map_sg and its friends.
> > >
> >
> > Technically it's definitely OK to call DMA API from the caller rather than here,
> > however personally I think it is a bit counter-intuitive: IOMMU page tables
> > should be constructed within the VFIO IOMMU driver.
> >
> 
> Hi Jike,
> 
> For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore
> the actual interaction with the real GPU should be managed by the GPU vendor driver.
> 

Hi, Neo,

Seems we have a different thought on this. Regardless of whether it's a virtual/physical 
device, imo, VFIO should manage IOMMU configuration. The only difference is:

- for physical device, VFIO directly invokes IOMMU API to set IOMMU entry (GPA->HPA);
- for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to IOMMU entry 
set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);

This would provide an unified way to manage the translation in VFIO, and then vendor
specific driver only needs to query and use returned IOVA corresponding to a GPA. 

Doing so has another benefit, to make underlying vGPU driver VMM agnostic. For KVM,
yes we can use pci_map_sg. However for Xen it's different (today Dom0 doesn't see
IOMMU. In the future there'll be a PVIOMMU implementation) so different code path is 
required. It's better to abstract such specific knowledge out of vGPU driver, which just
uses whatever dma_addr returned by other agent (VFIO here, or another Xen specific
agent) in a centralized way.

Alex, what's your opinion on this?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-03-11  4:46                 ` Tian, Kevin
  0 siblings, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2016-03-11  4:46 UTC (permalink / raw)
  To: Neo Jia, Song, Jike
  Cc: Ruan, Shuai, Alex Williamson, kvm, qemu-devel, Kirti Wankhede,
	kraxel, pbonzini, Lv, Zhiyuan

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Friday, March 11, 2016 12:20 PM
> 
> On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:
> >
> > >> Is it supposed to be the caller who should set
> > >> up IOMMU by DMA api such as dma_map_page(), after calling
> > >> vgpu_dma_do_translate()?
> > >>
> > >
> > > Don't think you need to call dma_map_page here. Once you have the pfn available
> > > to your GPU kernel driver, you can just go ahead to setup the mapping as you
> > > normally do such as calling pci_map_sg and its friends.
> > >
> >
> > Technically it's definitely OK to call DMA API from the caller rather than here,
> > however personally I think it is a bit counter-intuitive: IOMMU page tables
> > should be constructed within the VFIO IOMMU driver.
> >
> 
> Hi Jike,
> 
> For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore
> the actual interaction with the real GPU should be managed by the GPU vendor driver.
> 

Hi, Neo,

Seems we have a different thought on this. Regardless of whether it's a virtual/physical 
device, imo, VFIO should manage IOMMU configuration. The only difference is:

- for physical device, VFIO directly invokes IOMMU API to set IOMMU entry (GPA->HPA);
- for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to IOMMU entry 
set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);

This would provide an unified way to manage the translation in VFIO, and then vendor
specific driver only needs to query and use returned IOVA corresponding to a GPA. 

Doing so has another benefit, to make underlying vGPU driver VMM agnostic. For KVM,
yes we can use pci_map_sg. However for Xen it's different (today Dom0 doesn't see
IOMMU. In the future there'll be a PVIOMMU implementation) so different code path is 
required. It's better to abstract such specific knowledge out of vGPU driver, which just
uses whatever dma_addr returned by other agent (VFIO here, or another Xen specific
agent) in a centralized way.

Alex, what's your opinion on this?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-03-11  4:46                 ` [Qemu-devel] " Tian, Kevin
@ 2016-03-11  6:10                   ` Neo Jia
  -1 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-03-11  6:10 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Ruan, Shuai, Song, Jike, kvm, qemu-devel, Kirti Wankhede,
	Alex Williamson, kraxel, pbonzini, Lv, Zhiyuan

On Fri, Mar 11, 2016 at 04:46:23AM +0000, Tian, Kevin wrote:
> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Friday, March 11, 2016 12:20 PM
> > 
> > On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:
> > >
> > > >> Is it supposed to be the caller who should set
> > > >> up IOMMU by DMA api such as dma_map_page(), after calling
> > > >> vgpu_dma_do_translate()?
> > > >>
> > > >
> > > > Don't think you need to call dma_map_page here. Once you have the pfn available
> > > > to your GPU kernel driver, you can just go ahead to setup the mapping as you
> > > > normally do such as calling pci_map_sg and its friends.
> > > >
> > >
> > > Technically it's definitely OK to call DMA API from the caller rather than here,
> > > however personally I think it is a bit counter-intuitive: IOMMU page tables
> > > should be constructed within the VFIO IOMMU driver.
> > >
> > 
> > Hi Jike,
> > 
> > For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore
> > the actual interaction with the real GPU should be managed by the GPU vendor driver.
> > 
> 
> Hi, Neo,
> 
> Seems we have a different thought on this. Regardless of whether it's a virtual/physical 
> device, imo, VFIO should manage IOMMU configuration. The only difference is:
> 
> - for physical device, VFIO directly invokes IOMMU API to set IOMMU entry (GPA->HPA);
> - for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to IOMMU entry 
> set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);

How does it make any sense for us to do a dma_map_page for a physical device that we don't 
have any direct interaction with?

> 
> This would provide an unified way to manage the translation in VFIO, and then vendor
> specific driver only needs to query and use returned IOVA corresponding to a GPA. 
> 
> Doing so has another benefit, to make underlying vGPU driver VMM agnostic. For KVM,
> yes we can use pci_map_sg. However for Xen it's different (today Dom0 doesn't see
> IOMMU. In the future there'll be a PVIOMMU implementation) so different code path is 
> required. It's better to abstract such specific knowledge out of vGPU driver, which just
> uses whatever dma_addr returned by other agent (VFIO here, or another Xen specific
> agent) in a centralized way.
> 
> Alex, what's your opinion on this?
> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-03-11  6:10                   ` Neo Jia
  0 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-03-11  6:10 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Ruan, Shuai, Song, Jike, kvm, qemu-devel, Kirti Wankhede,
	Alex Williamson, kraxel, pbonzini, Lv, Zhiyuan

On Fri, Mar 11, 2016 at 04:46:23AM +0000, Tian, Kevin wrote:
> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Friday, March 11, 2016 12:20 PM
> > 
> > On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:
> > >
> > > >> Is it supposed to be the caller who should set
> > > >> up IOMMU by DMA api such as dma_map_page(), after calling
> > > >> vgpu_dma_do_translate()?
> > > >>
> > > >
> > > > Don't think you need to call dma_map_page here. Once you have the pfn available
> > > > to your GPU kernel driver, you can just go ahead to setup the mapping as you
> > > > normally do such as calling pci_map_sg and its friends.
> > > >
> > >
> > > Technically it's definitely OK to call DMA API from the caller rather than here,
> > > however personally I think it is a bit counter-intuitive: IOMMU page tables
> > > should be constructed within the VFIO IOMMU driver.
> > >
> > 
> > Hi Jike,
> > 
> > For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore
> > the actual interaction with the real GPU should be managed by the GPU vendor driver.
> > 
> 
> Hi, Neo,
> 
> Seems we have a different thought on this. Regardless of whether it's a virtual/physical 
> device, imo, VFIO should manage IOMMU configuration. The only difference is:
> 
> - for physical device, VFIO directly invokes IOMMU API to set IOMMU entry (GPA->HPA);
> - for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to IOMMU entry 
> set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);

How does it make any sense for us to do a dma_map_page for a physical device that we don't 
have any direct interaction with?

> 
> This would provide an unified way to manage the translation in VFIO, and then vendor
> specific driver only needs to query and use returned IOVA corresponding to a GPA. 
> 
> Doing so has another benefit, to make underlying vGPU driver VMM agnostic. For KVM,
> yes we can use pci_map_sg. However for Xen it's different (today Dom0 doesn't see
> IOMMU. In the future there'll be a PVIOMMU implementation) so different code path is 
> required. It's better to abstract such specific knowledge out of vGPU driver, which just
> uses whatever dma_addr returned by other agent (VFIO here, or another Xen specific
> agent) in a centralized way.
> 
> Alex, what's your opinion on this?
> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-03-11  6:10                   ` [Qemu-devel] " Neo Jia
@ 2016-03-11  8:06                     ` Tian, Kevin
  -1 siblings, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2016-03-11  8:06 UTC (permalink / raw)
  To: Neo Jia
  Cc: Song, Jike, Kirti Wankhede, Alex Williamson, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

> From: Neo Jia
> Sent: Friday, March 11, 2016 2:11 PM
> > > Hi Jike,
> > >
> > > For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore
> > > the actual interaction with the real GPU should be managed by the GPU vendor driver.
> > >
> >
> > Hi, Neo,
> >
> > Seems we have a different thought on this. Regardless of whether it's a virtual/physical
> > device, imo, VFIO should manage IOMMU configuration. The only difference is:
> >
> > - for physical device, VFIO directly invokes IOMMU API to set IOMMU entry (GPA->HPA);
> > - for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to IOMMU entry
> > set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);
> 
> How does it make any sense for us to do a dma_map_page for a physical device that we
> don't
> have any direct interaction with?
> 

That is also a valid point. It really depends on how we look at this issue.

>From VFIO p.o.v, it needs to enforce DMA isolation for managed devices. In 
that manner it doesn't matter whether it's a physical or virtual one. However 
if looking at specific linux DMA interface, you are right that it is built around 
the physical device instance, which is not managed by VFIO in this case. 

On the other hand, your proposal leaves DMA mapping to vendor specific 
driver which actually manages physical device. However this way VFIO relies
on other agent to enforce DMA isolation of vGPUs. Might not be a real 
problem (more a conceptual one)...

So let me do more thinking here (half-way convinced by you) :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-03-11  8:06                     ` Tian, Kevin
  0 siblings, 0 replies; 38+ messages in thread
From: Tian, Kevin @ 2016-03-11  8:06 UTC (permalink / raw)
  To: Neo Jia
  Cc: Ruan, Shuai, Song, Jike, kvm, qemu-devel, Kirti Wankhede,
	Alex Williamson, kraxel, pbonzini, Lv, Zhiyuan

> From: Neo Jia
> Sent: Friday, March 11, 2016 2:11 PM
> > > Hi Jike,
> > >
> > > For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore
> > > the actual interaction with the real GPU should be managed by the GPU vendor driver.
> > >
> >
> > Hi, Neo,
> >
> > Seems we have a different thought on this. Regardless of whether it's a virtual/physical
> > device, imo, VFIO should manage IOMMU configuration. The only difference is:
> >
> > - for physical device, VFIO directly invokes IOMMU API to set IOMMU entry (GPA->HPA);
> > - for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to IOMMU entry
> > set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);
> 
> How does it make any sense for us to do a dma_map_page for a physical device that we
> don't
> have any direct interaction with?
> 

That is also a valid point. It really depends on how we look at this issue.

>From VFIO p.o.v, it needs to enforce DMA isolation for managed devices. In 
that manner it doesn't matter whether it's a physical or virtual one. However 
if looking at specific linux DMA interface, you are right that it is built around 
the physical device instance, which is not managed by VFIO in this case. 

On the other hand, your proposal leaves DMA mapping to vendor specific 
driver which actually manages physical device. However this way VFIO relies
on other agent to enforce DMA isolation of vGPUs. Might not be a real 
problem (more a conceptual one)...

So let me do more thinking here (half-way convinced by you) :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-03-11  4:46                 ` [Qemu-devel] " Tian, Kevin
@ 2016-03-11 16:13                   ` Alex Williamson
  -1 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2016-03-11 16:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Neo Jia, Song, Jike, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Fri, 11 Mar 2016 04:46:23 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Friday, March 11, 2016 12:20 PM
> > 
> > On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:  
> > >  
> > > >> Is it supposed to be the caller who should set
> > > >> up IOMMU by DMA api such as dma_map_page(), after calling
> > > >> vgpu_dma_do_translate()?
> > > >>  
> > > >
> > > > Don't think you need to call dma_map_page here. Once you have the pfn available
> > > > to your GPU kernel driver, you can just go ahead to setup the mapping as you
> > > > normally do such as calling pci_map_sg and its friends.
> > > >  
> > >
> > > Technically it's definitely OK to call DMA API from the caller rather than here,
> > > however personally I think it is a bit counter-intuitive: IOMMU page tables
> > > should be constructed within the VFIO IOMMU driver.
> > >  
> > 
> > Hi Jike,
> > 
> > For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore
> > the actual interaction with the real GPU should be managed by the GPU vendor driver.
> >   
> 
> Hi, Neo,
> 
> Seems we have a different thought on this. Regardless of whether it's a virtual/physical 
> device, imo, VFIO should manage IOMMU configuration. The only difference is:
> 
> - for physical device, VFIO directly invokes IOMMU API to set IOMMU entry (GPA->HPA);
> - for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to IOMMU entry 
> set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);
> 
> This would provide an unified way to manage the translation in VFIO, and then vendor
> specific driver only needs to query and use returned IOVA corresponding to a GPA. 
> 
> Doing so has another benefit, to make underlying vGPU driver VMM agnostic. For KVM,
> yes we can use pci_map_sg. However for Xen it's different (today Dom0 doesn't see
> IOMMU. In the future there'll be a PVIOMMU implementation) so different code path is 
> required. It's better to abstract such specific knowledge out of vGPU driver, which just
> uses whatever dma_addr returned by other agent (VFIO here, or another Xen specific
> agent) in a centralized way.
> 
> Alex, what's your opinion on this?

The sticky point is how vfio, which is only handling the vGPU, has a
reference to the physical GPU on which to call DMA API operations.  If
that reference is provided by the vendor vGPU driver, for example
vgpu_dma_do_translate_for_pci(gpa, pci_dev), I don't see any reason to
be opposed to such an API.  I would not condone vfio deriving or owning
a reference to the physical device on its own though, that's in the
realm of the vendor vGPU driver.  It does seem a bit cleaner and should
reduce duplicate code if the vfio vGPU iommu interface could handle the
iommu mapping for the vendor vgpu driver when necessary.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-03-11 16:13                   ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2016-03-11 16:13 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Ruan, Shuai, Song, Jike, Neo Jia, kvm, qemu-devel,
	Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan

On Fri, 11 Mar 2016 04:46:23 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Friday, March 11, 2016 12:20 PM
> > 
> > On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:  
> > >  
> > > >> Is it supposed to be the caller who should set
> > > >> up IOMMU by DMA api such as dma_map_page(), after calling
> > > >> vgpu_dma_do_translate()?
> > > >>  
> > > >
> > > > Don't think you need to call dma_map_page here. Once you have the pfn available
> > > > to your GPU kernel driver, you can just go ahead to setup the mapping as you
> > > > normally do such as calling pci_map_sg and its friends.
> > > >  
> > >
> > > Technically it's definitely OK to call DMA API from the caller rather than here,
> > > however personally I think it is a bit counter-intuitive: IOMMU page tables
> > > should be constructed within the VFIO IOMMU driver.
> > >  
> > 
> > Hi Jike,
> > 
> > For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore
> > the actual interaction with the real GPU should be managed by the GPU vendor driver.
> >   
> 
> Hi, Neo,
> 
> Seems we have a different thought on this. Regardless of whether it's a virtual/physical 
> device, imo, VFIO should manage IOMMU configuration. The only difference is:
> 
> - for physical device, VFIO directly invokes IOMMU API to set IOMMU entry (GPA->HPA);
> - for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to IOMMU entry 
> set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);
> 
> This would provide an unified way to manage the translation in VFIO, and then vendor
> specific driver only needs to query and use returned IOVA corresponding to a GPA. 
> 
> Doing so has another benefit, to make underlying vGPU driver VMM agnostic. For KVM,
> yes we can use pci_map_sg. However for Xen it's different (today Dom0 doesn't see
> IOMMU. In the future there'll be a PVIOMMU implementation) so different code path is 
> required. It's better to abstract such specific knowledge out of vGPU driver, which just
> uses whatever dma_addr returned by other agent (VFIO here, or another Xen specific
> agent) in a centralized way.
> 
> Alex, what's your opinion on this?

The sticky point is how vfio, which is only handling the vGPU, has a
reference to the physical GPU on which to call DMA API operations.  If
that reference is provided by the vendor vGPU driver, for example
vgpu_dma_do_translate_for_pci(gpa, pci_dev), I don't see any reason to
be opposed to such an API.  I would not condone vfio deriving or owning
a reference to the physical device on its own though, that's in the
realm of the vendor vGPU driver.  It does seem a bit cleaner and should
reduce duplicate code if the vfio vGPU iommu interface could handle the
iommu mapping for the vendor vgpu driver when necessary.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-03-11 16:13                   ` [Qemu-devel] " Alex Williamson
@ 2016-03-11 16:55                     ` Neo Jia
  -1 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-03-11 16:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Song, Jike, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Fri, Mar 11, 2016 at 09:13:15AM -0700, Alex Williamson wrote:
> On Fri, 11 Mar 2016 04:46:23 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > Sent: Friday, March 11, 2016 12:20 PM
> > > 
> > > On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:  
> > > >  
> > > > >> Is it supposed to be the caller who should set
> > > > >> up IOMMU by DMA api such as dma_map_page(), after calling
> > > > >> vgpu_dma_do_translate()?
> > > > >>  
> > > > >
> > > > > Don't think you need to call dma_map_page here. Once you have the pfn available
> > > > > to your GPU kernel driver, you can just go ahead to setup the mapping as you
> > > > > normally do such as calling pci_map_sg and its friends.
> > > > >  
> > > >
> > > > Technically it's definitely OK to call DMA API from the caller rather than here,
> > > > however personally I think it is a bit counter-intuitive: IOMMU page tables
> > > > should be constructed within the VFIO IOMMU driver.
> > > >  
> > > 
> > > Hi Jike,
> > > 
> > > For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore
> > > the actual interaction with the real GPU should be managed by the GPU vendor driver.
> > >   
> > 
> > Hi, Neo,
> > 
> > Seems we have a different thought on this. Regardless of whether it's a virtual/physical 
> > device, imo, VFIO should manage IOMMU configuration. The only difference is:
> > 
> > - for physical device, VFIO directly invokes IOMMU API to set IOMMU entry (GPA->HPA);
> > - for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to IOMMU entry 
> > set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);
> > 
> > This would provide an unified way to manage the translation in VFIO, and then vendor
> > specific driver only needs to query and use returned IOVA corresponding to a GPA. 
> > 
> > Doing so has another benefit, to make underlying vGPU driver VMM agnostic. For KVM,
> > yes we can use pci_map_sg. However for Xen it's different (today Dom0 doesn't see
> > IOMMU. In the future there'll be a PVIOMMU implementation) so different code path is 
> > required. It's better to abstract such specific knowledge out of vGPU driver, which just
> > uses whatever dma_addr returned by other agent (VFIO here, or another Xen specific
> > agent) in a centralized way.
> > 
> > Alex, what's your opinion on this?
> 
> The sticky point is how vfio, which is only handling the vGPU, has a
> reference to the physical GPU on which to call DMA API operations.  If
> that reference is provided by the vendor vGPU driver, for example
> vgpu_dma_do_translate_for_pci(gpa, pci_dev), I don't see any reason to
> be opposed to such an API.  I would not condone vfio deriving or owning
> a reference to the physical device on its own though, that's in the
> realm of the vendor vGPU driver.  It does seem a bit cleaner and should
> reduce duplicate code if the vfio vGPU iommu interface could handle the
> iommu mapping for the vendor vgpu driver when necessary.  Thanks,

Hi Alex,

Since we don't want to allow vfio iommu to derive or own a reference to the
physical device, I think it is still better not providing such pci_dev to the 
vfio iommu type1 driver.

Also, I need to point out that if the vfio iommu is going to set up iommu page
table for the real underlying physical device, giving the fact of single RID we
are all having here, the iommu mapping code has to return the new "IOVA" that is
mapped to the HPA, which the GPU vendro driver will have to put on its DMA
engine. This is very different than the current VFIO IOMMU mapping logic.

And we still have to provide another interface to translate the GPA to
HPA for CPU mapping.

In the current RFC, we only need to have a single interface to provide the most
basic information to the GPU vendor driver and without taking the risk of
leaking a ref to VFIO IOMMU.

Thanks,
Neo

> 
> Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-03-11 16:55                     ` Neo Jia
  0 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-03-11 16:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Ruan, Shuai, Tian, Kevin, kvm, qemu-devel, Song, Jike,
	Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan

On Fri, Mar 11, 2016 at 09:13:15AM -0700, Alex Williamson wrote:
> On Fri, 11 Mar 2016 04:46:23 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > Sent: Friday, March 11, 2016 12:20 PM
> > > 
> > > On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:  
> > > >  
> > > > >> Is it supposed to be the caller who should set
> > > > >> up IOMMU by DMA api such as dma_map_page(), after calling
> > > > >> vgpu_dma_do_translate()?
> > > > >>  
> > > > >
> > > > > Don't think you need to call dma_map_page here. Once you have the pfn available
> > > > > to your GPU kernel driver, you can just go ahead to setup the mapping as you
> > > > > normally do such as calling pci_map_sg and its friends.
> > > > >  
> > > >
> > > > Technically it's definitely OK to call DMA API from the caller rather than here,
> > > > however personally I think it is a bit counter-intuitive: IOMMU page tables
> > > > should be constructed within the VFIO IOMMU driver.
> > > >  
> > > 
> > > Hi Jike,
> > > 
> > > For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore
> > > the actual interaction with the real GPU should be managed by the GPU vendor driver.
> > >   
> > 
> > Hi, Neo,
> > 
> > Seems we have a different thought on this. Regardless of whether it's a virtual/physical 
> > device, imo, VFIO should manage IOMMU configuration. The only difference is:
> > 
> > - for physical device, VFIO directly invokes IOMMU API to set IOMMU entry (GPA->HPA);
> > - for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to IOMMU entry 
> > set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);
> > 
> > This would provide an unified way to manage the translation in VFIO, and then vendor
> > specific driver only needs to query and use returned IOVA corresponding to a GPA. 
> > 
> > Doing so has another benefit, to make underlying vGPU driver VMM agnostic. For KVM,
> > yes we can use pci_map_sg. However for Xen it's different (today Dom0 doesn't see
> > IOMMU. In the future there'll be a PVIOMMU implementation) so different code path is 
> > required. It's better to abstract such specific knowledge out of vGPU driver, which just
> > uses whatever dma_addr returned by other agent (VFIO here, or another Xen specific
> > agent) in a centralized way.
> > 
> > Alex, what's your opinion on this?
> 
> The sticky point is how vfio, which is only handling the vGPU, has a
> reference to the physical GPU on which to call DMA API operations.  If
> that reference is provided by the vendor vGPU driver, for example
> vgpu_dma_do_translate_for_pci(gpa, pci_dev), I don't see any reason to
> be opposed to such an API.  I would not condone vfio deriving or owning
> a reference to the physical device on its own though, that's in the
> realm of the vendor vGPU driver.  It does seem a bit cleaner and should
> reduce duplicate code if the vfio vGPU iommu interface could handle the
> iommu mapping for the vendor vgpu driver when necessary.  Thanks,

Hi Alex,

Since we don't want to allow vfio iommu to derive or own a reference to the
physical device, I think it is still better not providing such pci_dev to the 
vfio iommu type1 driver.

Also, I need to point out that if the vfio iommu is going to set up iommu page
table for the real underlying physical device, giving the fact of single RID we
are all having here, the iommu mapping code has to return the new "IOVA" that is
mapped to the HPA, which the GPU vendro driver will have to put on its DMA
engine. This is very different than the current VFIO IOMMU mapping logic.

And we still have to provide another interface to translate the GPA to
HPA for CPU mapping.

In the current RFC, we only need to have a single interface to provide the most
basic information to the GPU vendor driver and without taking the risk of
leaking a ref to VFIO IOMMU.

Thanks,
Neo

> 
> Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-03-11 16:55                     ` [Qemu-devel] " Neo Jia
@ 2016-03-11 17:56                       ` Alex Williamson
  -1 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2016-03-11 17:56 UTC (permalink / raw)
  To: Neo Jia
  Cc: Tian, Kevin, Song, Jike, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Fri, 11 Mar 2016 08:55:44 -0800
Neo Jia <cjia@nvidia.com> wrote:

> On Fri, Mar 11, 2016 at 09:13:15AM -0700, Alex Williamson wrote:
> > On Fri, 11 Mar 2016 04:46:23 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > Sent: Friday, March 11, 2016 12:20 PM
> > > > 
> > > > On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:    
> > > > >    
> > > > > >> Is it supposed to be the caller who should set
> > > > > >> up IOMMU by DMA api such as dma_map_page(), after calling
> > > > > >> vgpu_dma_do_translate()?
> > > > > >>    
> > > > > >
> > > > > > Don't think you need to call dma_map_page here. Once you have the pfn available
> > > > > > to your GPU kernel driver, you can just go ahead to setup the mapping as you
> > > > > > normally do such as calling pci_map_sg and its friends.
> > > > > >    
> > > > >
> > > > > Technically it's definitely OK to call DMA API from the caller rather than here,
> > > > > however personally I think it is a bit counter-intuitive: IOMMU page tables
> > > > > should be constructed within the VFIO IOMMU driver.
> > > > >    
> > > > 
> > > > Hi Jike,
> > > > 
> > > > For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore
> > > > the actual interaction with the real GPU should be managed by the GPU vendor driver.
> > > >     
> > > 
> > > Hi, Neo,
> > > 
> > > Seems we have a different thought on this. Regardless of whether it's a virtual/physical 
> > > device, imo, VFIO should manage IOMMU configuration. The only difference is:
> > > 
> > > - for physical device, VFIO directly invokes IOMMU API to set IOMMU entry (GPA->HPA);
> > > - for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to IOMMU entry 
> > > set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);
> > > 
> > > This would provide an unified way to manage the translation in VFIO, and then vendor
> > > specific driver only needs to query and use returned IOVA corresponding to a GPA. 
> > > 
> > > Doing so has another benefit, to make underlying vGPU driver VMM agnostic. For KVM,
> > > yes we can use pci_map_sg. However for Xen it's different (today Dom0 doesn't see
> > > IOMMU. In the future there'll be a PVIOMMU implementation) so different code path is 
> > > required. It's better to abstract such specific knowledge out of vGPU driver, which just
> > > uses whatever dma_addr returned by other agent (VFIO here, or another Xen specific
> > > agent) in a centralized way.
> > > 
> > > Alex, what's your opinion on this?  
> > 
> > The sticky point is how vfio, which is only handling the vGPU, has a
> > reference to the physical GPU on which to call DMA API operations.  If
> > that reference is provided by the vendor vGPU driver, for example
> > vgpu_dma_do_translate_for_pci(gpa, pci_dev), I don't see any reason to
> > be opposed to such an API.  I would not condone vfio deriving or owning
> > a reference to the physical device on its own though, that's in the
> > realm of the vendor vGPU driver.  It does seem a bit cleaner and should
> > reduce duplicate code if the vfio vGPU iommu interface could handle the
> > iommu mapping for the vendor vgpu driver when necessary.  Thanks,  
> 
> Hi Alex,
> 
> Since we don't want to allow vfio iommu to derive or own a reference to the
> physical device, I think it is still better not providing such pci_dev to the 
> vfio iommu type1 driver.
> 
> Also, I need to point out that if the vfio iommu is going to set up iommu page
> table for the real underlying physical device, giving the fact of single RID we
> are all having here, the iommu mapping code has to return the new "IOVA" that is
> mapped to the HPA, which the GPU vendro driver will have to put on its DMA
> engine. This is very different than the current VFIO IOMMU mapping logic.
> 
> And we still have to provide another interface to translate the GPA to
> HPA for CPU mapping.
> 
> In the current RFC, we only need to have a single interface to provide the most
> basic information to the GPU vendor driver and without taking the risk of
> leaking a ref to VFIO IOMMU.

I don't see this as some fundamental difference of opinion, it's really
just whether vfio provides a "pin this GFN and return the HPA" function
or whether that function could be extended to include "... and also map
it through the DMA API for the provided device and return the host
IOVA".  It might even still be a single function to vfio for CPU vs
device mapping where the device and IOVA return pointer are NULL when
only pinning is required for CPU access (though maybe there are better
ways to provide CPU access than pinning).  A wrapper could even give the
appearance that those are two separate functions.

So long as vfio isn't owning or deriving the device for the DMA API
calls and we don't introduce some complication in page accounting, this
really just seems like a question of whether moving the DMA API
handling into vfio is common between the vendor vGPU drivers and are we
reducing the overall amount and complexity of code by giving the vendor
drivers the opportunity to do both operations with one interface.
If as Kevin suggest it also provides some additional abstractions
for Xen vs KVM, even better.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-03-11 17:56                       ` Alex Williamson
  0 siblings, 0 replies; 38+ messages in thread
From: Alex Williamson @ 2016-03-11 17:56 UTC (permalink / raw)
  To: Neo Jia
  Cc: Ruan, Shuai, Tian, Kevin, kvm, qemu-devel, Song, Jike,
	Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan

On Fri, 11 Mar 2016 08:55:44 -0800
Neo Jia <cjia@nvidia.com> wrote:

> On Fri, Mar 11, 2016 at 09:13:15AM -0700, Alex Williamson wrote:
> > On Fri, 11 Mar 2016 04:46:23 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > Sent: Friday, March 11, 2016 12:20 PM
> > > > 
> > > > On Thu, Mar 10, 2016 at 11:10:10AM +0800, Jike Song wrote:    
> > > > >    
> > > > > >> Is it supposed to be the caller who should set
> > > > > >> up IOMMU by DMA api such as dma_map_page(), after calling
> > > > > >> vgpu_dma_do_translate()?
> > > > > >>    
> > > > > >
> > > > > > Don't think you need to call dma_map_page here. Once you have the pfn available
> > > > > > to your GPU kernel driver, you can just go ahead to setup the mapping as you
> > > > > > normally do such as calling pci_map_sg and its friends.
> > > > > >    
> > > > >
> > > > > Technically it's definitely OK to call DMA API from the caller rather than here,
> > > > > however personally I think it is a bit counter-intuitive: IOMMU page tables
> > > > > should be constructed within the VFIO IOMMU driver.
> > > > >    
> > > > 
> > > > Hi Jike,
> > > > 
> > > > For vGPU, what we have is just a virtual device and a fake IOMMU group, therefore
> > > > the actual interaction with the real GPU should be managed by the GPU vendor driver.
> > > >     
> > > 
> > > Hi, Neo,
> > > 
> > > Seems we have a different thought on this. Regardless of whether it's a virtual/physical 
> > > device, imo, VFIO should manage IOMMU configuration. The only difference is:
> > > 
> > > - for physical device, VFIO directly invokes IOMMU API to set IOMMU entry (GPA->HPA);
> > > - for virtual device, VFIO invokes kernel DMA APIs which indirectly lead to IOMMU entry 
> > > set if CONFIG_IOMMU is enabled in kernel (GPA->IOVA);
> > > 
> > > This would provide an unified way to manage the translation in VFIO, and then vendor
> > > specific driver only needs to query and use returned IOVA corresponding to a GPA. 
> > > 
> > > Doing so has another benefit, to make underlying vGPU driver VMM agnostic. For KVM,
> > > yes we can use pci_map_sg. However for Xen it's different (today Dom0 doesn't see
> > > IOMMU. In the future there'll be a PVIOMMU implementation) so different code path is 
> > > required. It's better to abstract such specific knowledge out of vGPU driver, which just
> > > uses whatever dma_addr returned by other agent (VFIO here, or another Xen specific
> > > agent) in a centralized way.
> > > 
> > > Alex, what's your opinion on this?  
> > 
> > The sticky point is how vfio, which is only handling the vGPU, has a
> > reference to the physical GPU on which to call DMA API operations.  If
> > that reference is provided by the vendor vGPU driver, for example
> > vgpu_dma_do_translate_for_pci(gpa, pci_dev), I don't see any reason to
> > be opposed to such an API.  I would not condone vfio deriving or owning
> > a reference to the physical device on its own though, that's in the
> > realm of the vendor vGPU driver.  It does seem a bit cleaner and should
> > reduce duplicate code if the vfio vGPU iommu interface could handle the
> > iommu mapping for the vendor vgpu driver when necessary.  Thanks,  
> 
> Hi Alex,
> 
> Since we don't want to allow vfio iommu to derive or own a reference to the
> physical device, I think it is still better not providing such pci_dev to the 
> vfio iommu type1 driver.
> 
> Also, I need to point out that if the vfio iommu is going to set up iommu page
> table for the real underlying physical device, giving the fact of single RID we
> are all having here, the iommu mapping code has to return the new "IOVA" that is
> mapped to the HPA, which the GPU vendro driver will have to put on its DMA
> engine. This is very different than the current VFIO IOMMU mapping logic.
> 
> And we still have to provide another interface to translate the GPA to
> HPA for CPU mapping.
> 
> In the current RFC, we only need to have a single interface to provide the most
> basic information to the GPU vendor driver and without taking the risk of
> leaking a ref to VFIO IOMMU.

I don't see this as some fundamental difference of opinion, it's really
just whether vfio provides a "pin this GFN and return the HPA" function
or whether that function could be extended to include "... and also map
it through the DMA API for the provided device and return the host
IOVA".  It might even still be a single function to vfio for CPU vs
device mapping where the device and IOVA return pointer are NULL when
only pinning is required for CPU access (though maybe there are better
ways to provide CPU access than pinning).  A wrapper could even give the
appearance that those are two separate functions.

So long as vfio isn't owning or deriving the device for the DMA API
calls and we don't introduce some complication in page accounting, this
really just seems like a question of whether moving the DMA API
handling into vfio is common between the vendor vGPU drivers and are we
reducing the overall amount and complexity of code by giving the vendor
drivers the opportunity to do both operations with one interface.
If as Kevin suggest it also provides some additional abstractions
for Xen vs KVM, even better.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
  2016-03-11 17:56                       ` [Qemu-devel] " Alex Williamson
@ 2016-03-11 18:18                         ` Neo Jia
  -1 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-03-11 18:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Ruan, Shuai, Tian, Kevin, kvm, qemu-devel, Song, Jike,
	Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan

On Fri, Mar 11, 2016 at 10:56:24AM -0700, Alex Williamson wrote:
> On Fri, 11 Mar 2016 08:55:44 -0800
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > > > Alex, what's your opinion on this?  
> > > 
> > > The sticky point is how vfio, which is only handling the vGPU, has a
> > > reference to the physical GPU on which to call DMA API operations.  If
> > > that reference is provided by the vendor vGPU driver, for example
> > > vgpu_dma_do_translate_for_pci(gpa, pci_dev), I don't see any reason to
> > > be opposed to such an API.  I would not condone vfio deriving or owning
> > > a reference to the physical device on its own though, that's in the
> > > realm of the vendor vGPU driver.  It does seem a bit cleaner and should
> > > reduce duplicate code if the vfio vGPU iommu interface could handle the
> > > iommu mapping for the vendor vgpu driver when necessary.  Thanks,  
> > 
> > Hi Alex,
> > 
> > Since we don't want to allow vfio iommu to derive or own a reference to the
> > physical device, I think it is still better not providing such pci_dev to the 
> > vfio iommu type1 driver.
> > 
> > Also, I need to point out that if the vfio iommu is going to set up iommu page
> > table for the real underlying physical device, giving the fact of single RID we
> > are all having here, the iommu mapping code has to return the new "IOVA" that is
> > mapped to the HPA, which the GPU vendro driver will have to put on its DMA
> > engine. This is very different than the current VFIO IOMMU mapping logic.
> > 
> > And we still have to provide another interface to translate the GPA to
> > HPA for CPU mapping.
> > 
> > In the current RFC, we only need to have a single interface to provide the most
> > basic information to the GPU vendor driver and without taking the risk of
> > leaking a ref to VFIO IOMMU.
> 
> I don't see this as some fundamental difference of opinion, it's really
> just whether vfio provides a "pin this GFN and return the HPA" function
> or whether that function could be extended to include "... and also map
> it through the DMA API for the provided device and return the host
> IOVA".  It might even still be a single function to vfio for CPU vs
> device mapping where the device and IOVA return pointer are NULL when
> only pinning is required for CPU access (though maybe there are better
> ways to provide CPU access than pinning).  A wrapper could even give the
> appearance that those are two separate functions.
> 
> So long as vfio isn't owning or deriving the device for the DMA API
> calls and we don't introduce some complication in page accounting, this
> really just seems like a question of whether moving the DMA API
> handling into vfio is common between the vendor vGPU drivers and are we
> reducing the overall amount and complexity of code by giving the vendor
> drivers the opportunity to do both operations with one interface.

Hi Alex,

OK, I will look into of adding such facilitation and probably include it in a
bit later rev of VGPU IOMMU if we don't run any surprise or the issues you
mentioned above.

Thanks,
Neo

> If as Kevin suggest it also provides some additional abstractions
> for Xen vs KVM, even better.  Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU
@ 2016-03-11 18:18                         ` Neo Jia
  0 siblings, 0 replies; 38+ messages in thread
From: Neo Jia @ 2016-03-11 18:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Ruan, Shuai, Tian, Kevin, kvm, qemu-devel, Song, Jike,
	Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan

On Fri, Mar 11, 2016 at 10:56:24AM -0700, Alex Williamson wrote:
> On Fri, 11 Mar 2016 08:55:44 -0800
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > > > Alex, what's your opinion on this?  
> > > 
> > > The sticky point is how vfio, which is only handling the vGPU, has a
> > > reference to the physical GPU on which to call DMA API operations.  If
> > > that reference is provided by the vendor vGPU driver, for example
> > > vgpu_dma_do_translate_for_pci(gpa, pci_dev), I don't see any reason to
> > > be opposed to such an API.  I would not condone vfio deriving or owning
> > > a reference to the physical device on its own though, that's in the
> > > realm of the vendor vGPU driver.  It does seem a bit cleaner and should
> > > reduce duplicate code if the vfio vGPU iommu interface could handle the
> > > iommu mapping for the vendor vgpu driver when necessary.  Thanks,  
> > 
> > Hi Alex,
> > 
> > Since we don't want to allow vfio iommu to derive or own a reference to the
> > physical device, I think it is still better not providing such pci_dev to the 
> > vfio iommu type1 driver.
> > 
> > Also, I need to point out that if the vfio iommu is going to set up iommu page
> > table for the real underlying physical device, giving the fact of single RID we
> > are all having here, the iommu mapping code has to return the new "IOVA" that is
> > mapped to the HPA, which the GPU vendro driver will have to put on its DMA
> > engine. This is very different than the current VFIO IOMMU mapping logic.
> > 
> > And we still have to provide another interface to translate the GPA to
> > HPA for CPU mapping.
> > 
> > In the current RFC, we only need to have a single interface to provide the most
> > basic information to the GPU vendor driver and without taking the risk of
> > leaking a ref to VFIO IOMMU.
> 
> I don't see this as some fundamental difference of opinion, it's really
> just whether vfio provides a "pin this GFN and return the HPA" function
> or whether that function could be extended to include "... and also map
> it through the DMA API for the provided device and return the host
> IOVA".  It might even still be a single function to vfio for CPU vs
> device mapping where the device and IOVA return pointer are NULL when
> only pinning is required for CPU access (though maybe there are better
> ways to provide CPU access than pinning).  A wrapper could even give the
> appearance that those are two separate functions.
> 
> So long as vfio isn't owning or deriving the device for the DMA API
> calls and we don't introduce some complication in page accounting, this
> really just seems like a question of whether moving the DMA API
> handling into vfio is common between the vendor vGPU drivers and are we
> reducing the overall amount and complexity of code by giving the vendor
> drivers the opportunity to do both operations with one interface.

Hi Alex,

OK, I will look into of adding such facilitation and probably include it in a
bit later rev of VGPU IOMMU if we don't run any surprise or the issues you
mentioned above.

Thanks,
Neo

> If as Kevin suggest it also provides some additional abstractions
> for Xen vs KVM, even better.  Thanks,
> 
> Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2016-03-11 18:18 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-23 16:24 [RFC PATCH v2 1/3] vGPU Core driver Kirti Wankhede
2016-02-23 16:24 ` [Qemu-devel] " Kirti Wankhede
2016-02-23 16:24 ` [RFC PATCH v2 2/3] VFIO driver for vGPU device Kirti Wankhede
2016-02-23 16:24   ` [Qemu-devel] " Kirti Wankhede
2016-02-23 16:24 ` [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU Kirti Wankhede
2016-02-23 16:24   ` [Qemu-devel] " Kirti Wankhede
2016-03-02  8:38   ` Jike Song
2016-03-02  8:38     ` [Qemu-devel] " Jike Song
2016-03-04  7:00     ` Neo Jia
2016-03-04  7:00       ` [Qemu-devel] " Neo Jia
2016-03-07  6:07       ` Jike Song
2016-03-07  6:07         ` [Qemu-devel] " Jike Song
2016-03-08  0:31         ` Neo Jia
2016-03-08  0:31           ` [Qemu-devel] " Neo Jia
2016-03-10  3:10           ` Jike Song
2016-03-10  3:10             ` [Qemu-devel] " Jike Song
2016-03-11  4:19             ` Neo Jia
2016-03-11  4:19               ` [Qemu-devel] " Neo Jia
2016-03-11  4:46               ` Tian, Kevin
2016-03-11  4:46                 ` [Qemu-devel] " Tian, Kevin
2016-03-11  6:10                 ` Neo Jia
2016-03-11  6:10                   ` [Qemu-devel] " Neo Jia
2016-03-11  8:06                   ` Tian, Kevin
2016-03-11  8:06                     ` [Qemu-devel] " Tian, Kevin
2016-03-11 16:13                 ` Alex Williamson
2016-03-11 16:13                   ` [Qemu-devel] " Alex Williamson
2016-03-11 16:55                   ` Neo Jia
2016-03-11 16:55                     ` [Qemu-devel] " Neo Jia
2016-03-11 17:56                     ` Alex Williamson
2016-03-11 17:56                       ` [Qemu-devel] " Alex Williamson
2016-03-11 18:18                       ` Neo Jia
2016-03-11 18:18                         ` [Qemu-devel] " Neo Jia
2016-02-29  5:39 ` [RFC PATCH v2 1/3] vGPU Core driver Tian, Kevin
2016-02-29  5:39   ` [Qemu-devel] " Tian, Kevin
2016-02-29 23:17   ` Neo Jia
2016-02-29 23:17     ` [Qemu-devel] " Neo Jia
2016-03-01  3:10     ` Jike Song
2016-03-01  3:10       ` [Qemu-devel] " Jike Song

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.