* [RFC PATCH v3 0/3] Add vGPU support
@ 2016-05-02 18:40 ` Kirti Wankhede
  0 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-02 18:40 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	Kirti Wankhede

This series adds vGPU support to the v4.6 Linux host kernel. The purpose of this
series is to provide a common interface for vGPU management that can be used by
different GPU drivers. The series introduces a vGPU core module that creates and
manages vGPU devices, a VFIO-based driver for the vGPU devices created by the
vGPU core module, and an update to the VFIO type1 IOMMU module to support vGPU
devices.

What's new in v3?
The VFIO type1 IOMMU module supports devices that are IOMMU capable. This
version of the patch set adds support for vGPU devices, which are not IOMMU
capable, so that they can use the existing VFIO IOMMU module. The VFIO type1
IOMMU patch provides a new set of APIs for guest page translation.
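
For reference, the related helpers that patch 1/3 declares in include/linux/vgpu.h
are:

extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
				uint32_t len, uint32_t flags);
extern int vgpu_dma_do_translate(dma_addr_t *gfn_buffer, uint32_t count);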

What's left to do?
The VFIO driver for vGPU devices does not yet support devices with MSI-X enabled.

Please review.

Thanks,
Kirti

Kirti Wankhede (3):
  vGPU Core driver
  VFIO driver for vGPU device
  VFIO Type1 IOMMU change: to support with iommu and without iommu

 drivers/Kconfig                 |    2 +
 drivers/Makefile                |    1 +
 drivers/vfio/vfio_iommu_type1.c |  427 +++++++++++++++++++++++--
 drivers/vgpu/Kconfig            |   21 ++
 drivers/vgpu/Makefile           |    5 +
 drivers/vgpu/vgpu-core.c        |  424 ++++++++++++++++++++++++
 drivers/vgpu/vgpu-driver.c      |  136 ++++++++
 drivers/vgpu/vgpu-sysfs.c       |  365 +++++++++++++++++++++
 drivers/vgpu/vgpu_private.h     |   36 ++
 drivers/vgpu/vgpu_vfio.c        |  671 +++++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h            |    6 +
 include/linux/vgpu.h            |  216 +++++++++++++
 12 files changed, 2278 insertions(+), 32 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vgpu-core.c
 create mode 100644 drivers/vgpu/vgpu-driver.c
 create mode 100644 drivers/vgpu/vgpu-sysfs.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 drivers/vgpu/vgpu_vfio.c
 create mode 100644 include/linux/vgpu.h



* [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-02 18:40 ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-02 18:40   ` Kirti Wankhede
  -1 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-02 18:40 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	Kirti Wankhede

Design for vGPU Driver:
The main purpose of the vGPU driver is to provide a common interface for vGPU
management that can be used by different GPU drivers.

This module provides a generic interface to create a vGPU device, add it to the
vGPU bus, add the device to an IOMMU group, and then add it to a VFIO group.

High Level block diagram:

+--------------+    vgpu_register_driver()+---------------+
|     __init() +------------------------->+               |
|              |                          |               |
|              +<-------------------------+    vgpu.ko    |
| vgpu_vfio.ko |   probe()/remove()       |               |
|              |                +---------+               +---------+
+--------------+                |         +-------+-------+         |
                                |                 ^                 |
                                | callback        |                 |
                                |         +-------+--------+        |
                                |         |vgpu_register_device()   |
                                |         |                |        |
                                +---^-----+-----+    +-----+------+-+
                                    | nvidia.ko |    |  i915.ko   |
                                    |           |    |            |
                                    +-----------+    +------------+

The vGPU driver provides two types of registration interfaces:
1. Registration interface for vGPU bus driver:

/**
  * struct vgpu_driver - vGPU device driver
  * @name: driver name
  * @probe: called when new device created
  * @remove: called when device removed
  * @driver: device driver structure
  *
  **/
struct vgpu_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         struct device_driver    driver;
};

int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
void vgpu_unregister_driver(struct vgpu_driver *drv);

The VFIO bus driver for vGPU should use this interface to register with the
vGPU driver. With this, the VFIO bus driver for vGPU devices is responsible for
adding vGPU devices to VFIO groups.
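
A minimal sketch of how a vGPU bus driver such as vgpu_vfio.ko might hook into
this interface (the my_vgpu_* names and the callback bodies are illustrative
placeholders, not part of this patch):

static int my_vgpu_probe(struct device *dev)
{
	/* bind the newly created vGPU device, e.g. add it to a VFIO group */
	return 0;
}

static void my_vgpu_remove(struct device *dev)
{
	/* undo whatever probe() set up */
}

static struct vgpu_driver my_vgpu_driver = {
	.name   = "vgpu_vfio",
	.probe  = my_vgpu_probe,
	.remove = my_vgpu_remove,
};

static int __init my_vgpu_init(void)
{
	return vgpu_register_driver(&my_vgpu_driver, THIS_MODULE);
}

static void __exit my_vgpu_exit(void)
{
	vgpu_unregister_driver(&my_vgpu_driver);
}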

2. GPU driver interface
The GPU driver interface provides GPU drivers with a set of APIs to manage
vGPU-related work in their own driver. The APIs are:
- vgpu_supported_config: provide the list of vGPU configurations supported by
  the GPU.
- vgpu_create: to allocate basic resources in the GPU driver for a vGPU device.
- vgpu_destroy: to free resources in the GPU driver when a vGPU device is
  destroyed.
- vgpu_start: to initiate the vGPU initialization process from the GPU driver
  when the VM boots, before QEMU starts.
- vgpu_shutdown: to tear down vGPU resources during VM teardown.
- read: read emulation callback.
- write: write emulation callback.
- vgpu_set_irqs: send the interrupt configuration information that QEMU sets.
- vgpu_bar_info: to provide BAR size and its flags for the vGPU device.
- validate_map_request: to validate a remap pfn request.

This registration interface should be used by GPU drivers to register each
physical device with the vGPU driver, for example as sketched below.
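
A minimal sketch of such a registration, assuming a hypothetical vendor driver
(the my_gpu_* names and callback bodies are placeholders, and only a subset of
the gpu_device_ops callbacks is shown):

static int my_gpu_vgpu_create(struct pci_dev *dev, uuid_le uuid,
			      uint32_t instance, char *vgpu_params)
{
	/* allocate vendor-specific resources for this vGPU instance */
	return 0;
}

static int my_gpu_vgpu_destroy(struct pci_dev *dev, uuid_le uuid,
			       uint32_t instance)
{
	/* free the resources allocated in vgpu_create */
	return 0;
}

static const struct gpu_device_ops my_gpu_vgpu_ops = {
	.owner        = THIS_MODULE,
	.vgpu_create  = my_gpu_vgpu_create,
	.vgpu_destroy = my_gpu_vgpu_destroy,
	/* vgpu_start, vgpu_shutdown, read, write, etc. omitted here */
};

static int my_gpu_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* register the physical GPU with the vGPU core module */
	return vgpu_register_device(pdev, &my_gpu_vgpu_ops);
}

static void my_gpu_pci_remove(struct pci_dev *pdev)
{
	vgpu_unregister_device(pdev);
}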

This patch has been updated with a couple more functions in the GPU driver
interface that were discussed during the v1 review of this RFC.

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I1c13c411f61b7b2e750e85adfe1b097f9fd218b9
---
 drivers/Kconfig             |    2 +
 drivers/Makefile            |    1 +
 drivers/vgpu/Kconfig        |   21 ++
 drivers/vgpu/Makefile       |    4 +
 drivers/vgpu/vgpu-core.c    |  424 +++++++++++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu-driver.c  |  136 ++++++++++++++
 drivers/vgpu/vgpu-sysfs.c   |  365 +++++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_private.h |   36 ++++
 include/linux/vgpu.h        |  216 ++++++++++++++++++++++
 9 files changed, 1205 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vgpu-core.c
 create mode 100644 drivers/vgpu/vgpu-driver.c
 create mode 100644 drivers/vgpu/vgpu-sysfs.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 include/linux/vgpu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339..5fd9eae 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 8f5d076..36f1110 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
 obj-$(CONFIG_VFIO)		+= vfio/
+obj-$(CONFIG_VFIO)		+= vgpu/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
new file mode 100644
index 0000000..792eb48
--- /dev/null
+++ b/drivers/vgpu/Kconfig
@@ -0,0 +1,21 @@
+
+menuconfig VGPU
+    tristate "VGPU driver framework"
+    depends on VFIO
+    select VGPU_VFIO
+    help
+        VGPU provides a framework to virtualize GPUs without SR-IOV capability.
+        See Documentation/vgpu.txt for more details.
+
+        If you don't know what to do here, say N.
+
+config VGPU
+    tristate
+    depends on VFIO
+    default n
+
+config VGPU_VFIO
+    tristate
+    depends on VGPU
+    default n
+
diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
new file mode 100644
index 0000000..f5be980
--- /dev/null
+++ b/drivers/vgpu/Makefile
@@ -0,0 +1,4 @@
+
+vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
+
+obj-$(CONFIG_VGPU)			+= vgpu.o
diff --git a/drivers/vgpu/vgpu-core.c b/drivers/vgpu/vgpu-core.c
new file mode 100644
index 0000000..1a7d274
--- /dev/null
+++ b/drivers/vgpu/vgpu-core.c
@@ -0,0 +1,424 @@
+/*
+ * VGPU Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC	"VGPU Core Driver"
+
+/*
+ * #defines
+ */
+
+#define VGPU_CLASS_NAME		"vgpu"
+
+/*
+ * Global Structures
+ */
+
+static struct vgpu {
+	struct list_head    vgpu_devices_list;
+	struct mutex        vgpu_devices_lock;
+	struct list_head    gpu_devices_list;
+	struct mutex        gpu_devices_lock;
+} vgpu;
+
+static struct class vgpu_class;
+
+/*
+ * Functions
+ */
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->group) {
+			if (iommu_group_id(vdev->group) == iommu_group_id(group)) {
+				mutex_unlock(&vgpu.vgpu_devices_lock);
+				return vdev;
+			}
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+EXPORT_SYMBOL_GPL(get_vgpu_device_from_group);
+
+static int vgpu_add_attribute_group(struct device *dev,
+			            const struct attribute_group **groups)
+{
+        return sysfs_create_groups(&dev->kobj, groups);
+}
+
+static void vgpu_remove_attribute_group(struct device *dev,
+			                const struct attribute_group **groups)
+{
+        sysfs_remove_groups(&dev->kobj, groups);
+}
+
+int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev, *tmp;
+
+	if (!dev)
+		return -EINVAL;
+
+        gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL);
+        if (!gpu_dev)
+                return -ENOMEM;
+
+	gpu_dev->dev = dev;
+        gpu_dev->ops = ops;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+
+        /* Check for duplicates */
+        list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) {
+                if (tmp->dev == dev) {
+			ret = -EINVAL;
+			goto add_error;
+                }
+        }
+
+	ret = vgpu_create_pci_device_files(dev);
+	if (ret)
+		goto add_error;
+
+	ret = vgpu_add_attribute_group(&dev->dev, ops->dev_attr_groups);
+	if (ret)
+		goto add_group_error;
+
+        list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list);
+
+	printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n",
+			 dev->vendor, dev->device, dev->class);
+        mutex_unlock(&vgpu.gpu_devices_lock);
+
+        return 0;
+
+add_group_error:
+	vgpu_remove_pci_device_files(dev);
+add_error:
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	kfree(gpu_dev);
+	return ret;
+
+}
+EXPORT_SYMBOL(vgpu_register_device);
+
+void vgpu_unregister_device(struct pci_dev *dev)
+{
+        struct gpu_device *gpu_dev;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+        list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		struct vgpu_device *vdev = NULL;
+
+                if (gpu_dev->dev != dev)
+			continue;
+
+		printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, class 0x%x\n",
+				dev->vendor, dev->device, dev->class);
+
+		list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+			if (vdev->gpu_dev != gpu_dev)
+				continue;
+			destroy_vgpu_device(vdev);
+		}
+		vgpu_remove_attribute_group(&dev->dev, gpu_dev->ops->dev_attr_groups);
+		vgpu_remove_pci_device_files(dev);
+		list_del(&gpu_dev->gpu_next);
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		kfree(gpu_dev);
+		return;
+        }
+        mutex_unlock(&vgpu.gpu_devices_lock);
+}
+EXPORT_SYMBOL(vgpu_unregister_device);
+
+/*
+ * Helper Functions
+ */
+
+static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char *name)
+{
+	struct vgpu_device *vgpu_dev = NULL;
+
+	vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL);
+	if (!vgpu_dev)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&vgpu_dev->kref);
+	memcpy(&vgpu_dev->uuid, &uuid, sizeof(uuid_le));
+	vgpu_dev->vgpu_instance = instance;
+	strcpy(vgpu_dev->dev_name, name);
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	return vgpu_dev;
+}
+
+static void vgpu_device_free(struct vgpu_device *vgpu_dev)
+{
+	if (vgpu_dev) {
+		mutex_lock(&vgpu.vgpu_devices_lock);
+		list_del(&vgpu_dev->list);
+		mutex_unlock(&vgpu.vgpu_devices_lock);
+		kfree(vgpu_dev);
+	}
+	return;
+}
+
+struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->uuid, uuid) == 0) &&
+		    (vdev->vgpu_instance == instance)) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+static void vgpu_device_release(struct device *dev)
+{
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	vgpu_device_free(vgpu_dev);
+}
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le uuid, uint32_t instance, char *vgpu_params)
+{
+	char name[64];
+	int numChar = 0;
+	int retval = 0;
+	struct vgpu_device *vgpu_dev = NULL;
+	struct gpu_device *gpu_dev;
+
+	printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__);
+
+	numChar = sprintf(name, "%pUb-%d", uuid.b, instance);
+	name[numChar] = '\0';
+
+	vgpu_dev = vgpu_device_alloc(uuid, instance, name);
+	if (IS_ERR(vgpu_dev)) {
+		return PTR_ERR(vgpu_dev);
+	}
+
+	vgpu_dev->dev.parent  = &pdev->dev;
+	vgpu_dev->dev.bus     = &vgpu_bus_type;
+	vgpu_dev->dev.release = vgpu_device_release;
+	dev_set_name(&vgpu_dev->dev, "%s", name);
+
+	retval = device_register(&vgpu_dev->dev);
+	if (retval)
+		goto create_failed1;
+
+	printk(KERN_INFO "UUID %pUb \n", vgpu_dev->uuid.b);
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev != pdev)
+			continue;
+
+		vgpu_dev->gpu_dev = gpu_dev;
+		if (gpu_dev->ops->vgpu_create) {
+			retval = gpu_dev->ops->vgpu_create(pdev, vgpu_dev->uuid,
+							   instance, vgpu_params);
+			if (retval) {
+				mutex_unlock(&vgpu.gpu_devices_lock);
+				goto create_failed2;
+			}
+		}
+		break;
+	}
+	if (!vgpu_dev->gpu_dev) {
+		retval = -EINVAL;
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		goto create_failed2;
+	}
+
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	retval = vgpu_add_attribute_group(&vgpu_dev->dev, gpu_dev->ops->vgpu_attr_groups);
+	if (retval)
+		goto create_attr_error;
+
+	return retval;
+
+create_attr_error:
+	if (gpu_dev->ops->vgpu_destroy) {
+		int ret = 0;
+		ret = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
+						 vgpu_dev->uuid,
+						 vgpu_dev->vgpu_instance);
+	}
+
+create_failed2:
+	device_unregister(&vgpu_dev->dev);
+
+create_failed1:
+	vgpu_device_free(vgpu_dev);
+
+	return retval;
+}
+
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev)
+{
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name);
+	if (gpu_dev->ops->vgpu_destroy) {
+		int retval = 0;
+		retval = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
+						    vgpu_dev->uuid,
+						    vgpu_dev->vgpu_instance);
+	/* if vendor driver doesn't return success that means vendor driver doesn't
+	 * support hot-unplug */
+		if (retval)
+			return;
+	}
+
+	vgpu_remove_attribute_group(&vgpu_dev->dev, gpu_dev->ops->vgpu_attr_groups);
+	device_unregister(&vgpu_dev->dev);
+}
+
+void get_vgpu_supported_types(struct device *dev, char *str)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (&gpu_dev->dev->dev == dev) {
+			if (gpu_dev->ops->vgpu_supported_config)
+				gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str);
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+
+int vgpu_start_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (gpu_dev->ops->vgpu_start)
+		ret = gpu_dev->ops->vgpu_start(vgpu_dev->uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (gpu_dev->ops->vgpu_shutdown)
+		ret = gpu_dev->ops->vgpu_shutdown(vgpu_dev->uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+char *vgpu_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev));
+}
+
+static void release_vgpubus_dev(struct device *dev)
+{
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	destroy_vgpu_device(vgpu_dev);
+}
+
+static struct class vgpu_class = {
+	.name		= VGPU_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= vgpu_class_attrs,
+	.dev_groups	= vgpu_dev_groups,
+	.devnode	= vgpu_devnode,
+	.dev_release    = release_vgpubus_dev,
+};
+
+static int __init vgpu_init(void)
+{
+	int rc = 0;
+
+	memset(&vgpu, 0 , sizeof(vgpu));
+
+	mutex_init(&vgpu.vgpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
+	mutex_init(&vgpu.gpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.gpu_devices_list);
+
+	rc = class_register(&vgpu_class);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu class\n");
+		goto failed1;
+	}
+
+	rc = vgpu_bus_register();
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu bus\n");
+		class_unregister(&vgpu_class);
+	}
+
+    request_module_nowait("vgpu_vfio");
+
+failed1:
+	return rc;
+}
+
+static void __exit vgpu_exit(void)
+{
+	vgpu_bus_unregister();
+	class_unregister(&vgpu_class);
+}
+
+module_init(vgpu_init)
+module_exit(vgpu_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vgpu/vgpu-driver.c b/drivers/vgpu/vgpu-driver.c
new file mode 100644
index 0000000..c4c2e9f
--- /dev/null
+++ b/drivers/vgpu/vgpu-driver.c
@@ -0,0 +1,136 @@
+/*
+ * VGPU driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+static int vgpu_device_attach_iommu(struct vgpu_device *vgpu_dev)
+{
+        int retval = 0;
+        struct iommu_group *group = NULL;
+
+        group = iommu_group_alloc();
+        if (IS_ERR(group)) {
+                printk(KERN_ERR "VGPU: failed to allocate group!\n");
+                return PTR_ERR(group);
+        }
+
+        retval = iommu_group_add_device(group, &vgpu_dev->dev);
+        if (retval) {
+                printk(KERN_ERR "VGPU: failed to add dev to group!\n");
+                iommu_group_put(group);
+                return retval;
+        }
+
+        vgpu_dev->group = group;
+
+        printk(KERN_INFO "VGPU: group_id = %d \n", iommu_group_id(group));
+        return retval;
+}
+
+static void vgpu_device_detach_iommu(struct vgpu_device *vgpu_dev)
+{
+        iommu_group_put(vgpu_dev->dev.iommu_group);
+        iommu_group_remove_device(&vgpu_dev->dev);
+        printk(KERN_INFO "VGPU: detaching iommu \n");
+}
+
+static int vgpu_device_probe(struct device *dev)
+{
+	struct vgpu_driver *drv = to_vgpu_driver(dev->driver);
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	int status = 0;
+
+	status = vgpu_device_attach_iommu(vgpu_dev);
+	if (status) {
+		printk(KERN_ERR "Failed to attach IOMMU\n");
+		return status;
+	}
+
+	if (drv && drv->probe) {
+		status = drv->probe(dev);
+	}
+
+	return status;
+}
+
+static int vgpu_device_remove(struct device *dev)
+{
+	struct vgpu_driver *drv = to_vgpu_driver(dev->driver);
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	int status = 0;
+
+	if (drv && drv->remove) {
+		drv->remove(dev);
+	}
+
+	vgpu_device_detach_iommu(vgpu_dev);
+
+	return status;
+}
+
+struct bus_type vgpu_bus_type = {
+	.name		= "vgpu",
+	.probe		= vgpu_device_probe,
+	.remove		= vgpu_device_remove,
+};
+EXPORT_SYMBOL_GPL(vgpu_bus_type);
+
+/**
+ * vgpu_register_driver - register a new vGPU driver
+ * @drv: the driver to register
+ * @owner: owner module of the driver to register
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int vgpu_register_driver(struct vgpu_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &vgpu_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(vgpu_register_driver);
+
+/**
+ * vgpu_unregister_driver - unregister vGPU driver
+ * @drv: the driver to unregister
+ *
+ */
+void vgpu_unregister_driver(struct vgpu_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(vgpu_unregister_driver);
+
+int vgpu_bus_register(void)
+{
+	return bus_register(&vgpu_bus_type);
+}
+
+void vgpu_bus_unregister(void)
+{
+	bus_unregister(&vgpu_bus_type);
+}
diff --git a/drivers/vgpu/vgpu-sysfs.c b/drivers/vgpu/vgpu-sysfs.c
new file mode 100644
index 0000000..b740f9a
--- /dev/null
+++ b/drivers/vgpu/vgpu-sysfs.c
@@ -0,0 +1,365 @@
+/*
+ * File attributes for vGPU devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+/* Prototypes */
+
+static ssize_t vgpu_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static DEVICE_ATTR_RO(vgpu_supported_types);
+
+static ssize_t vgpu_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_create);
+
+static ssize_t vgpu_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_destroy);
+
+
+/* Static functions */
+
+static bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < 36)
+		return -1;
+
+	for (i = 0; i < 16; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			printk(KERN_ERR "%s err", __FUNCTION__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+
+/* Functions */
+static ssize_t vgpu_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	char *str;
+	ssize_t n;
+
+        str = kzalloc(sizeof(*str) * 512, GFP_KERNEL);
+        if (!str)
+                return -ENOMEM;
+
+	get_vgpu_supported_types(dev, str);
+
+	n = sprintf(buf,"%s\n", str);
+	kfree(str);
+
+	return n;
+}
+
+static ssize_t vgpu_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	char *str, *pstr;
+	char *uuid_str, *instance_str, *vgpu_params = NULL;
+	uuid_le uuid;
+	uint32_t instance;
+	struct pci_dev *pdev;
+	int ret = 0;
+
+	pstr = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (!str) {
+		printk(KERN_ERR "%s vgpu instance not specified %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if ((instance_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty instance or string %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	instance = (unsigned int)simple_strtoul(instance_str, NULL, 0);
+
+	if (!str) {
+		printk(KERN_ERR "%s vgpu params not specified %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	vgpu_params = kstrdup(str, GFP_KERNEL);
+
+	if (!vgpu_params) {
+		printk(KERN_ERR "%s vgpu params allocation failed \n",
+				 __FUNCTION__);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (dev_is_pci(dev)) {
+		pdev = to_pci_dev(dev);
+
+		if (create_vgpu_device(pdev, uuid, instance, vgpu_params) < 0) {
+			printk(KERN_ERR "%s vgpu create error \n", __FUNCTION__);
+			ret = -EINVAL;
+			goto create_error;
+		}
+		ret = count;
+	}
+
+create_error:
+	if (vgpu_params)
+		kfree(vgpu_params);
+
+	if (pstr)
+		kfree(pstr);
+	return ret;
+}
+
+static ssize_t vgpu_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	char *uuid_str, *str;
+	uuid_le uuid;
+	unsigned int instance;
+	struct vgpu_device *vgpu_dev = NULL;
+
+	str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s instance not specified %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	instance = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	printk(KERN_INFO "%s UUID %pUb - %d \n", __FUNCTION__, uuid.b, instance);
+
+	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, instance);
+
+	if (vgpu_dev)
+		destroy_vgpu_device(vgpu_dev);
+
+	return count;
+}
+
+static ssize_t
+vgpu_uuid_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = to_vgpu_device(dev);
+
+	if (drv)
+		return sprintf(buf, "%pUb \n", drv->uuid.b);
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_uuid);
+
+static ssize_t
+vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *drv = to_vgpu_device(dev);
+
+	if (drv && drv->group)
+		return sprintf(buf, "%d \n", iommu_group_id(drv->group));
+
+	return sprintf(buf, " \n");
+}
+
+static DEVICE_ATTR_RO(vgpu_group_id);
+
+
+static struct attribute *vgpu_dev_attrs[] = {
+	&dev_attr_vgpu_uuid.attr,
+	&dev_attr_vgpu_group_id.attr,
+	NULL,
+};
+
+static const struct attribute_group vgpu_dev_group = {
+	.attrs = vgpu_dev_attrs,
+};
+
+const struct attribute_group *vgpu_dev_groups[] = {
+	&vgpu_dev_group,
+	NULL,
+};
+
+
+ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr,
+			 const char *buf, size_t count)
+{
+	char *uuid_str;
+	uuid_le uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+
+	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, 0);
+
+	if (vgpu_dev && dev_is_vgpu(&vgpu_dev->dev)) {
+		kobject_uevent(&vgpu_dev->dev.kobj, KOBJ_ONLINE);
+
+		ret = vgpu_start_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_start callback failed  %d \n",
+					 __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *uuid_str;
+	uuid_le uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		return -EINVAL;
+	}
+	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, 0);
+
+	if (vgpu_dev && dev_is_vgpu(&vgpu_dev->dev)) {
+		kobject_uevent(&vgpu_dev->dev.kobj, KOBJ_OFFLINE);
+
+		ret = vgpu_shutdown_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_shutdown callback failed  %d \n",
+					 __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+struct class_attribute vgpu_class_attrs[] = {
+	__ATTR_WO(vgpu_start),
+	__ATTR_WO(vgpu_shutdown),
+	__ATTR_NULL
+};
+
+int vgpu_create_pci_device_files(struct pci_dev *dev)
+{
+	int retval;
+
+	retval = sysfs_create_file(&dev->dev.kobj,
+				   &dev_attr_vgpu_supported_types.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_supported_types sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs entry\n");
+		return retval;
+	}
+
+	return 0;
+}
+
+
+void vgpu_remove_pci_device_files(struct pci_dev *dev)
+{
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+}
diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
new file mode 100644
index 0000000..35158ef
--- /dev/null
+++ b/drivers/vgpu/vgpu_private.h
@@ -0,0 +1,36 @@
+/*
+ * VGPU internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_PRIVATE_H
+#define VGPU_PRIVATE_H
+
+struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance);
+
+int  create_vgpu_device(struct pci_dev *pdev, uuid_le uuid, uint32_t instance,
+		       char *vgpu_params);
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
+
+int  vgpu_bus_register(void);
+void vgpu_bus_unregister(void);
+
+/* Function prototypes for vgpu_sysfs */
+
+extern struct class_attribute vgpu_class_attrs[];
+extern const struct attribute_group *vgpu_dev_groups[];
+
+int  vgpu_create_pci_device_files(struct pci_dev *dev);
+void vgpu_remove_pci_device_files(struct pci_dev *dev);
+
+void get_vgpu_supported_types(struct device *dev, char *str);
+int  vgpu_start_callback(struct vgpu_device *vgpu_dev);
+int  vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
+
+#endif /* VGPU_PRIVATE_H */
diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
new file mode 100644
index 0000000..03a77cf
--- /dev/null
+++ b/include/linux/vgpu.h
@@ -0,0 +1,216 @@
+/*
+ * VGPU definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_H
+#define VGPU_H
+
+// Common Data structures
+
+struct pci_bar_info {
+	uint64_t start;
+	uint64_t size;
+	uint32_t flags;
+};
+
+enum vgpu_emul_space_e {
+	vgpu_emul_space_config = 0, /*!< PCI configuration space */
+	vgpu_emul_space_io = 1,     /*!< I/O register space */
+	vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
+};
+
+struct gpu_device;
+
+/*
+ * VGPU device
+ */
+struct vgpu_device {
+	struct kref		kref;
+	struct device		dev;
+	struct gpu_device	*gpu_dev;
+	struct iommu_group	*group;
+#define DEVICE_NAME_LEN		(64)
+	char			dev_name[DEVICE_NAME_LEN];
+	uuid_le			uuid;
+	uint32_t		vgpu_instance;
+	struct device_attribute	*dev_attr_vgpu_status;
+	int			vgpu_device_status;
+
+	void			*driver_data;
+
+	struct list_head	list;
+};
+
+
+/**
+ * struct gpu_device_ops - Structure to be registered for each physical GPU to
+ * register the device to vgpu module.
+ *
+ * @owner:			The module owner.
+ * @dev_attr_groups:		Default attributes of the physical device.
+ * @vgpu_attr_groups:		Default attributes of the vGPU device.
+ * @vgpu_supported_config:	Called to get information about supported vgpu types.
+ *				@dev : pci device structure of physical GPU.
+ *				@config: should return string listing supported config
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_create:		Called to allocate basic resources in graphics
+ *				driver for a particular vgpu.
+ *				@dev: physical pci device structure on which vgpu
+ *				      should be created
+ *				@uuid: UUID of the VM for which this vGPU is intended
+ *				@instance: vgpu instance in that VM
+ *				@vgpu_params: extra parameters required by GPU driver.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_destroy:		Called to free resources in graphics driver for
+ *				a vgpu instance of that VM.
+ *				@dev: physical pci device structure to which
+ *				this vgpu points.
+ *				@uuid: UUID of the VM to which the vgpu belongs.
+ *				@instance: vgpu instance in that VM
+ *				Returns integer: success (0) or error (< 0)
+ *				If the VM is running and vgpu_destroy is called,
+ *				it means the vGPU is being hot-unplugged. Return
+ *				an error if the VM is running and the graphics
+ *				driver doesn't support vGPU hot-unplug.
+ * @vgpu_start:			Called to initiate the vGPU initialization
+ *				process in the graphics driver when the VM boots,
+ *				before QEMU starts.
+ *				@uuid: VM's UUID which is booting.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_shutdown:		Called to tear down vGPU-related resources for
+ *				the VM.
+ *				@uuid: UUID of the VM that is shutting down.
+ *				Returns integer: success (0) or error (< 0)
+ * @read:			Read emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: read buffer
+ *				@count: number of bytes to read
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				@pos: offset from base address.
+ *				Returns the number of bytes read on success, or an error.
+ * @write:			Write emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: write buffer
+ *				@count: number of bytes to be written
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				@pos: offset from base address.
+ *				Returns the number of bytes written on success, or an error.
+ * @vgpu_set_irqs:		Called to send the interrupt configuration
+ *				information that QEMU sets.
+ *				@vdev: vgpu device structure
+ *				@flags, index, start, count and *data : same as
+ *				that of struct vfio_irq_set of
+ *				VFIO_DEVICE_SET_IRQS API.
+ * @vgpu_bar_info:		Called to get BAR size and flags of vGPU device.
+ *				@vdev: vgpu device structure
+ *				@bar_index: BAR index
+ *				@bar_info: output, returns size and flags of
+ *				requested BAR
+ *				Returns integer: success (0) or error (< 0)
+ * @validate_map_request:	Validate remap pfn request
+ *				@vdev: vgpu device structure
+ *				@virtaddr: target user address to start at
+ *				@pfn: physical address of kernel memory, GPU
+ *				driver can change if required.
+ *				@size: size of map area, GPU driver can change
+ *				the size of map area if desired.
+ *				@prot: page protection flags for this mapping,
+ *				GPU driver can change, if required.
+ *				Returns integer: success (0) or error (< 0)
+ *
+ * A physical GPU that supports vGPU should be registered with the vgpu module
+ * using the gpu_device_ops structure.
+ */
+
+struct gpu_device_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **vgpu_attr_groups;
+
+	int	(*vgpu_supported_config)(struct pci_dev *dev, char *config);
+	int     (*vgpu_create)(struct pci_dev *dev, uuid_le uuid,
+			       uint32_t instance, char *vgpu_params);
+	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le uuid,
+			        uint32_t instance);
+
+	int     (*vgpu_start)(uuid_le uuid);
+	int     (*vgpu_shutdown)(uuid_le uuid);
+
+	ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
+				 unsigned index, unsigned start, unsigned count,
+				 void *data);
+	int	(*vgpu_bar_info)(struct vgpu_device *vdev, int bar_index,
+				 struct pci_bar_info *bar_info);
+	int	(*validate_map_request)(struct vgpu_device *vdev,
+					unsigned long virtaddr,
+					unsigned long *pfn, unsigned long *size,
+					pgprot_t *prot);
+};
+
+/*
+ * Physical GPU
+ */
+struct gpu_device {
+	struct pci_dev                  *dev;
+	const struct gpu_device_ops     *ops;
+	struct list_head                gpu_next;
+};
+
+/**
+ * struct vgpu_driver - vGPU device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @driver: device driver structure
+ *
+ **/
+struct vgpu_driver {
+	const char *name;
+	int  (*probe)  (struct device *dev);
+	void (*remove) (struct device *dev);
+	struct device_driver	driver;
+};
+
+static inline struct vgpu_driver *to_vgpu_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct vgpu_driver, driver) : NULL;
+}
+
+static inline struct vgpu_device *to_vgpu_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct vgpu_device, dev) : NULL;
+}
+
+extern struct bus_type vgpu_bus_type;
+
+#define dev_is_vgpu(d) ((d)->bus == &vgpu_bus_type)
+
+extern int  vgpu_register_device(struct pci_dev *dev,
+				 const struct gpu_device_ops *ops);
+extern void vgpu_unregister_device(struct pci_dev *dev);
+
+extern int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
+extern void vgpu_unregister_driver(struct vgpu_driver *drv);
+
+extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
+				uint32_t len, uint32_t flags);
+extern int vgpu_dma_do_translate(dma_addr_t * gfn_buffer, uint32_t count);
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
+
+#endif /* VGPU_H */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 154+ messages in thread

* [Qemu-devel] [RFC PATCH v3 1/3] vGPU Core driver
@ 2016-05-02 18:40   ` Kirti Wankhede
  0 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-02 18:40 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	Kirti Wankhede

Design for vGPU Driver:
Main purpose of vGPU driver is to provide a common interface for vGPU
management that can be used by differnt GPU drivers.

This module would provide a generic interface to create the device, add
it to vGPU bus, add device to IOMMU group and then add it to vfio group.

High Level block diagram:

+--------------+    vgpu_register_driver()+---------------+
|     __init() +------------------------->+               |
|              |                          |               |
|              +<-------------------------+    vgpu.ko    |
| vgpu_vfio.ko |   probe()/remove()       |               |
|              |                +---------+               +---------+
+--------------+                |         +-------+-------+         |
                                |                 ^                 |
                                | callback        |                 |
                                |         +-------+--------+        |
                                |         |vgpu_register_device()   |
                                |         |                |        |
                                +---^-----+-----+    +-----+------+-+
                                    | nvidia.ko |    |  i915.ko   |
                                    |           |    |            |
                                    +-----------+    +------------+

vGPU driver provides two types of registration interfaces:
1. Registration interface for vGPU bus driver:

/**
  * struct vgpu_driver - vGPU device driver
  * @name: driver name
  * @probe: called when new device created
  * @remove: called when device removed
  * @driver: device driver structure
  *
  **/
struct vgpu_driver {
         const char *name;
         int  (*probe)  (struct device *dev);
         void (*remove) (struct device *dev);
         struct device_driver    driver;
};

int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
void vgpu_unregister_driver(struct vgpu_driver *drv);

VFIO bus driver for vgpu, should use this interface to register with
vGPU driver. With this, VFIO bus driver for vGPU devices is responsible
to add vGPU device to VFIO group.

2. GPU driver interface
GPU driver interface provides GPU driver the set APIs to manage GPU driver
related work in their own driver. APIs are to:
- vgpu_supported_config: provide supported configuration list by the GPU.
- vgpu_create: to allocate basic resouces in GPU driver for a vGPU device.
- vgpu_destroy: to free resources in GPU driver during vGPU device destroy.
- vgpu_start: to initiate vGPU initialization process from GPU driver when VM
  boots and before QEMU starts.
- vgpu_shutdown: to teardown vGPU resources during VM teardown.
- read : read emulation callback.
- write: write emulation callback.
- vgpu_set_irqs: send interrupt configuration information that QEMU sets.
- vgpu_bar_info: to provice BAR size and its flags for the vGPU device.
- validate_map_request: to validate remap pfn request.

This registration interface should be used by GPU drivers to register
each physical device to vGPU driver.

Updated this patch with couple of more functions in GPU driver interface
which were discussed during v1 version of this RFC.

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I1c13c411f61b7b2e750e85adfe1b097f9fd218b9
---
 drivers/Kconfig             |    2 +
 drivers/Makefile            |    1 +
 drivers/vgpu/Kconfig        |   21 ++
 drivers/vgpu/Makefile       |    4 +
 drivers/vgpu/vgpu-core.c    |  424 +++++++++++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu-driver.c  |  136 ++++++++++++++
 drivers/vgpu/vgpu-sysfs.c   |  365 +++++++++++++++++++++++++++++++++++++
 drivers/vgpu/vgpu_private.h |   36 ++++
 include/linux/vgpu.h        |  216 ++++++++++++++++++++++
 9 files changed, 1205 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vgpu-core.c
 create mode 100644 drivers/vgpu/vgpu-driver.c
 create mode 100644 drivers/vgpu/vgpu-sysfs.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 include/linux/vgpu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339..5fd9eae 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 8f5d076..36f1110 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
 obj-$(CONFIG_VFIO)		+= vfio/
+obj-$(CONFIG_VFIO)		+= vgpu/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
new file mode 100644
index 0000000..792eb48
--- /dev/null
+++ b/drivers/vgpu/Kconfig
@@ -0,0 +1,21 @@
+
+menuconfig VGPU
+    tristate "VGPU driver framework"
+    depends on VFIO
+    select VGPU_VFIO
+    help
+        VGPU provides a framework to virtualize GPU without SR-IOV cap
+        See Documentation/vgpu.txt for more details.
+
+        If you don't know what do here, say N.
+
+config VGPU
+    tristate
+    depends on VFIO
+    default n
+
+config VGPU_VFIO
+    tristate
+    depends on VGPU
+    default n
+
diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
new file mode 100644
index 0000000..f5be980
--- /dev/null
+++ b/drivers/vgpu/Makefile
@@ -0,0 +1,4 @@
+
+vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
+
+obj-$(CONFIG_VGPU)			+= vgpu.o
diff --git a/drivers/vgpu/vgpu-core.c b/drivers/vgpu/vgpu-core.c
new file mode 100644
index 0000000..1a7d274
--- /dev/null
+++ b/drivers/vgpu/vgpu-core.c
@@ -0,0 +1,424 @@
+/*
+ * VGPU Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"NVIDIA Corporation"
+#define DRIVER_DESC	"VGPU Core Driver"
+
+/*
+ * #defines
+ */
+
+#define VGPU_CLASS_NAME		"vgpu"
+
+/*
+ * Global Structures
+ */
+
+static struct vgpu {
+	struct list_head    vgpu_devices_list;
+	struct mutex        vgpu_devices_lock;
+	struct list_head    gpu_devices_list;
+	struct mutex        gpu_devices_lock;
+} vgpu;
+
+static struct class vgpu_class;
+
+/*
+ * Functions
+ */
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if (vdev->group) {
+			if (iommu_group_id(vdev->group) == iommu_group_id(group)) {
+				mutex_unlock(&vgpu.vgpu_devices_lock);
+				return vdev;
+			}
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+EXPORT_SYMBOL_GPL(get_vgpu_device_from_group);
+
+static int vgpu_add_attribute_group(struct device *dev,
+			            const struct attribute_group **groups)
+{
+        return sysfs_create_groups(&dev->kobj, groups);
+}
+
+static void vgpu_remove_attribute_group(struct device *dev,
+			                const struct attribute_group **groups)
+{
+        sysfs_remove_groups(&dev->kobj, groups);
+}
+
+int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev, *tmp;
+
+	if (!dev)
+		return -EINVAL;
+
+        gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL);
+        if (!gpu_dev)
+                return -ENOMEM;
+
+	gpu_dev->dev = dev;
+        gpu_dev->ops = ops;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+
+        /* Check for duplicates */
+        list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) {
+                if (tmp->dev == dev) {
+			ret = -EINVAL;
+			goto add_error;
+                }
+        }
+
+	ret = vgpu_create_pci_device_files(dev);
+	if (ret)
+		goto add_error;
+
+	ret = vgpu_add_attribute_group(&dev->dev, ops->dev_attr_groups);
+	if (ret)
+		goto add_group_error;
+
+        list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list);
+
+	printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n",
+			 dev->vendor, dev->device, dev->class);
+        mutex_unlock(&vgpu.gpu_devices_lock);
+
+        return 0;
+
+add_group_error:
+	vgpu_remove_pci_device_files(dev);
+add_error:
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	kfree(gpu_dev);
+	return ret;
+
+}
+EXPORT_SYMBOL(vgpu_register_device);
+
+void vgpu_unregister_device(struct pci_dev *dev)
+{
+        struct gpu_device *gpu_dev;
+
+        mutex_lock(&vgpu.gpu_devices_lock);
+        list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		struct vgpu_device *vdev = NULL;
+
+                if (gpu_dev->dev != dev)
+			continue;
+
+		printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, class 0x%x\n",
+				dev->vendor, dev->device, dev->class);
+
+		list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+			if (vdev->gpu_dev != gpu_dev)
+				continue;
+			destroy_vgpu_device(vdev);
+		}
+		vgpu_remove_attribute_group(&dev->dev, gpu_dev->ops->dev_attr_groups);
+		vgpu_remove_pci_device_files(dev);
+		list_del(&gpu_dev->gpu_next);
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		kfree(gpu_dev);
+		return;
+        }
+        mutex_unlock(&vgpu.gpu_devices_lock);
+}
+EXPORT_SYMBOL(vgpu_unregister_device);
+
+/*
+ * Helper Functions
+ */
+
+static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char *name)
+{
+	struct vgpu_device *vgpu_dev = NULL;
+
+	vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL);
+	if (!vgpu_dev)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&vgpu_dev->kref);
+	memcpy(&vgpu_dev->uuid, &uuid, sizeof(uuid_le));
+	vgpu_dev->vgpu_instance = instance;
+	strcpy(vgpu_dev->dev_name, name);
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list);
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+
+	return vgpu_dev;
+}
+
+static void vgpu_device_free(struct vgpu_device *vgpu_dev)
+{
+	if (vgpu_dev) {
+		mutex_lock(&vgpu.vgpu_devices_lock);
+		list_del(&vgpu_dev->list);
+		mutex_unlock(&vgpu.vgpu_devices_lock);
+		kfree(vgpu_dev);
+	}
+	return;
+}
+
+struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance)
+{
+	struct vgpu_device *vdev = NULL;
+
+	mutex_lock(&vgpu.vgpu_devices_lock);
+	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
+		if ((uuid_le_cmp(vdev->uuid, uuid) == 0) &&
+		    (vdev->vgpu_instance == instance)) {
+			mutex_unlock(&vgpu.vgpu_devices_lock);
+			return vdev;
+		}
+	}
+	mutex_unlock(&vgpu.vgpu_devices_lock);
+	return NULL;
+}
+
+static void vgpu_device_release(struct device *dev)
+{
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	vgpu_device_free(vgpu_dev);
+}
+
+int create_vgpu_device(struct pci_dev *pdev, uuid_le uuid, uint32_t instance, char *vgpu_params)
+{
+	char name[64];
+	int numChar = 0;
+	int retval = 0;
+	struct vgpu_device *vgpu_dev = NULL;
+	struct gpu_device *gpu_dev;
+
+	printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__);
+
+	numChar = sprintf(name, "%pUb-%d", uuid.b, instance);
+	name[numChar] = '\0';
+
+	vgpu_dev = vgpu_device_alloc(uuid, instance, name);
+	if (IS_ERR(vgpu_dev)) {
+		return PTR_ERR(vgpu_dev);
+	}
+
+	vgpu_dev->dev.parent  = &pdev->dev;
+	vgpu_dev->dev.bus     = &vgpu_bus_type;
+	vgpu_dev->dev.release = vgpu_device_release;
+	dev_set_name(&vgpu_dev->dev, "%s", name);
+
+	retval = device_register(&vgpu_dev->dev);
+	if (retval)
+		goto create_failed1;
+
+	printk(KERN_INFO "UUID %pUb \n", vgpu_dev->uuid.b);
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (gpu_dev->dev != pdev)
+			continue;
+
+		vgpu_dev->gpu_dev = gpu_dev;
+		if (gpu_dev->ops->vgpu_create) {
+			retval = gpu_dev->ops->vgpu_create(pdev, vgpu_dev->uuid,
+							   instance, vgpu_params);
+			if (retval) {
+				mutex_unlock(&vgpu.gpu_devices_lock);
+				goto create_failed2;
+			}
+		}
+		break;
+	}
+	if (!vgpu_dev->gpu_dev) {
+		retval = -EINVAL;
+		mutex_unlock(&vgpu.gpu_devices_lock);
+		goto create_failed2;
+	}
+
+	mutex_unlock(&vgpu.gpu_devices_lock);
+
+	retval = vgpu_add_attribute_group(&vgpu_dev->dev, gpu_dev->ops->vgpu_attr_groups);
+	if (retval)
+		goto create_attr_error;
+
+	return retval;
+
+create_attr_error:
+	if (gpu_dev->ops->vgpu_destroy) {
+		int ret = 0;
+		ret = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
+						 vgpu_dev->uuid,
+						 vgpu_dev->vgpu_instance);
+	}
+
+create_failed2:
+	device_unregister(&vgpu_dev->dev);
+
+create_failed1:
+	vgpu_device_free(vgpu_dev);
+
+	return retval;
+}
+
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev)
+{
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name);
+	if (gpu_dev->ops->vgpu_destroy) {
+		int retval = 0;
+		retval = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
+						    vgpu_dev->uuid,
+						    vgpu_dev->vgpu_instance);
+	/* if vendor driver doesn't return success that means vendor driver doesn't
+	 * support hot-unplug */
+		if (retval)
+			return;
+	}
+
+	vgpu_remove_attribute_group(&vgpu_dev->dev, gpu_dev->ops->vgpu_attr_groups);
+	device_unregister(&vgpu_dev->dev);
+}
+
+void get_vgpu_supported_types(struct device *dev, char *str)
+{
+	struct gpu_device *gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
+		if (&gpu_dev->dev->dev == dev) {
+			if (gpu_dev->ops->vgpu_supported_config)
+				gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str);
+			break;
+		}
+	}
+	mutex_unlock(&vgpu.gpu_devices_lock);
+}
+
+int vgpu_start_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (gpu_dev->ops->vgpu_start)
+		ret = gpu_dev->ops->vgpu_start(vgpu_dev->uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev)
+{
+	int ret = 0;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	mutex_lock(&vgpu.gpu_devices_lock);
+	if (gpu_dev->ops->vgpu_shutdown)
+		ret = gpu_dev->ops->vgpu_shutdown(vgpu_dev->uuid);
+	mutex_unlock(&vgpu.gpu_devices_lock);
+	return ret;
+}
+
+char *vgpu_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev));
+}
+
+static void release_vgpubus_dev(struct device *dev)
+{
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	destroy_vgpu_device(vgpu_dev);
+}
+
+static struct class vgpu_class = {
+	.name		= VGPU_CLASS_NAME,
+	.owner		= THIS_MODULE,
+	.class_attrs	= vgpu_class_attrs,
+	.dev_groups	= vgpu_dev_groups,
+	.devnode	= vgpu_devnode,
+	.dev_release    = release_vgpubus_dev,
+};
+
+static int __init vgpu_init(void)
+{
+	int rc = 0;
+
+	memset(&vgpu, 0, sizeof(vgpu));
+
+	mutex_init(&vgpu.vgpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
+	mutex_init(&vgpu.gpu_devices_lock);
+	INIT_LIST_HEAD(&vgpu.gpu_devices_list);
+
+	rc = class_register(&vgpu_class);
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu class\n");
+		goto failed1;
+	}
+
+	rc = vgpu_bus_register();
+	if (rc < 0) {
+		printk(KERN_ERR "Error: failed to register vgpu bus\n");
+		class_unregister(&vgpu_class);
+		goto failed1;
+	}
+
+	request_module_nowait("vgpu_vfio");
+
+failed1:
+	return rc;
+}
+
+static void __exit vgpu_exit(void)
+{
+	vgpu_bus_unregister();
+	class_unregister(&vgpu_class);
+}
+
+module_init(vgpu_init)
+module_exit(vgpu_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vgpu/vgpu-driver.c b/drivers/vgpu/vgpu-driver.c
new file mode 100644
index 0000000..c4c2e9f
--- /dev/null
+++ b/drivers/vgpu/vgpu-driver.c
@@ -0,0 +1,136 @@
+/*
+ * VGPU driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
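+/*
+ * Each vGPU device gets its own IOMMU group even though it is not backed by
+ * real IOMMU hardware: the VFIO core hands out devices per IOMMU group, so a
+ * dedicated group lets the vgpu_vfio driver expose the vGPU device through
+ * the usual VFIO group/device interface.
+ */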
+static int vgpu_device_attach_iommu(struct vgpu_device *vgpu_dev)
+{
+	int retval = 0;
+	struct iommu_group *group = NULL;
+
+	group = iommu_group_alloc();
+	if (IS_ERR(group)) {
+		printk(KERN_ERR "VGPU: failed to allocate group!\n");
+		return PTR_ERR(group);
+	}
+
+	retval = iommu_group_add_device(group, &vgpu_dev->dev);
+	if (retval) {
+		printk(KERN_ERR "VGPU: failed to add dev to group!\n");
+		iommu_group_put(group);
+		return retval;
+	}
+
+	vgpu_dev->group = group;
+
+	printk(KERN_INFO "VGPU: group_id = %d\n", iommu_group_id(group));
+	return retval;
+}
+
+static void vgpu_device_detach_iommu(struct vgpu_device *vgpu_dev)
+{
+	iommu_group_put(vgpu_dev->dev.iommu_group);
+	iommu_group_remove_device(&vgpu_dev->dev);
+	printk(KERN_INFO "VGPU: detaching iommu\n");
+}
+
+static int vgpu_device_probe(struct device *dev)
+{
+	struct vgpu_driver *drv = to_vgpu_driver(dev->driver);
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	int status = 0;
+
+	status = vgpu_device_attach_iommu(vgpu_dev);
+	if (status) {
+		printk(KERN_ERR "Failed to attach IOMMU\n");
+		return status;
+	}
+
+	if (drv && drv->probe) {
+		status = drv->probe(dev);
+	}
+
+	return status;
+}
+
+static int vgpu_device_remove(struct device *dev)
+{
+	struct vgpu_driver *drv = to_vgpu_driver(dev->driver);
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	int status = 0;
+
+	if (drv && drv->remove) {
+		drv->remove(dev);
+	}
+
+	vgpu_device_detach_iommu(vgpu_dev);
+
+	return status;
+}
+
+struct bus_type vgpu_bus_type = {
+	.name		= "vgpu",
+	.probe		= vgpu_device_probe,
+	.remove		= vgpu_device_remove,
+};
+EXPORT_SYMBOL_GPL(vgpu_bus_type);
+
+/**
+ * vgpu_register_driver - register a new vGPU driver
+ * @drv: the driver to register
+ * @owner: owner module of the driver to register
+ *
+ * Returns a negative value on error, otherwise 0.
+ */
+int vgpu_register_driver(struct vgpu_driver *drv, struct module *owner)
+{
+	/* initialize common driver fields */
+	drv->driver.name = drv->name;
+	drv->driver.bus = &vgpu_bus_type;
+	drv->driver.owner = owner;
+
+	/* register with core */
+	return driver_register(&drv->driver);
+}
+EXPORT_SYMBOL(vgpu_register_driver);
+
+/**
+ * vgpu_unregister_driver - unregister vGPU driver
+ * @drv: the driver to unregister
+ *
+ */
+void vgpu_unregister_driver(struct vgpu_driver *drv)
+{
+	driver_unregister(&drv->driver);
+}
+EXPORT_SYMBOL(vgpu_unregister_driver);
+
+int vgpu_bus_register(void)
+{
+	return bus_register(&vgpu_bus_type);
+}
+
+void vgpu_bus_unregister(void)
+{
+	bus_unregister(&vgpu_bus_type);
+}
diff --git a/drivers/vgpu/vgpu-sysfs.c b/drivers/vgpu/vgpu-sysfs.c
new file mode 100644
index 0000000..b740f9a
--- /dev/null
+++ b/drivers/vgpu/vgpu-sysfs.c
@@ -0,0 +1,365 @@
+/*
+ * File attributes for vGPU devices
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/sysfs.h>
+#include <linux/ctype.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+/* Prototypes */
+
+static ssize_t vgpu_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf);
+static DEVICE_ATTR_RO(vgpu_supported_types);
+
+static ssize_t vgpu_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_create);
+
+static ssize_t vgpu_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count);
+static DEVICE_ATTR_WO(vgpu_destroy);
+
+
+/* Static functions */
+
+static bool is_uuid_sep(char sep)
+{
+	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
+		return true;
+	return false;
+}
+
+
+static int uuid_parse(const char *str, uuid_le *uuid)
+{
+	int i;
+
+	if (strlen(str) < 36)
+		return -EINVAL;
+
+	for (i = 0; i < 16; i++) {
+		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
+			printk(KERN_ERR "%s err", __FUNCTION__);
+			return -EINVAL;
+		}
+
+		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
+		str += 2;
+		if (is_uuid_sep(*str))
+			str++;
+	}
+
+	return 0;
+}
+
+
+/* Functions */
+static ssize_t vgpu_supported_types_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	char *str;
+	ssize_t n;
+
+	str = kzalloc(sizeof(*str) * 512, GFP_KERNEL);
+	if (!str)
+		return -ENOMEM;
+
+	get_vgpu_supported_types(dev, str);
+
+	n = sprintf(buf, "%s\n", str);
+	kfree(str);
+
+	return n;
+}
+
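+/*
+ * The "vgpu_create" attribute expects "<UUID>:<instance>:<vendor params>" and
+ * "vgpu_destroy" expects "<UUID>:<instance>". Illustrative usage (a sketch:
+ * the PCI address, UUID and vendor parameter string are examples only, and
+ * the sysfs path assumes these attributes sit on the physical GPU's PCI
+ * device node, where vgpu_create_pci_device_files() creates them):
+ *
+ *   echo "a0b1c2d3-e4f5-6789-abcd-ef0123456789:0:vgpu_type=small" > /sys/bus/pci/devices/0000:86:00.0/vgpu_create
+ *   echo "a0b1c2d3-e4f5-6789-abcd-ef0123456789:0" > /sys/bus/pci/devices/0000:86:00.0/vgpu_destroy
+ */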
+static ssize_t vgpu_create_store(struct device *dev,
+				 struct device_attribute *attr,
+				 const char *buf, size_t count)
+{
+	char *str, *pstr;
+	char *uuid_str, *instance_str, *vgpu_params = NULL;
+	uuid_le uuid;
+	uint32_t instance;
+	struct pci_dev *pdev;
+	int ret = 0;
+
+	pstr = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (!str) {
+		printk(KERN_ERR "%s vgpu instance not specified %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if ((instance_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty instance or string %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	instance = (unsigned int)simple_strtoul(instance_str, NULL, 0);
+
+	if (!str) {
+		printk(KERN_ERR "%s vgpu params not specified %s \n",
+				 __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	vgpu_params = kstrdup(str, GFP_KERNEL);
+
+	if (!vgpu_params) {
+		printk(KERN_ERR "%s vgpu params allocation failed \n",
+				 __FUNCTION__);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto create_error;
+	}
+
+	if (dev_is_pci(dev)) {
+		pdev = to_pci_dev(dev);
+
+		if (create_vgpu_device(pdev, uuid, instance, vgpu_params) < 0) {
+			printk(KERN_ERR "%s vgpu create error\n", __FUNCTION__);
+			ret = -EINVAL;
+			goto create_error;
+		}
+		ret = count;
+	} else {
+		/* vgpu devices can only be created on a physical PCI GPU */
+		ret = -EINVAL;
+	}
+
+create_error:
+	kfree(vgpu_params);
+	kfree(pstr);
+	return ret;
+}
+
+static ssize_t vgpu_destroy_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	char *uuid_str, *str, *pstr;
+	uuid_le uuid;
+	unsigned int instance;
+	struct vgpu_device *vgpu_dev = NULL;
+	ssize_t ret = count;
+
+	pstr = str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!str)
+		return -ENOMEM;
+
+	if ((uuid_str = strsep(&str, ":")) == NULL) {
+		printk(KERN_ERR "%s Empty UUID or string %s\n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto destroy_exit;
+	}
+
+	if (str == NULL) {
+		printk(KERN_ERR "%s instance not specified %s\n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto destroy_exit;
+	}
+
+	instance = (unsigned int)simple_strtoul(str, NULL, 0);
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error %s\n", __FUNCTION__, buf);
+		ret = -EINVAL;
+		goto destroy_exit;
+	}
+
+	printk(KERN_INFO "%s UUID %pUb - %d\n", __FUNCTION__, uuid.b, instance);
+
+	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, instance);
+
+	if (vgpu_dev)
+		destroy_vgpu_device(vgpu_dev);
+
+destroy_exit:
+	/* strsep() advances str, so free via the saved original pointer */
+	kfree(pstr);
+	return ret;
+}
+
+static ssize_t
+vgpu_uuid_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+
+	if (vgpu_dev)
+		return sprintf(buf, "%pUb\n", vgpu_dev->uuid.b);
+
+	return sprintf(buf, "\n");
+}
+
+static DEVICE_ATTR_RO(vgpu_uuid);
+
+static ssize_t
+vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+
+	if (vgpu_dev && vgpu_dev->group)
+		return sprintf(buf, "%d\n", iommu_group_id(vgpu_dev->group));
+
+	return sprintf(buf, "\n");
+}
+
+static DEVICE_ATTR_RO(vgpu_group_id);
+
+
+static struct attribute *vgpu_dev_attrs[] = {
+	&dev_attr_vgpu_uuid.attr,
+	&dev_attr_vgpu_group_id.attr,
+	NULL,
+};
+
+static const struct attribute_group vgpu_dev_group = {
+	.attrs = vgpu_dev_attrs,
+};
+
+const struct attribute_group *vgpu_dev_groups[] = {
+	&vgpu_dev_group,
+	NULL,
+};
+
+
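+/*
+ * Class-level "vgpu_start" / "vgpu_shutdown" attributes. Illustrative usage
+ * (a sketch: the path assumes VGPU_CLASS_NAME is "vgpu" and the UUID is only
+ * an example):
+ *
+ *   echo "a0b1c2d3-e4f5-6789-abcd-ef0123456789" > /sys/class/vgpu/vgpu_start
+ *   echo "a0b1c2d3-e4f5-6789-abcd-ef0123456789" > /sys/class/vgpu/vgpu_shutdown
+ *
+ * Both handlers look up instance 0 of the given UUID and forward the request
+ * to the vendor driver's vgpu_start/vgpu_shutdown callback.
+ */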
+ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr,
+			 const char *buf, size_t count)
+{
+	char *uuid_str;
+	uuid_le uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error %s\n", __FUNCTION__, buf);
+		kfree(uuid_str);
+		return -EINVAL;
+	}
+
+	/* the duplicated buffer is only needed for parsing */
+	kfree(uuid_str);
+
+	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, 0);
+
+	if (vgpu_dev && dev_is_vgpu(&vgpu_dev->dev)) {
+		kobject_uevent(&vgpu_dev->dev.kobj, KOBJ_ONLINE);
+
+		ret = vgpu_start_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_start callback failed %d\n",
+					 __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr,
+			    const char *buf, size_t count)
+{
+	char *uuid_str;
+	uuid_le uuid;
+	struct vgpu_device *vgpu_dev = NULL;
+	int ret;
+
+	uuid_str = kstrndup(buf, count, GFP_KERNEL);
+
+	if (!uuid_str)
+		return -ENOMEM;
+
+	if (uuid_parse(uuid_str, &uuid) < 0) {
+		printk(KERN_ERR "%s UUID parse error %s\n", __FUNCTION__, buf);
+		kfree(uuid_str);
+		return -EINVAL;
+	}
+
+	/* the duplicated buffer is only needed for parsing */
+	kfree(uuid_str);
+
+	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, 0);
+
+	if (vgpu_dev && dev_is_vgpu(&vgpu_dev->dev)) {
+		kobject_uevent(&vgpu_dev->dev.kobj, KOBJ_OFFLINE);
+
+		ret = vgpu_shutdown_callback(vgpu_dev);
+		if (ret < 0) {
+			printk(KERN_ERR "%s vgpu_shutdown callback failed %d\n",
+					 __FUNCTION__, ret);
+			return ret;
+		}
+	}
+
+	return count;
+}
+
+struct class_attribute vgpu_class_attrs[] = {
+	__ATTR_WO(vgpu_start),
+	__ATTR_WO(vgpu_shutdown),
+	__ATTR_NULL
+};
+
+int vgpu_create_pci_device_files(struct pci_dev *dev)
+{
+	int retval;
+
+	retval = sysfs_create_file(&dev->dev.kobj,
+				   &dev_attr_vgpu_supported_types.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_supported_types sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs entry\n");
+		return retval;
+	}
+
+	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+	if (retval) {
+		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs entry\n");
+		return retval;
+	}
+
+	return 0;
+}
+
+
+void vgpu_remove_pci_device_files(struct pci_dev *dev)
+{
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
+	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
+}
diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
new file mode 100644
index 0000000..35158ef
--- /dev/null
+++ b/drivers/vgpu/vgpu_private.h
@@ -0,0 +1,36 @@
+/*
+ * VGPU internal definitions
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_PRIVATE_H
+#define VGPU_PRIVATE_H
+
+struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance);
+
+int  create_vgpu_device(struct pci_dev *pdev, uuid_le uuid, uint32_t instance,
+		       char *vgpu_params);
+void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
+
+int  vgpu_bus_register(void);
+void vgpu_bus_unregister(void);
+
+/* Function prototypes for vgpu_sysfs */
+
+extern struct class_attribute vgpu_class_attrs[];
+extern const struct attribute_group *vgpu_dev_groups[];
+
+int  vgpu_create_pci_device_files(struct pci_dev *dev);
+void vgpu_remove_pci_device_files(struct pci_dev *dev);
+
+void get_vgpu_supported_types(struct device *dev, char *str);
+int  vgpu_start_callback(struct vgpu_device *vgpu_dev);
+int  vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
+
+#endif /* VGPU_PRIVATE_H */
diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
new file mode 100644
index 0000000..03a77cf
--- /dev/null
+++ b/include/linux/vgpu.h
@@ -0,0 +1,216 @@
+/*
+ * VGPU definition
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author:
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VGPU_H
+#define VGPU_H
+
+// Common Data structures
+
+struct pci_bar_info {
+	uint64_t start;
+	uint64_t size;
+	uint32_t flags;
+};
+
+enum vgpu_emul_space_e {
+	vgpu_emul_space_config = 0, /*!< PCI configuration space */
+	vgpu_emul_space_io = 1,     /*!< I/O register space */
+	vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
+};
+
+struct gpu_device;
+
+/*
+ * VGPU device
+ */
+struct vgpu_device {
+	struct kref		kref;
+	struct device		dev;
+	struct gpu_device	*gpu_dev;
+	struct iommu_group	*group;
+#define DEVICE_NAME_LEN		(64)
+	char			dev_name[DEVICE_NAME_LEN];
+	uuid_le			uuid;
+	uint32_t		vgpu_instance;
+	struct device_attribute	*dev_attr_vgpu_status;
+	int			vgpu_device_status;
+
+	void			*driver_data;
+
+	struct list_head	list;
+};
+
+
+/**
+ * struct gpu_device_ops - Structure to be registered for each physical GPU to
+ * register the device to vgpu module.
+ *
+ * @owner:			The module owner.
+ * @dev_attr_groups:		Default attributes of the physical device.
+ * @vgpu_attr_groups:		Default attributes of the vGPU device.
+ * @vgpu_supported_config:	Called to get information about supported vgpu types.
+ *				@dev : pci device structure of physical GPU.
+ *				@config: should return string listing supported config
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_create:		Called to allocate basic resources in the graphics
+ *				driver for a particular vgpu.
+ *				@dev: physical pci device structure on which the
+ *				      vgpu should be created
+ *				@uuid: uuid of the VM for which the vgpu is intended
+ *				@instance: vgpu instance in that VM
+ *				@vgpu_params: extra parameters required by GPU driver.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_destroy:		Called to free resources in graphics driver for
+ *				a vgpu instance of that VM.
+ *				@dev: physical pci device structure to which
+ *				this vgpu points.
+ *				@uuid: uuid of the VM to which the vgpu belongs.
+ *				@instance: vgpu instance in that VM
+ *				Returns integer: success (0) or error (< 0)
+ *				If the VM is running when vgpu_destroy is called,
+ *				the vGPU is being hot-unplugged. Return an error
+ *				if the VM is running and the graphics driver
+ *				doesn't support vGPU hotplug.
+ * @vgpu_start:			Called to initiate the vGPU initialization
+ *				process in the graphics driver when the VM boots,
+ *				before QEMU starts.
+ *				@uuid: UUID of the VM that is booting.
+ *				Returns integer: success (0) or error (< 0)
+ * @vgpu_shutdown:		Called to tear down vGPU-related resources for
+ *				the VM.
+ *				@uuid: UUID of the VM that is shutting down.
+ *				Returns integer: success (0) or error (< 0)
+ * @read:			Read emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: read buffer
+ *				@count: number bytes to read
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				@pos: offset from base address.
+ *				Returns number of bytes read on success, or an error.
+ * @write:			Write emulation callback
+ *				@vdev: vgpu device structure
+ *				@buf: write buffer
+ *				@count: number bytes to be written
+ *				@address_space: specifies for which address space
+ *				the request is: pci_config_space, IO register
+ *				space or MMIO space.
+ *				@pos: offset from base address.
+ *				Returns number of bytes written on success, or an error.
+ * @vgpu_set_irqs:		Called to pass the interrupt configuration
+ *				information that QEMU set.
+ *				@vdev: vgpu device structure
+ *				@flags, index, start, count and *data : same as
+ *				that of struct vfio_irq_set of
+ *				VFIO_DEVICE_SET_IRQS API.
+ * @vgpu_bar_info:		Called to get BAR size and flags of vGPU device.
+ *				@vdev: vgpu device structure
+ *				@bar_index: BAR index
+ *				@bar_info: output, returns size and flags of
+ *				requested BAR
+ *				Returns integer: success (0) or error (< 0)
+ * @validate_map_request:	Validate remap pfn request
+ *				@vdev: vgpu device structure
+ *				@virtaddr: target user address to start at
+ *				@pfn: physical address of kernel memory, GPU
+ *				driver can change if required.
+ *				@size: size of map area, GPU driver can change
+ *				the size of map area if desired.
+ *				@prot: page protection flags for this mapping,
+ *				GPU driver can change, if required.
+ *				Returns integer: success (0) or error (< 0)
+ *
+ * A physical GPU that supports vGPU should be registered with the vgpu module
+ * using a gpu_device_ops structure.
+ */
+
+struct gpu_device_ops {
+	struct module   *owner;
+	const struct attribute_group **dev_attr_groups;
+	const struct attribute_group **vgpu_attr_groups;
+
+	int	(*vgpu_supported_config)(struct pci_dev *dev, char *config);
+	int     (*vgpu_create)(struct pci_dev *dev, uuid_le uuid,
+			       uint32_t instance, char *vgpu_params);
+	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le uuid,
+			        uint32_t instance);
+
+	int     (*vgpu_start)(uuid_le uuid);
+	int     (*vgpu_shutdown)(uuid_le uuid);
+
+	ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
+			 uint32_t address_space, loff_t pos);
+	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
+				 unsigned index, unsigned start, unsigned count,
+				 void *data);
+	int	(*vgpu_bar_info)(struct vgpu_device *vdev, int bar_index,
+				 struct pci_bar_info *bar_info);
+	int	(*validate_map_request)(struct vgpu_device *vdev,
+					unsigned long virtaddr,
+					unsigned long *pfn, unsigned long *size,
+					pgprot_t *prot);
+};
+
+/*
+ * Physical GPU
+ */
+struct gpu_device {
+	struct pci_dev                  *dev;
+	const struct gpu_device_ops     *ops;
+	struct list_head                gpu_next;
+};
+
+/**
+ * struct vgpu_driver - vGPU device driver
+ * @name: driver name
+ * @probe: called when new device created
+ * @remove: called when device removed
+ * @driver: device driver structure
+ *
+ **/
+struct vgpu_driver {
+	const char *name;
+	int  (*probe)  (struct device *dev);
+	void (*remove) (struct device *dev);
+	struct device_driver	driver;
+};
+
+static inline struct vgpu_driver *to_vgpu_driver(struct device_driver *drv)
+{
+	return drv ? container_of(drv, struct vgpu_driver, driver) : NULL;
+}
+
+static inline struct vgpu_device *to_vgpu_device(struct device *dev)
+{
+	return dev ? container_of(dev, struct vgpu_device, dev) : NULL;
+}
+
+extern struct bus_type vgpu_bus_type;
+
+#define dev_is_vgpu(d) ((d)->bus == &vgpu_bus_type)
+
+extern int  vgpu_register_device(struct pci_dev *dev,
+				 const struct gpu_device_ops *ops);
+extern void vgpu_unregister_device(struct pci_dev *dev);
+
+extern int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
+extern void vgpu_unregister_driver(struct vgpu_driver *drv);
+
+extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
+				uint32_t len, uint32_t flags);
+extern int vgpu_dma_do_translate(dma_addr_t * gfn_buffer, uint32_t count);
+
+struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
+
+#endif /* VGPU_H */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 154+ messages in thread

* [RFC PATCH v3 2/3] VFIO driver for vGPU device
  2016-05-02 18:40 ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-02 18:40   ` Kirti Wankhede
  -1 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-02 18:40 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	Kirti Wankhede

The VFIO driver registers with the vGPU core driver. The vGPU core driver
creates a vGPU device and calls the probe routine of the vGPU VFIO driver,
which then adds the vGPU device to the VFIO core module.
The main aim of this module is to handle all VFIO APIs for each vGPU device.
Those are (a rough vendor-side sketch of the callbacks involved follows this
list):
- get region information from the GPU driver.
- trap and emulate the PCI config space and BAR regions.
- send interrupt configuration information to the GPU driver.
- mmap the mappable region with invalidated mappings, and remap the pfn on
  access from the fault handler.
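
A minimal, hypothetical vendor-side sketch of the registration these callbacks
land on is given below. Only gpu_device_ops and vgpu_register_device() come
from include/linux/vgpu.h in patch 1/3; the my_* names are made up for the
example:

  static const struct gpu_device_ops my_gpu_ops = {
          .owner         = THIS_MODULE,
          .vgpu_create   = my_vgpu_create,     /* allocate vendor resources */
          .vgpu_destroy  = my_vgpu_destroy,
          .vgpu_bar_info = my_vgpu_bar_info,   /* region information */
          .read          = my_vgpu_read,       /* config space / BAR emulation */
          .write         = my_vgpu_write,
          .vgpu_set_irqs = my_vgpu_set_irqs,   /* interrupt configuration */
  };

  static int my_gpu_pci_probe(struct pci_dev *pdev,
                              const struct pci_device_id *id)
  {
          /* ... vendor-specific setup ... */
          return vgpu_register_device(pdev, &my_gpu_ops);
  }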

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I949a6b499d2e98d9c3352ae579535a608729b223
---
 drivers/vgpu/Makefile    |    1 +
 drivers/vgpu/vgpu_vfio.c |  671 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 672 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/vgpu_vfio.c

diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
index f5be980..a0a2655 100644
--- a/drivers/vgpu/Makefile
+++ b/drivers/vgpu/Makefile
@@ -2,3 +2,4 @@
 vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
 
 obj-$(CONFIG_VGPU)			+= vgpu.o
+obj-$(CONFIG_VGPU_VFIO)                 += vgpu_vfio.o
diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c
new file mode 100644
index 0000000..460a4dc
--- /dev/null
+++ b/drivers/vgpu/vgpu_vfio.c
@@ -0,0 +1,671 @@
+/*
+ * VGPU VFIO device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ *     Author: Neo Jia <cjia@nvidia.com>
+ *	       Kirti Wankhede <kwankhede@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/poll.h>
+#include <linux/slab.h>
+#include <linux/cdev.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
+#include <linux/uuid.h>
+#include <linux/vfio.h>
+#include <linux/iommu.h>
+#include <linux/vgpu.h>
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC     "VGPU VFIO Driver"
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
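+
+/*
+ * The region index is encoded in the upper bits of the file offset, mirroring
+ * the vfio-pci convention. For example, an access at offset
+ * ((u64)VFIO_PCI_BAR2_REGION_INDEX << 40) + 0x10 is routed to BAR2, offset
+ * 0x10 within that region.
+ */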
+
+struct vfio_vgpu_device {
+	struct iommu_group *group;
+	struct vgpu_device *vgpu_dev;
+	int		    refcnt;
+	struct pci_bar_info bar_info[VFIO_PCI_NUM_REGIONS];
+	u8		    *vconfig;
+};
+
+static DEFINE_MUTEX(vfio_vgpu_lock);
+
+static int get_virtual_bar_info(struct vgpu_device *vgpu_dev,
+				struct pci_bar_info *bar_info,
+				int index)
+{
+	int ret = -EINVAL;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+	if (gpu_dev->ops->vgpu_bar_info)
+		ret = gpu_dev->ops->vgpu_bar_info(vgpu_dev, index, bar_info);
+	return ret;
+}
+
+static int vdev_read_base(struct vfio_vgpu_device *vdev)
+{
+	int index, pos;
+	u32 start_lo, start_hi;
+	u32 mem_type;
+
+	pos = PCI_BASE_ADDRESS_0;
+
+	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+		if (!vdev->bar_info[index].size)
+			continue;
+
+		start_lo = (*(u32 *)(vdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_MASK;
+		mem_type = (*(u32 *)(vdev->vconfig + pos)) &
+					PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+		switch (mem_type) {
+		case PCI_BASE_ADDRESS_MEM_TYPE_64:
+			start_hi = (*(u32 *)(vdev->vconfig + pos + 4));
+			pos += 4;
+			break;
+		case PCI_BASE_ADDRESS_MEM_TYPE_32:
+		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+			/* 1M mem BAR treated as 32-bit BAR */
+		default:
+			/* mem unknown type treated as 32-bit BAR */
+			start_hi = 0;
+			break;
+		}
+		pos += 4;
+		vdev->bar_info[index].start = ((u64)start_hi << 32) | start_lo;
+	}
+	return 0;
+}
+
+static int vgpu_dev_open(void *device_data)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = device_data;
+
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+
+	mutex_lock(&vfio_vgpu_lock);
+
+	if (!vdev->refcnt) {
+		u8 *vconfig;
+		int vconfig_size, index;
+
+		for (index = 0; index < VFIO_PCI_NUM_REGIONS; index++) {
+			ret = get_virtual_bar_info(vdev->vgpu_dev,
+						   &vdev->bar_info[index],
+						   index);
+			if (ret)
+				goto open_error;
+		}
+		vconfig_size = vdev->bar_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
+		if (!vconfig_size) {
+			ret = -EINVAL;
+			goto open_error;
+		}
+
+		vconfig = kzalloc(vconfig_size, GFP_KERNEL);
+		if (!vconfig) {
+			ret = -ENOMEM;
+			goto open_error;
+		}
+
+		vdev->vconfig = vconfig;
+	}
+
+	vdev->refcnt++;
+open_error:
+
+	mutex_unlock(&vfio_vgpu_lock);
+
+	if (ret)
+		module_put(THIS_MODULE);
+
+	return ret;
+}
+
+static void vgpu_dev_close(void *device_data)
+{
+	struct vfio_vgpu_device *vdev = device_data;
+
+	mutex_lock(&vfio_vgpu_lock);
+
+	vdev->refcnt--;
+	if (!vdev->refcnt) {
+		memset(&vdev->bar_info, 0, sizeof(vdev->bar_info));
+		kfree(vdev->vconfig);
+		vdev->vconfig = NULL;
+	}
+
+	mutex_unlock(&vfio_vgpu_lock);
+
+	module_put(THIS_MODULE);
+}
+
+static int vgpu_get_irq_count(struct vfio_vgpu_device *vdev, int irq_type)
+{
+	/* MSI-X is not supported for now */
+	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
+		return -1;
+
+	return 1;
+}
+
+static long vgpu_dev_unlocked_ioctl(void *device_data,
+		unsigned int cmd, unsigned long arg)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = device_data;
+	unsigned long minsz;
+
+	switch (cmd)
+	{
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+		printk(KERN_INFO "%s VFIO_DEVICE_GET_INFO cmd index ", __FUNCTION__);
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = VFIO_DEVICE_FLAGS_PCI;
+		info.num_regions = VFIO_PCI_NUM_REGIONS;
+		info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+
+	case VFIO_DEVICE_GET_REGION_INFO:
+	{
+		struct vfio_region_info info;
+
+		minsz = offsetofend(struct vfio_region_info, offset);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		printk(KERN_INFO "%s VFIO_DEVICE_GET_REGION_INFO cmd for region_index %d", __FUNCTION__, info.index);
+		switch (info.index) {
+		case VFIO_PCI_CONFIG_REGION_INDEX:
+		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = vdev->bar_info[info.index].size;
+			if (!info.size) {
+				info.flags = 0;
+				break;
+			}
+
+			info.flags = vdev->bar_info[info.index].flags;
+			break;
+		case VFIO_PCI_VGA_REGION_INDEX:
+			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+			info.size = 0xc0000;
+			info.flags = VFIO_REGION_INFO_FLAG_READ |
+				     VFIO_REGION_INFO_FLAG_WRITE;
+				break;
+
+		case VFIO_PCI_ROM_REGION_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	}
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	{
+		struct vfio_irq_info info;
+
+		printk(KERN_INFO "%s VFIO_DEVICE_GET_IRQ_INFO cmd", __FUNCTION__);
+		minsz = offsetofend(struct vfio_irq_info, count);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
+			return -EINVAL;
+
+		switch (info.index) {
+		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
+		case VFIO_PCI_REQ_IRQ_INDEX:
+			break;
+			/* pass thru to return error */
+		case VFIO_PCI_MSIX_IRQ_INDEX:
+		default:
+			return -EINVAL;
+		}
+
+		info.flags = VFIO_IRQ_INFO_EVENTFD;
+		info.count = vgpu_get_irq_count(vdev, info.index);
+
+		if (info.count == -1)
+			return -EINVAL;
+
+		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
+			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
+					VFIO_IRQ_INFO_AUTOMASKED);
+		else
+			info.flags |= VFIO_IRQ_INFO_NORESIZE;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+
+	case VFIO_DEVICE_SET_IRQS:
+	{
+		struct vfio_irq_set hdr;
+		struct gpu_device *gpu_dev = vdev->vgpu_dev->gpu_dev;
+		u8 *data = NULL;
+		int ret = 0;
+		minsz = offsetofend(struct vfio_irq_set, count);
+
+		if (copy_from_user(&hdr, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
+		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
+		    VFIO_IRQ_SET_ACTION_TYPE_MASK))
+			return -EINVAL;
+
+		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+			size_t size;
+			int max = vgpu_get_irq_count(vdev, hdr.index);
+
+			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
+				size = sizeof(uint8_t);
+			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
+				size = sizeof(int32_t);
+			else
+				return -EINVAL;
+
+			if (hdr.argsz - minsz < hdr.count * size ||
+			    hdr.start >= max || hdr.start + hdr.count > max)
+				return -EINVAL;
+
+			data = memdup_user((void __user *)(arg + minsz),
+					   hdr.count * size);
+			if (IS_ERR(data))
+				return PTR_ERR(data);
+		}
+
+		if (gpu_dev->ops->vgpu_set_irqs) {
+			ret = gpu_dev->ops->vgpu_set_irqs(vdev->vgpu_dev,
+							  hdr.flags,
+							  hdr.index, hdr.start,
+							  hdr.count, data);
+		}
+		kfree(data);
+		return ret;
+	}
+
+	default:
+		return -EINVAL;
+	}
+	return ret;
+}
+
+ssize_t vgpu_dev_config_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+	int cfg_size = vdev->bar_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
+	int ret = 0;
+	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
+
+	if (pos >= cfg_size || pos + count > cfg_size) {
+		printk(KERN_ERR "%s pos 0x%llx out of range\n", __FUNCTION__, pos);
+		ret = -EFAULT;
+		goto config_rw_exit;
+	}
+
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto config_rw_exit;
+		}
+
+		if (gpu_dev->ops->write) {
+			ret = gpu_dev->ops->write(vgpu_dev,
+						  user_data,
+						  count,
+						  vgpu_emul_space_config,
+						  pos);
+		}
+
+		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kzalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto config_rw_exit;
+		}
+
+		if (gpu_dev->ops->read) {
+			ret = gpu_dev->ops->read(vgpu_dev,
+						 ret_data,
+						 count,
+						 vgpu_emul_space_config,
+						 pos);
+		}
+
+		if (ret > 0 ) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+				kfree(ret_data);
+				goto config_rw_exit;
+			}
+
+			memcpy((void *)(vdev->vconfig + pos), (void *)ret_data, count);
+		}
+		kfree(ret_data);
+	}
+config_rw_exit:
+	return ret;
+}
+
+ssize_t vgpu_dev_bar_rw(struct vfio_vgpu_device *vdev, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
+	loff_t pos;
+	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	int ret = 0;
+
+	if (!vdev->bar_info[bar_index].start) {
+		ret = vdev_read_base(vdev);
+		if (ret)
+			goto bar_rw_exit;
+	}
+
+	if (offset >= vdev->bar_info[bar_index].size) {
+		ret = -EINVAL;
+		goto bar_rw_exit;
+	}
+
+	pos = vdev->bar_info[bar_index].start + offset;
+	if (iswrite) {
+		char *user_data = kmalloc(count, GFP_KERNEL);
+
+		if (user_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		if (copy_from_user(user_data, buf, count)) {
+			ret = -EFAULT;
+			kfree(user_data);
+			goto bar_rw_exit;
+		}
+
+		if (gpu_dev->ops->write) {
+			ret = gpu_dev->ops->write(vgpu_dev,
+						  user_data,
+						  count,
+						  vgpu_emul_space_mmio,
+						  pos);
+		}
+
+		kfree(user_data);
+	}
+	else
+	{
+		char *ret_data = kmalloc(count, GFP_KERNEL);
+
+		if (ret_data == NULL) {
+			ret = -ENOMEM;
+			goto bar_rw_exit;
+		}
+
+		memset(ret_data, 0, count);
+
+		if (gpu_dev->ops->read) {
+			ret = gpu_dev->ops->read(vgpu_dev,
+						 ret_data,
+						 count,
+						 vgpu_emul_space_mmio,
+						 pos);
+		}
+
+		if (ret > 0 ) {
+			if (copy_to_user(buf, ret_data, ret)) {
+				ret = -EFAULT;
+			}
+		}
+		kfree(ret_data);
+	}
+
+bar_rw_exit:
+	return ret;
+}
+
+
+static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
+		size_t count, loff_t *ppos, bool iswrite)
+{
+	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	struct vfio_vgpu_device *vdev = device_data;
+
+	if (index >= VFIO_PCI_NUM_REGIONS)
+		return -EINVAL;
+
+	switch (index) {
+	case VFIO_PCI_CONFIG_REGION_INDEX:
+		return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+		return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
+
+	case VFIO_PCI_ROM_REGION_INDEX:
+	case VFIO_PCI_VGA_REGION_INDEX:
+		break;
+	}
+
+	return -EINVAL;
+}
+
+
+static ssize_t vgpu_dev_read(void *device_data, char __user *buf,
+			     size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, buf, count, ppos, false);
+
+	return ret;
+}
+
+static ssize_t vgpu_dev_write(void *device_data, const char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	int ret = 0;
+
+	if (count)
+		ret = vgpu_dev_rw(device_data, (char *)buf, count, ppos, true);
+
+	return ret;
+}
+
+static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	int ret = 0;
+	struct vfio_vgpu_device *vdev = vma->vm_private_data;
+	struct vgpu_device *vgpu_dev;
+	struct gpu_device *gpu_dev;
+	u64 virtaddr = (u64)vmf->virtual_address;
+	u64 offset, phyaddr;
+	unsigned long req_size, pgoff;
+	pgprot_t pg_prot;
+
+	if (!vdev || !vdev->vgpu_dev)
+		return VM_FAULT_SIGBUS;
+
+	vgpu_dev = vdev->vgpu_dev;
+	gpu_dev  = vgpu_dev->gpu_dev;
+
+	offset   = vma->vm_pgoff << PAGE_SHIFT;
+	phyaddr  = virtaddr - vma->vm_start + offset;
+	pgoff    = phyaddr >> PAGE_SHIFT;
+	req_size = vma->vm_end - virtaddr;
+	pg_prot  = vma->vm_page_prot;
+
+	if (gpu_dev->ops->validate_map_request) {
+		ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr, &pgoff,
+							 &req_size, &pg_prot);
+		if (ret || !req_size)
+			return VM_FAULT_SIGBUS;
+	}
+
+	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
+
+	return ret ? VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct vgpu_dev_mmio_ops = {
+	.fault = vgpu_dev_mmio_fault,
+};
+
+
+static int vgpu_dev_mmap(void *device_data, struct vm_area_struct *vma)
+{
+	unsigned int index;
+	struct vfio_vgpu_device *vdev = device_data;
+	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
+	struct pci_dev *pdev = vgpu_dev->gpu_dev->dev;
+	unsigned long pgoff;
+
+	loff_t offset = vma->vm_pgoff << PAGE_SHIFT;
+
+	index = VFIO_PCI_OFFSET_TO_INDEX(offset);
+
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return -EINVAL;
+
+	pgoff = vma->vm_pgoff &
+		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
+
+	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
+
+	vma->vm_private_data = vdev;
+	vma->vm_ops = &vgpu_dev_mmio_ops;
+
+	return 0;
+}
+
+static const struct vfio_device_ops vgpu_vfio_dev_ops = {
+	.name		= "vfio-vgpu",
+	.open		= vgpu_dev_open,
+	.release	= vgpu_dev_close,
+	.ioctl		= vgpu_dev_unlocked_ioctl,
+	.read		= vgpu_dev_read,
+	.write		= vgpu_dev_write,
+	.mmap		= vgpu_dev_mmap,
+};
+
+int vgpu_vfio_probe(struct device *dev)
+{
+	struct vfio_vgpu_device *vdev;
+	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
+	int ret = 0;
+
+	if (vgpu_dev == NULL)
+		return -EINVAL;
+
+	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+	if (!vdev) {
+		return -ENOMEM;
+	}
+
+	vdev->vgpu_dev = vgpu_dev;
+	vdev->group = vgpu_dev->group;
+
+	ret = vfio_add_group_dev(dev, &vgpu_vfio_dev_ops, vdev);
+	if (ret)
+		kfree(vdev);
+
+	printk(KERN_INFO "%s ret = %d\n", __FUNCTION__, ret);
+	return ret;
+}
+
+void vgpu_vfio_remove(struct device *dev)
+{
+	struct vfio_vgpu_device *vdev;
+
+	printk(KERN_INFO "%s \n", __FUNCTION__);
+	vdev = vfio_del_group_dev(dev);
+	if (vdev) {
+		printk(KERN_INFO "%s vdev being freed\n", __FUNCTION__);
+		kfree(vdev);
+	}
+}
+
+struct vgpu_driver vgpu_vfio_driver = {
+        .name	= "vgpu-vfio",
+        .probe	= vgpu_vfio_probe,
+        .remove	= vgpu_vfio_remove,
+};
+
+static int __init vgpu_vfio_init(void)
+{
+	printk(KERN_INFO "%s \n", __FUNCTION__);
+	return vgpu_register_driver(&vgpu_vfio_driver, THIS_MODULE);
+}
+
+static void __exit vgpu_vfio_exit(void)
+{
+	printk(KERN_INFO "%s \n", __FUNCTION__);
+	vgpu_unregister_driver(&vgpu_vfio_driver);
+}
+
+module_init(vgpu_vfio_init)
+module_exit(vgpu_vfio_exit)
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 154+ messages in thread

* [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-02 18:40 ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-02 18:40   ` Kirti Wankhede
  -1 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-02 18:40 UTC (permalink / raw)
  To: alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, kevin.tian, shuai.ruan, jike.song, zhiyuan.lv,
	Kirti Wankhede

The VFIO Type1 IOMMU driver is designed for devices which are IOMMU capable.
vGPU devices only use the IOMMU TYPE1 API; the underlying hardware can be
managed by an IOMMU domain. To reuse most of the IOMMU driver code for vGPU
devices, the type1 IOMMU driver is modified to support vGPU devices. This
change exports functions to pin and unpin pages for vGPU devices.
It maintains data about pinned pages for the vGPU domain. This data is used to
verify unpinning requests and also to unpin pages from detach_group().
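
For illustration only (not part of this patch), a rough sketch of how a vendor
GPU driver might call the exported interfaces; my_vgpu_map_guest_pages() and
my_vgpu_unmap_guest_pages() are made-up names, partial-pin error handling is
elided, and <linux/iommu.h>, <linux/vfio.h> and <linux/vgpu.h> are assumed to
be included:

static int my_vgpu_map_guest_pages(struct vgpu_device *my_vgpu,
				   dma_addr_t *gfns, dma_addr_t *hpfns,
				   long count)
{
	int pinned;

	/* iommu_data is stashed on the vgpu_device in attach_group() */
	pinned = vfio_pin_pages(my_vgpu->iommu_data, gfns, count,
				IOMMU_READ | IOMMU_WRITE, hpfns);
	if (pinned != count)
		return pinned < 0 ? pinned : -EFAULT;

	/* ... program the GPU's own MMU with hpfns[] here ... */
	return 0;
}

static void my_vgpu_unmap_guest_pages(struct vgpu_device *my_vgpu,
				      dma_addr_t *hpfns, long count)
{
	/* tear down the GPU mappings first, then drop the pins */
	vfio_unpin_pages(my_vgpu->iommu_data, hpfns, count,
			 IOMMU_READ | IOMMU_WRITE);
}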

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- two GPU pass through

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Neo Jia <cjia@nvidia.com>
Change-Id: I6e35e9fc7f14049226365e9ecef3814dc4ca1738
---
 drivers/vfio/vfio_iommu_type1.c |  427 ++++++++++++++++++++++++++++++++++++---
 include/linux/vfio.h            |    6 +
 include/linux/vgpu.h            |    4 +-
 3 files changed, 403 insertions(+), 34 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e9..a970854 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/workqueue.h>
+#include <linux/vgpu.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -67,6 +68,11 @@ struct vfio_domain {
 	struct list_head	group_list;
 	int			prot;		/* IOMMU_CACHE */
 	bool			fgsp;		/* Fine-grained super pages */
+	bool			vfio_iommu_api_only;	/* Domain for device which
+							   is without physical IOMMU */
+	struct mm_struct	*vmm_mm;	/* VMM's mm */
+	struct rb_root		pfn_list;	/* Host pfn list for requested gfns */
+	struct mutex		lock;		/* mutex for pfn_list */
 };
 
 struct vfio_dma {
@@ -83,6 +89,19 @@ struct vfio_group {
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_vgpu_pfn {
+	struct rb_node		node;
+	unsigned long		vmm_va;		/* VMM virtual addr */
+	dma_addr_t		iova;		/* IOVA */
+	unsigned long		npage;		/* number of pages */
+	unsigned long		pfn;		/* Host pfn */
+	int			prot;
+	atomic_t		ref_count;
+	struct list_head	next;
+};
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +149,53 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
 	rb_erase(&old->node, &iommu->dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_vgpu_pfn *vfio_find_vgpu_pfn(struct vfio_domain *domain,
+						unsigned long pfn)
+{
+	struct rb_node *node = domain->pfn_list.rb_node;
+
+	while (node) {
+		struct vfio_vgpu_pfn *vgpu_pfn = rb_entry(node, struct vfio_vgpu_pfn, node);
+
+		if (pfn < vgpu_pfn->pfn)
+			node = node->rb_left;
+		else if (pfn > vgpu_pfn->pfn)
+			node = node->rb_right;
+		else
+			return vgpu_pfn;
+	}
+
+	return NULL;
+}
+
+static void vfio_link_vgpu_pfn(struct vfio_domain *domain, struct vfio_vgpu_pfn *new)
+{
+	struct rb_node **link = &domain->pfn_list.rb_node, *parent = NULL;
+	struct vfio_vgpu_pfn *vgpu_pfn;
+
+	while (*link) {
+		parent = *link;
+		vgpu_pfn = rb_entry(parent, struct vfio_vgpu_pfn, node);
+
+		if (new->pfn <= vgpu_pfn->pfn)
+			link = &(*link)->rb_left;
+		else
+			link = &(*link)->rb_right;
+	}
+
+	rb_link_node(&new->node, parent, link);
+	rb_insert_color(&new->node, &domain->pfn_list);
+}
+
+static void vfio_unlink_vgpu_pfn(struct vfio_domain *domain, struct vfio_vgpu_pfn *old)
+{
+	rb_erase(&old->node, &domain->pfn_list);
+}
+
 struct vwork {
 	struct mm_struct	*mm;
 	long			npage;
@@ -228,20 +294,22 @@ static int put_pfn(unsigned long pfn, int prot)
 	return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+			 int prot, unsigned long *pfn)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
 	int ret = -EFAULT;
 
-	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+	if (get_user_pages_remote(NULL, mm, vaddr, 1, !!(prot & IOMMU_WRITE),
+				    0, page, NULL) == 1) {
 		*pfn = page_to_pfn(page[0]);
 		return 0;
 	}
 
-	down_read(&current->mm->mmap_sem);
+	down_read(&mm->mmap_sem);
 
-	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
 		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
@@ -249,28 +317,63 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
 			ret = 0;
 	}
 
-	up_read(&current->mm->mmap_sem);
+	up_read(&mm->mmap_sem);
 
 	return ret;
 }
 
 /*
+ * Get first domain with iommu and without iommu from iommu's domain_list for
+ * lookups
+ * @iommu [in]: iommu structure
+ * @domain [out]: domain with iommu
+ * @domain_vgpu [out] : domain without iommu for vGPU
+ */
+static void get_first_domains(struct vfio_iommu *iommu, struct vfio_domain **domain,
+			      struct vfio_domain **domain_vgpu)
+{
+	struct vfio_domain *d;
+
+	if (!domain || !domain_vgpu)
+		return;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		if (d->vfio_iommu_api_only && !*domain_vgpu)
+			*domain_vgpu = d;
+		else if (!*domain)
+			*domain = d;
+		if (*domain_vgpu && *domain)
+			break;
+	}
+}
+
+/*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages(unsigned long vaddr, long npage,
-			   int prot, unsigned long *pfn_base)
+static long vfio_pin_pages_internal(void *domain_data, unsigned long vaddr, long npage,
+		             int prot, unsigned long *pfn_base)
 {
+	struct vfio_domain *domain = domain_data;
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	bool lock_cap = capable(CAP_IPC_LOCK);
 	long ret, i;
 	bool rsvd;
+	struct mm_struct *mm;
 
-	if (!current->mm)
+	if (!domain)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
+	if (domain->vfio_iommu_api_only)
+		mm = domain->vmm_mm;
+	else
+		mm = current->mm;
+
+	if (!mm)
+		return -ENODEV;
+
+	ret = vaddr_get_pfn(mm, vaddr, prot, pfn_base);
 	if (ret)
 		return ret;
 
@@ -293,7 +396,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
 		unsigned long pfn = 0;
 
-		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		ret = vaddr_get_pfn(mm, vaddr, prot, &pfn);
 		if (ret)
 			break;
 
@@ -318,25 +421,183 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
 	return i;
 }
 
-static long vfio_unpin_pages(unsigned long pfn, long npage,
-			     int prot, bool do_accounting)
+static long vfio_unpin_pages_internal(void *domain_data, unsigned long pfn, long npage,
+				      int prot, bool do_accounting)
 {
+	struct vfio_domain *domain = domain_data;
 	unsigned long unlocked = 0;
 	long i;
 
+	if (!domain)
+		return -ENODEV;
+
 	for (i = 0; i < npage; i++)
 		unlocked += put_pfn(pfn++, prot);
 
 	if (do_accounting)
 		vfio_lock_acct(-unlocked);
+	return unlocked;
+}
+
+/*
+ * Pin a set of guest PFNs and return their associated host PFNs for vGPU.
+ * @vaddr [in]: array of guest PFNs
+ * @npage [in]: count of array elements
+ * @prot [in] : protection flags
+ * @pfn_base[out] : array of host PFNs
+ */
+int vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
+		   int prot, dma_addr_t *pfn_base)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL, *domain_vgpu = NULL;
+	int i = 0, ret = 0;
+	long retpage;
+	dma_addr_t remote_vaddr = 0;
+	dma_addr_t *pfn = pfn_base;
+	struct vfio_dma *dma;
+
+	if (!iommu || !pfn_base)
+		return -EINVAL;
+
+	if (list_empty(&iommu->domain_list)) {
+		ret = -EINVAL;
+		goto pin_done;
+	}
+
+	get_first_domains(iommu, &domain, &domain_vgpu);
+
+	// Return error if vGPU domain doesn't exist
+	if (!domain_vgpu) {
+		ret = -EINVAL;
+		goto pin_done;
+	}
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_vgpu_pfn *p;
+		struct vfio_vgpu_pfn *lpfn;
+		unsigned long tpfn;
+		dma_addr_t iova;
+
+		mutex_lock(&iommu->lock);
+
+		iova = vaddr[i] << PAGE_SHIFT;
+
+		dma = vfio_find_dma(iommu, iova, 0 /*  size */);
+		if (!dma) {
+			mutex_unlock(&iommu->lock);
+			ret = -EINVAL;
+			goto pin_done;
+		}
+
+		remote_vaddr = dma->vaddr + iova - dma->iova;
+
+		retpage = vfio_pin_pages_internal(domain_vgpu, remote_vaddr,
+						  (long)1, prot, &tpfn);
+		mutex_unlock(&iommu->lock);
+		if (retpage <= 0) {
+			WARN_ON(!retpage);
+			ret = (int)retpage;
+			goto pin_done;
+		}
+
+		pfn[i] = tpfn;
+
+		mutex_lock(&domain_vgpu->lock);
+
+		// search if pfn exist
+		if ((p = vfio_find_vgpu_pfn(domain_vgpu, tpfn))) {
+			atomic_inc(&p->ref_count);
+			mutex_unlock(&domain_vgpu->lock);
+			continue;
+		}
+
+		// add to pfn_list
+		lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
+		if (!lpfn) {
+			ret = -ENOMEM;
+			mutex_unlock(&domain_vgpu->lock);
+			goto pin_done;
+		}
+		lpfn->vmm_va = remote_vaddr;
+		lpfn->iova = iova;
+		lpfn->pfn = pfn[i];
+		lpfn->npage = 1;
+		lpfn->prot = prot;
+		atomic_inc(&lpfn->ref_count);
+		vfio_link_vgpu_pfn(domain_vgpu, lpfn);
+		mutex_unlock(&domain_vgpu->lock);
+	}
+
+	ret = i;
+
+pin_done:
+	return ret;
+}
+EXPORT_SYMBOL(vfio_pin_pages);
+
+static int vfio_vgpu_unpin_pfn(struct vfio_domain *domain, struct vfio_vgpu_pfn *vpfn)
+{
+	int ret;
+
+	ret = vfio_unpin_pages_internal(domain, vpfn->pfn, vpfn->npage, vpfn->prot, true);
+
+	if (atomic_dec_and_test(&vpfn->ref_count)) {
+		// remove from pfn_list
+		vfio_unlink_vgpu_pfn(domain, vpfn);
+		kfree(vpfn);
+	}
+
+	return ret;
+}
+
+/*
+ * Unpin set of host PFNs for vGPU.
+ * @pfn	[in] : array of host PFNs to be unpinned.
+ * @npage [in] :count of elements in array, that is number of pages.
+ * @prot [in] : protection flags
+ */
+int vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
+		     int prot)
+{
+	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain = NULL, *domain_vgpu = NULL;
+	long unlocked = 0;
+	int i;
+	if (!iommu || !pfn)
+		return -EINVAL;
+
+	if (list_empty(&iommu->domain_list))
+		return -EINVAL;
+
+	get_first_domains(iommu, &domain, &domain_vgpu);
+
+	// Return error if vGPU domain doesn't exist
+	if (!domain_vgpu)
+		return -EINVAL;
+
+	mutex_lock(&domain_vgpu->lock);
+
+	for (i = 0; i < npage; i++) {
+		struct vfio_vgpu_pfn *p;
+
+		// verify if pfn exist in pfn_list
+		if (!(p = vfio_find_vgpu_pfn(domain_vgpu, *(pfn + i)))) {
+			continue;
+		}
+
+		unlocked += vfio_vgpu_unpin_pfn(domain_vgpu, p);
+	}
+	mutex_unlock(&domain_vgpu->lock);
 
 	return unlocked;
 }
+EXPORT_SYMBOL(vfio_unpin_pages);
 
 static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
 	dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
-	struct vfio_domain *domain, *d;
+	struct vfio_domain *domain = NULL, *d, *domain_vgpu = NULL;
 	long unlocked = 0;
 
 	if (!dma->size)
@@ -348,12 +609,18 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 	 * pfns to unpin.  The rest need to be unmapped in advance so we have
 	 * no iommu translations remaining when the pages are unpinned.
 	 */
-	domain = d = list_first_entry(&iommu->domain_list,
-				      struct vfio_domain, next);
 
+	get_first_domains(iommu, &domain, &domain_vgpu);
+
+	if (!domain)
+		return;
+
+	d = domain;
 	list_for_each_entry_continue(d, &iommu->domain_list, next) {
-		iommu_unmap(d->domain, dma->iova, dma->size);
-		cond_resched();
+		if (!d->vfio_iommu_api_only) {
+			iommu_unmap(d->domain, dma->iova, dma->size);
+			cond_resched();
+		}
 	}
 
 	while (iova < end) {
@@ -382,7 +649,7 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 		if (WARN_ON(!unmapped))
 			break;
 
-		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
+		unlocked += vfio_unpin_pages_internal(domain, phys >> PAGE_SHIFT,
 					     unmapped >> PAGE_SHIFT,
 					     dma->prot, false);
 		iova += unmapped;
@@ -406,8 +673,10 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
 	unsigned long bitmap = ULONG_MAX;
 
 	mutex_lock(&iommu->lock);
-	list_for_each_entry(domain, &iommu->domain_list, next)
-		bitmap &= domain->domain->ops->pgsize_bitmap;
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		if (!domain->vfio_iommu_api_only)
+			bitmap &= domain->domain->ops->pgsize_bitmap;
+	}
 	mutex_unlock(&iommu->lock);
 
 	/*
@@ -517,6 +786,9 @@ static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
 	long i;
 	int ret;
 
+	if (domain->vfio_iommu_api_only)
+		return -EINVAL;
+
 	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
 		ret = iommu_map(domain->domain, iova,
 				(phys_addr_t)pfn << PAGE_SHIFT,
@@ -538,6 +810,9 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
 	int ret;
 
 	list_for_each_entry(d, &iommu->domain_list, next) {
+		if (d->vfio_iommu_api_only)
+			continue;
+
 		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
 				npage << PAGE_SHIFT, prot | d->prot);
 		if (ret) {
@@ -552,8 +827,11 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
 	return 0;
 
 unwind:
-	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
+	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next) {
+		if (d->vfio_iommu_api_only)
+			continue;
 		iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);
+	}
 
 	return ret;
 }
@@ -569,6 +847,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	uint64_t mask;
 	struct vfio_dma *dma;
 	unsigned long pfn;
+	struct vfio_domain *domain = NULL;
+	int domain_with_iommu_present = 0;
 
 	/* Verify that none of our __u64 fields overflow */
 	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
@@ -611,9 +891,22 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	/* Insert zero-sized and grow as we map chunks of it */
 	vfio_link_dma(iommu, dma);
 
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		if (!domain->vfio_iommu_api_only) {
+			domain_with_iommu_present = 1;
+			break;
+		}
+	}
+
+	// Skip pin and map only if domain without IOMMU is present
+	if (!domain_with_iommu_present) {
+		dma->size = size;
+		goto map_done;
+	}
+
 	while (size) {
 		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages(vaddr + dma->size,
+		npage = vfio_pin_pages_internal(domain, vaddr + dma->size,
 				       size >> PAGE_SHIFT, prot, &pfn);
 		if (npage <= 0) {
 			WARN_ON(!npage);
@@ -623,8 +916,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 
 		/* Map it! */
 		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
-		if (ret) {
-			vfio_unpin_pages(pfn, npage, prot, true);
+		if (ret){
+			vfio_unpin_pages_internal(domain, pfn, npage, prot, true);
 			break;
 		}
 
@@ -635,6 +928,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (ret)
 		vfio_remove_dma(iommu, dma);
 
+map_done:
 	mutex_unlock(&iommu->lock);
 	return ret;
 }
@@ -654,12 +948,15 @@ static int vfio_bus_type(struct device *dev, void *data)
 static int vfio_iommu_replay(struct vfio_iommu *iommu,
 			     struct vfio_domain *domain)
 {
-	struct vfio_domain *d;
+	struct vfio_domain *d = NULL, *d_vgpu = NULL;
 	struct rb_node *n;
 	int ret;
 
+	if (domain->vfio_iommu_api_only)
+		return -EINVAL;
+
 	/* Arbitrarily pick the first domain in the list for lookups */
-	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
+	get_first_domains(iommu, &d, &d_vgpu);
 	n = rb_first(&iommu->dma_list);
 
 	/* If there's not a domain, there better not be any mappings */
@@ -716,6 +1013,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
 	struct page *pages;
 	int ret, order = get_order(PAGE_SIZE * 2);
 
+	if (domain->vfio_iommu_api_only)
+		return;
+
 	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
 	if (!pages)
 		return;
@@ -769,6 +1069,23 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_free;
 
+	if (!iommu_present(bus) && (bus == &vgpu_bus_type)) {
+		struct vgpu_device *vgpu_dev = NULL;
+
+		vgpu_dev = get_vgpu_device_from_group(iommu_group);
+		if (!vgpu_dev)
+			goto out_free;
+
+		vgpu_dev->iommu_data = iommu;
+		domain->vfio_iommu_api_only = true;
+		domain->vmm_mm = current->mm;
+		INIT_LIST_HEAD(&domain->group_list);
+		list_add(&group->next, &domain->group_list);
+		domain->pfn_list = RB_ROOT;
+		mutex_init(&domain->lock);
+		goto out_success;
+	}
+
 	domain->domain = iommu_domain_alloc(bus);
 	if (!domain->domain) {
 		ret = -EIO;
@@ -834,6 +1151,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
 	if (ret)
 		goto out_detach;
 
+out_success:
 	list_add(&domain->next, &iommu->domain_list);
 
 	mutex_unlock(&iommu->lock);
@@ -854,11 +1172,36 @@ out_free:
 static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
 {
 	struct rb_node *node;
+	struct vfio_domain *domain = NULL, *domain_vgpu = NULL;
+
+	get_first_domains(iommu, &domain, &domain_vgpu);
+
+	if (domain_vgpu) {
+		int unlocked;
+		mutex_lock(&domain_vgpu->lock);
+		while ((node = rb_first(&domain_vgpu->pfn_list))) {
+			unlocked = vfio_vgpu_unpin_pfn(domain_vgpu,
+					rb_entry(node, struct vfio_vgpu_pfn, node));
+		}
+		mutex_unlock(&domain_vgpu->lock);
+	}
 
 	while ((node = rb_first(&iommu->dma_list)))
 		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
 
+static bool list_is_singular_iommu_domain(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *domain;
+	int domain_iommu = 0;
+
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		if (!domain->vfio_iommu_api_only)
+			domain_iommu++;
+	}
+	return (domain_iommu == 1);
+}
+
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
@@ -872,19 +1215,28 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
 		list_for_each_entry(group, &domain->group_list, next) {
 			if (group->iommu_group != iommu_group)
 				continue;
+			if (!domain->vfio_iommu_api_only)
+				iommu_detach_group(domain->domain, iommu_group);
+			else {
+				struct vgpu_device *vgpu_dev = NULL;
 
-			iommu_detach_group(domain->domain, iommu_group);
+				vgpu_dev = get_vgpu_device_from_group(iommu_group);
+				if (vgpu_dev)
+					vgpu_dev->iommu_data = NULL;
+
+			}
 			list_del(&group->next);
 			kfree(group);
 			/*
 			 * Group ownership provides privilege, if the group
 			 * list is empty, the domain goes away.  If it's the
-			 * last domain, then all the mappings go away too.
+			 * last domain with iommu, then all the mappings go away too.
 			 */
 			if (list_empty(&domain->group_list)) {
-				if (list_is_singular(&iommu->domain_list))
+				if (list_is_singular_iommu_domain(iommu))
 					vfio_iommu_unmap_unpin_all(iommu);
-				iommu_domain_free(domain->domain);
+				if (!domain->vfio_iommu_api_only)
+					iommu_domain_free(domain->domain);
 				list_del(&domain->next);
 				kfree(domain);
 			}
@@ -936,11 +1288,22 @@ static void vfio_iommu_type1_release(void *iommu_data)
 				 &iommu->domain_list, next) {
 		list_for_each_entry_safe(group, group_tmp,
 					 &domain->group_list, next) {
-			iommu_detach_group(domain->domain, group->iommu_group);
+			if (!domain->vfio_iommu_api_only)
+				iommu_detach_group(domain->domain, group->iommu_group);
+			else {
+				struct vgpu_device *vgpu_dev = NULL;
+
+				vgpu_dev = get_vgpu_device_from_group(group->iommu_group);
+				if (vgpu_dev)
+					vgpu_dev->iommu_data = NULL;
+
+			}
+
 			list_del(&group->next);
 			kfree(group);
 		}
-		iommu_domain_free(domain->domain);
+		if (!domain->vfio_iommu_api_only)
+			iommu_domain_free(domain->domain);
 		list_del(&domain->next);
 		kfree(domain);
 	}
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0ecae0b..d280868 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -127,6 +127,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
 }
 #endif /* CONFIG_EEH */
 
+extern int vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
+		          int prot, dma_addr_t *pfn_base);
+
+extern int vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
+			    int prot);
+
 /*
  * IRQfd - generic
  */
diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
index 03a77cf..cc18353 100644
--- a/include/linux/vgpu.h
+++ b/include/linux/vgpu.h
@@ -36,6 +36,7 @@ struct vgpu_device {
 	struct device		dev;
 	struct gpu_device	*gpu_dev;
 	struct iommu_group	*group;
+	void			*iommu_data;
 #define DEVICE_NAME_LEN		(64)
 	char			dev_name[DEVICE_NAME_LEN];
 	uuid_le			uuid;
@@ -209,8 +210,7 @@ extern void vgpu_unregister_driver(struct vgpu_driver *drv);
 
 extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
 				uint32_t len, uint32_t flags);
-extern int vgpu_dma_do_translate(dma_addr_t * gfn_buffer, uint32_t count);
 
-struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
+extern struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
 
 #endif /* VGPU_H */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-02 18:40   ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-03 10:40     ` Jike Song
  -1 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-03 10:40 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: alex.williamson, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, shuai.ruan, zhiyuan.lv

On 05/03/2016 02:40 AM, Kirti Wankhede wrote:
> VFIO Type1 IOMMU driver is designed for the devices which are IOMMU capable.
> vGPU device are only using IOMMU TYPE1 API, the underlying hardware can be
> managed by an IOMMU domain. To use most of the code of IOMMU driver for vGPU
> devices, type1 IOMMU driver is modified to support vGPU devices. This change
> exports functions to pin and unpin pages for vGPU devices.
> It maintains data of pinned pages for vGPU domain. This data is used to verify
> unpinning request and also used to unpin pages from detach_group().
> 
> Tested by assigning below combinations of devices to a single VM:
> - GPU pass through only
> - vGPU device only
> - One GPU pass through and one vGPU device
> - two GPU pass through

{the patch trimmed}

Hi Kirti,

 I have a question: in the scenario above, how many PCI BDFs do your vGPUs consume?

 Per my understanding, you take the GPA of a KVM guest as the IOVA of the IOMMU domain,
and if there are multiple guests with vGPU assigned, the vGPUs must belong to
different IOMMU domains (thereby having different BDFs).
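
(For reference, the per-page lookup the patch performs, reduced to an
illustrative helper -- the helper name is invented for illustration, the
calls are the patch's own:)

static int gpa_to_host_pfn_sketch(struct vfio_iommu *iommu,
				  struct vfio_domain *domain_vgpu,
				  unsigned long gfn, unsigned long *hpfn)
{
	dma_addr_t iova = (dma_addr_t)gfn << PAGE_SHIFT;	/* GPA used as IOVA */
	struct vfio_dma *dma = vfio_find_dma(iommu, iova, 0);

	if (!dma)
		return -EINVAL;

	/* VMM virtual address backing that guest page, then gup it */
	return vaddr_get_pfn(domain_vgpu->vmm_mm,
			     dma->vaddr + (iova - dma->iova),
			     IOMMU_READ | IOMMU_WRITE, hpfn);
}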

 Do I miss anything?


--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-02 18:40   ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-03 22:43     ` Alex Williamson
  -1 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-03 22:43 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv

On Tue, 3 May 2016 00:10:41 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VFIO Type1 IOMMU driver is designed for the devices which are IOMMU capable.
> vGPU device are only using IOMMU TYPE1 API, the underlying hardware can be
> managed by an IOMMU domain. To use most of the code of IOMMU driver for vGPU
> devices, type1 IOMMU driver is modified to support vGPU devices. This change
> exports functions to pin and unpin pages for vGPU devices.
> It maintains data of pinned pages for vGPU domain. This data is used to verify
> unpinning request and also used to unpin pages from detach_group().
> 
> Tested by assigning below combinations of devices to a single VM:
> - GPU pass through only
> - vGPU device only
> - One GPU pass through and one vGPU device
> - two GPU pass through
>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I6e35e9fc7f14049226365e9ecef3814dc4ca1738
> ---
>  drivers/vfio/vfio_iommu_type1.c |  427 ++++++++++++++++++++++++++++++++++++---
>  include/linux/vfio.h            |    6 +
>  include/linux/vgpu.h            |    4 +-
>  3 files changed, 403 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 75b24e9..a970854 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -36,6 +36,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/vfio.h>
>  #include <linux/workqueue.h>
> +#include <linux/vgpu.h>
>  
>  #define DRIVER_VERSION  "0.2"
>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
> @@ -67,6 +68,11 @@ struct vfio_domain {
>  	struct list_head	group_list;
>  	int			prot;		/* IOMMU_CACHE */
>  	bool			fgsp;		/* Fine-grained super pages */
> +	bool			vfio_iommu_api_only;	/* Domain for device which
> +							   is without physical IOMMU */
> +	struct mm_struct	*vmm_mm;	/* VMM's mm */

I really don't like assuming a VMM, vfio is a userspace driver
interface, this is just an mm associated with this set of mappings.

> +	struct rb_root		pfn_list;	/* Host pfn list for requested gfns */

This is just an iova to pfn list, whether it's a guest running on a
VMM, we don't care.

> +	struct mutex		lock;		/* mutex for pfn_list */

So pfn_list_lock might be a better name for it.

>  };
>  
>  struct vfio_dma {
> @@ -83,6 +89,19 @@ struct vfio_group {
>  };
>  
>  /*
> + * Guest RAM pinning working set or DMA target
> + */
> +struct vfio_vgpu_pfn {
> +	struct rb_node		node;
> +	unsigned long		vmm_va;		/* VMM virtual addr */

vaddr

> +	dma_addr_t		iova;		/* IOVA */
> +	unsigned long		npage;		/* number of pages */
> +	unsigned long		pfn;		/* Host pfn */
> +	int			prot;
> +	atomic_t		ref_count;
> +	struct list_head	next;
> +};

Why is any of this vgpu specific?  It's just a data structure for
tracking iova to vaddr pins.

> +/*
>   * This code handles mapping and unmapping of user data buffers
>   * into DMA'ble space using the IOMMU
>   */
> @@ -130,6 +149,53 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
>  	rb_erase(&old->node, &iommu->dma_list);
>  }
>  
> +/*
> + * Helper Functions for host pfn list
> + */
> +
> +static struct vfio_vgpu_pfn *vfio_find_vgpu_pfn(struct vfio_domain *domain,
> +						unsigned long pfn)
> +{
> +	struct rb_node *node = domain->pfn_list.rb_node;
> +
> +	while (node) {
> +		struct vfio_vgpu_pfn *vgpu_pfn = rb_entry(node, struct vfio_vgpu_pfn, node);
> +
> +		if (pfn <= vgpu_pfn->pfn)
> +			node = node->rb_left;
> +		else if (pfn >= vgpu_pfn->pfn)
> +			node = node->rb_right;
> +		else
> +			return vgpu_pfn;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void vfio_link_vgpu_pfn(struct vfio_domain *domain, struct vfio_vgpu_pfn *new)
> +{
> +	struct rb_node **link = &domain->pfn_list.rb_node, *parent = NULL;
> +	struct vfio_vgpu_pfn *vgpu_pfn;
> +
> +	while (*link) {
> +		parent = *link;
> +		vgpu_pfn = rb_entry(parent, struct vfio_vgpu_pfn, node);
> +
> +		if (new->pfn <= vgpu_pfn->pfn)
> +			link = &(*link)->rb_left;
> +		else
> +			link = &(*link)->rb_right;
> +	}
> +
> +	rb_link_node(&new->node, parent, link);
> +	rb_insert_color(&new->node, &domain->pfn_list);
> +}
> +
> +static void vfio_unlink_vgpu_pfn(struct vfio_domain *domain, struct vfio_vgpu_pfn *old)
> +{
> +	rb_erase(&old->node, &domain->pfn_list);
> +}
> +

None of the above is really vgpu specific either, just managing a tree
of mappings.  Name things based on what they do, not who they're for.
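
(Purely as an illustration of that naming -- fields trimmed to the essentials,
nothing vgpu-specific:)

struct vfio_pfn {
	struct rb_node		node;
	unsigned long		vaddr;		/* user virtual address */
	dma_addr_t		iova;
	unsigned long		pfn;		/* host pfn */
	int			prot;
	atomic_t		ref_count;
};

static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
				      unsigned long pfn);
static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new);
static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old);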

>  struct vwork {
>  	struct mm_struct	*mm;
>  	long			npage;
> @@ -228,20 +294,22 @@ static int put_pfn(unsigned long pfn, int prot)
>  	return 0;
>  }
>  
> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> +			 int prot, unsigned long *pfn)
>  {
>  	struct page *page[1];
>  	struct vm_area_struct *vma;
>  	int ret = -EFAULT;
>  
> -	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> +	if (get_user_pages_remote(NULL, mm, vaddr, 1, !!(prot & IOMMU_WRITE),
> +				    0, page, NULL) == 1) {

AIUI, _remote requires the mmap_sem to be held, _fast does not.  I don't
see that being accounted for anywhere.
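
(i.e. something along these lines around the _remote call -- sketch only,
reusing the variables already in vaddr_get_pfn():)

	down_read(&mm->mmap_sem);
	if (get_user_pages_remote(NULL, mm, vaddr, 1, !!(prot & IOMMU_WRITE),
				  0, page, NULL) == 1) {
		up_read(&mm->mmap_sem);
		*pfn = page_to_pfn(page[0]);
		return 0;
	}
	up_read(&mm->mmap_sem);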

>  		*pfn = page_to_pfn(page[0]);
>  		return 0;
>  	}
>  
> -	down_read(&current->mm->mmap_sem);
> +	down_read(&mm->mmap_sem);
>  
> -	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
>  
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
>  		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> @@ -249,28 +317,63 @@ static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>  			ret = 0;
>  	}
>  
> -	up_read(&current->mm->mmap_sem);
> +	up_read(&mm->mmap_sem);
>  
>  	return ret;
>  }
>  
>  /*
> + * Get first domain with iommu and without iommu from iommu's domain_list for
> + * lookups
> + * @iommu [in]: iommu structure
> + * @domain [out]: domain with iommu
> + * @domain_vgpu [out] : domain without iommu for vGPU
> + */
> +static void get_first_domains(struct vfio_iommu *iommu, struct vfio_domain **domain,
> +			      struct vfio_domain **domain_vgpu)
> +{
> +	struct vfio_domain *d;
> +
> +	if (!domain || !domain_vgpu)
> +		return;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		if (d->vfio_iommu_api_only && !*domain_vgpu)
> +			*domain_vgpu = d;
> +		else if (!*domain)
> +			*domain = d;
> +		if (*domain_vgpu && *domain)
> +			break;
> +	}
> +}

This looks like pure overhead for existing code.  Also no need to
introduce the concept of vgpu here.  

> +
> +/*
>   * Attempt to pin pages.  We really don't want to track all the pfns and
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>   * first page and all consecutive pages with the same locking.
>   */
> -static long vfio_pin_pages(unsigned long vaddr, long npage,
> -			   int prot, unsigned long *pfn_base)
> +static long vfio_pin_pages_internal(void *domain_data, unsigned long vaddr, long npage,

It appears we know this as a struct vfio_domain in all callers, why is
this using void*?

> +		             int prot, unsigned long *pfn_base)
>  {
> +	struct vfio_domain *domain = domain_data;
>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  	bool lock_cap = capable(CAP_IPC_LOCK);
>  	long ret, i;
>  	bool rsvd;
> +	struct mm_struct *mm;
>  
> -	if (!current->mm)
> +	if (!domain)
>  		return -ENODEV;
>  
> -	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> +	if (domain->vfio_iommu_api_only)
> +		mm = domain->vmm_mm;
> +	else
> +		mm = current->mm;
> +
> +	if (!mm)
> +		return -ENODEV;
> +
> +	ret = vaddr_get_pfn(mm, vaddr, prot, pfn_base);

We could pass domain->mm unconditionally to vaddr_get_pfn(), let it be
NULL in the !api_only case and use it as a cue to vaddr_get_pfn() which
gup variant to use.  Of course we need to deal with mmap_sem somewhere
too without turning the code into swiss cheese.
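
ie. roughly, inside vaddr_get_pfn(), ignoring the mmap_sem issue noted
above:

	if (!mm) {
		/* no mm handed in: existing behavior, current task */
		mm = current->mm;
		if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE),
					page) == 1) {
			*pfn = page_to_pfn(page[0]);
			return 0;
		}
	} else if (get_user_pages_remote(NULL, mm, vaddr, 1,
					 !!(prot & IOMMU_WRITE), 0,
					 page, NULL) == 1) {
		*pfn = page_to_pfn(page[0]);
		return 0;
	}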

Correct me if I'm wrong, but I assume the main benefit of interweaving
this into type1 vs pulling out common code and making a new vfio iommu
backend is the page accounting, ie. not over accounting locked pages.
TBH, I don't know if it's worth it.  Any idea what the high water mark
of pinned pages for a vgpu might be?

>  	if (ret)
>  		return ret;
>  
> @@ -293,7 +396,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
>  		unsigned long pfn = 0;
>  
> -		ret = vaddr_get_pfn(vaddr, prot, &pfn);
> +		ret = vaddr_get_pfn(mm, vaddr, prot, &pfn);
>  		if (ret)
>  			break;
>  
> @@ -318,25 +421,183 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
>  	return i;
>  }
>  
> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> -			     int prot, bool do_accounting)
> +static long vfio_unpin_pages_internal(void *domain_data, unsigned long pfn, long npage,

Again, all the callers seem to know they have a struct vfio_domain*, I
don't see any justification for using a void* here.

It seems less disruptive to leave these function names alone and make
the new functions _external.

> +				      int prot, bool do_accounting)
>  {
> +	struct vfio_domain *domain = domain_data;
>  	unsigned long unlocked = 0;
>  	long i;
>  
> +	if (!domain)
> +		return -ENODEV;
> +

How is this possible?  Callers of this function need to be updated for
a possible negative return or accounting gets really broken.
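
For example, the caller in vfio_unmap_unpin() would need to become
something like:

		long unpinned = vfio_unpin_pages_internal(domain,
						phys >> PAGE_SHIFT,
						unmapped >> PAGE_SHIFT,
						dma->prot, false);

		if (WARN_ON(unpinned < 0))
			break;

		unlocked += unpinned;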

>  	for (i = 0; i < npage; i++)
>  		unlocked += put_pfn(pfn++, prot);
>  
>  	if (do_accounting)
>  		vfio_lock_acct(-unlocked);
> +	return unlocked;
> +}
> +
> +/*
> + * Pin a set of guest PFNs and return their associated host PFNs for vGPU.
> + * @vaddr [in]: array of guest PFNs
> + * @npage [in]: count of array elements
> + * @prot [in] : protection flags
> + * @pfn_base[out] : array of host PFNs
> + */
> +int vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
> +		   int prot, dma_addr_t *pfn_base)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL, *domain_vgpu = NULL;
> +	int i = 0, ret = 0;
> +	long retpage;
> +	dma_addr_t remote_vaddr = 0;
> +	dma_addr_t *pfn = pfn_base;
> +	struct vfio_dma *dma;
> +
> +	if (!iommu || !pfn_base)
> +		return -EINVAL;
> +
> +	if (list_empty(&iommu->domain_list)) {
> +		ret = -EINVAL;
> +		goto pin_done;
> +	}
> +
> +	get_first_domains(iommu, &domain, &domain_vgpu);
> +
> +	// Return error if vGPU domain doesn't exist

No c++ style comments please.

> +	if (!domain_vgpu) {
> +		ret = -EINVAL;
> +		goto pin_done;
> +	}
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_vgpu_pfn *p;
> +		struct vfio_vgpu_pfn *lpfn;
> +		unsigned long tpfn;
> +		dma_addr_t iova;
> +
> +		mutex_lock(&iommu->lock);
> +
> +		iova = vaddr[i] << PAGE_SHIFT;
> +
> +		dma = vfio_find_dma(iommu, iova, 0 /*  size */);
> +		if (!dma) {
> +			mutex_unlock(&iommu->lock);
> +			ret = -EINVAL;
> +			goto pin_done;
> +		}
> +
> +		remote_vaddr = dma->vaddr + iova - dma->iova;
> +
> +		retpage = vfio_pin_pages_internal(domain_vgpu, remote_vaddr,
> +						  (long)1, prot, &tpfn);
> +		mutex_unlock(&iommu->lock);
> +		if (retpage <= 0) {
> +			WARN_ON(!retpage);
> +			ret = (int)retpage;
> +			goto pin_done;
> +		}
> +
> +		pfn[i] = tpfn;
> +
> +		mutex_lock(&domain_vgpu->lock);
> +
> +		// search if pfn exist
> +		if ((p = vfio_find_vgpu_pfn(domain_vgpu, tpfn))) {
> +			atomic_inc(&p->ref_count);
> +			mutex_unlock(&domain_vgpu->lock);
> +			continue;
> +		}

The only reason I can come up with for why we'd want to integrate an
api-only domain into the existing type1 code would be to avoid page
accounting issues where we count locked pages once for a normal
assigned device and again for a vgpu, but that's not what we're doing
here.  We're not only locking the pages again regardless of them
already being locked, we're counting every time we lock them through
this new interface.  So there's really no point at all to making type1
become this unsupportable.  In that case we should be pulling out the
common code that we want to share from type1 and making a new type1
compatible vfio iommu backend rather than conditionalizing everything
here.
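
ie. register a separate backend next to type1 and share helpers rather
than special-casing everything; rough sketch, all names below are made
up:

static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_vgpu = {
	.name		= "vfio-iommu-vgpu",
	.owner		= THIS_MODULE,
	.open		= vfio_iommu_vgpu_open,
	.release	= vfio_iommu_vgpu_release,
	.ioctl		= vfio_iommu_vgpu_ioctl,	/* type1-compatible ioctls */
	.attach_group	= vfio_iommu_vgpu_attach_group,
	.detach_group	= vfio_iommu_vgpu_detach_group,
};

static int __init vfio_iommu_vgpu_init(void)
{
	return vfio_register_iommu_driver(&vfio_iommu_driver_ops_vgpu);
}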

> +
> +		// add to pfn_list
> +		lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
> +		if (!lpfn) {
> +			ret = -ENOMEM;
> +			mutex_unlock(&domain_vgpu->lock);
> +			goto pin_done;
> +		}
> +		lpfn->vmm_va = remote_vaddr;
> +		lpfn->iova = iova;
> +		lpfn->pfn = pfn[i];
> +		lpfn->npage = 1;
> +		lpfn->prot = prot;
> +		atomic_inc(&lpfn->ref_count);
> +		vfio_link_vgpu_pfn(domain_vgpu, lpfn);
> +		mutex_unlock(&domain_vgpu->lock);
> +	}
> +
> +	ret = i;
> +
> +pin_done:
> +	return ret;
> +}
> +EXPORT_SYMBOL(vfio_pin_pages);
> +
> +static int vfio_vgpu_unpin_pfn(struct vfio_domain *domain, struct vfio_vgpu_pfn *vpfn)
> +{
> +	int ret;
> +
> +	ret = vfio_unpin_pages_internal(domain, vpfn->pfn, vpfn->npage, vpfn->prot, true);
> +
> +	if (atomic_dec_and_test(&vpfn->ref_count)) {
> +		// remove from pfn_list
> +		vfio_unlink_vgpu_pfn(domain, vpfn);
> +		kfree(vpfn);
> +	}
> +
> +	return ret;
> +}
> +
> +/*
> + * Unpin set of host PFNs for vGPU.
> + * @pfn	[in] : array of host PFNs to be unpinned.
> + * @npage [in] :count of elements in array, that is number of pages.
> + * @prot [in] : protection flags
> + */
> +int vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
> +		     int prot)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain = NULL, *domain_vgpu = NULL;
> +	long unlocked = 0;
> +	int i;
> +	if (!iommu || !pfn)
> +		return -EINVAL;
> +
> +	if (list_empty(&iommu->domain_list))
> +		return -EINVAL;
> +
> +	get_first_domains(iommu, &domain, &domain_vgpu);
> +
> +	// Return error if vGPU domain doesn't exist
> +	if (!domain_vgpu)
> +		return -EINVAL;
> +
> +	mutex_lock(&domain_vgpu->lock);
> +
> +	for (i = 0; i < npage; i++) {
> +		struct vfio_vgpu_pfn *p;
> +
> +		// verify if pfn exist in pfn_list
> +		if (!(p = vfio_find_vgpu_pfn(domain_vgpu, *(pfn + i)))) {
> +			continue;

How does the caller deal with this?  The function returns the number of
pages unpinned, which will not match the requested number of pages to
unpin if any are missing.  Also, no setting variables within a test when
easily avoidable please; separate into a set, then a test.
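
ie.:

		p = vfio_find_vgpu_pfn(domain_vgpu, pfn[i]);
		if (!p)
			continue;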

> +		}
> +
> +		unlocked += vfio_vgpu_unpin_pfn(domain_vgpu, p);
> +	}
> +	mutex_unlock(&domain_vgpu->lock);
>  
>  	return unlocked;
>  }
> +EXPORT_SYMBOL(vfio_unpin_pages);
>  
>  static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  {
>  	dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
> -	struct vfio_domain *domain, *d;
> +	struct vfio_domain *domain = NULL, *d, *domain_vgpu = NULL;
>  	long unlocked = 0;
>  
>  	if (!dma->size)
> @@ -348,12 +609,18 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  	 * pfns to unpin.  The rest need to be unmapped in advance so we have
>  	 * no iommu translations remaining when the pages are unpinned.
>  	 */
> -	domain = d = list_first_entry(&iommu->domain_list,
> -				      struct vfio_domain, next);
>  
> +	get_first_domains(iommu, &domain, &domain_vgpu);
> +
> +	if (!domain)
> +		return;
> +
> +	d = domain;
>  	list_for_each_entry_continue(d, &iommu->domain_list, next) {
> -		iommu_unmap(d->domain, dma->iova, dma->size);
> -		cond_resched();
> +		if (!d->vfio_iommu_api_only) {
> +			iommu_unmap(d->domain, dma->iova, dma->size);
> +			cond_resched();
> +		}
>  	}
>  
>  	while (iova < end) {

How do api-only domains not blow up on the iommu API code in this next
code block?  Are you just getting lucky that the api-only domain is
first in the list and the real domain is last?

> @@ -382,7 +649,7 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
>  		if (WARN_ON(!unmapped))
>  			break;
>  
> -		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
> +		unlocked += vfio_unpin_pages_internal(domain, phys >> PAGE_SHIFT,
>  					     unmapped >> PAGE_SHIFT,
>  					     dma->prot, false);
>  		iova += unmapped;
> @@ -406,8 +673,10 @@ static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  	unsigned long bitmap = ULONG_MAX;
>  
>  	mutex_lock(&iommu->lock);
> -	list_for_each_entry(domain, &iommu->domain_list, next)
> -		bitmap &= domain->domain->ops->pgsize_bitmap;
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		if (!domain->vfio_iommu_api_only)
> +			bitmap &= domain->domain->ops->pgsize_bitmap;
> +	}
>  	mutex_unlock(&iommu->lock);
>  
>  	/*
> @@ -517,6 +786,9 @@ static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
>  	long i;
>  	int ret;
>  
> +	if (domain->vfio_iommu_api_only)
> +		return -EINVAL;
> +
>  	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
>  		ret = iommu_map(domain->domain, iova,
>  				(phys_addr_t)pfn << PAGE_SHIFT,
> @@ -538,6 +810,9 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
>  	int ret;
>  
>  	list_for_each_entry(d, &iommu->domain_list, next) {
> +		if (d->vfio_iommu_api_only)
> +			continue;
> +

Really disliking all these switches everywhere, too many different code
paths.

>  		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
>  				npage << PAGE_SHIFT, prot | d->prot);
>  		if (ret) {
> @@ -552,8 +827,11 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
>  	return 0;
>  
>  unwind:
> -	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
> +	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next) {
> +		if (d->vfio_iommu_api_only)
> +			continue;
>  		iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);
> +	}
>  
>  	return ret;
>  }
> @@ -569,6 +847,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	uint64_t mask;
>  	struct vfio_dma *dma;
>  	unsigned long pfn;
> +	struct vfio_domain *domain = NULL;
> +	int domain_with_iommu_present = 0;
>  
>  	/* Verify that none of our __u64 fields overflow */
>  	if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> @@ -611,9 +891,22 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	/* Insert zero-sized and grow as we map chunks of it */
>  	vfio_link_dma(iommu, dma);
>  
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		if (!domain->vfio_iommu_api_only) {
> +			domain_with_iommu_present = 1;
> +			break;
> +		}
> +	}
> +
> +	// Skip pin and map only if domain without IOMMU is present
> +	if (!domain_with_iommu_present) {
> +		dma->size = size;
> +		goto map_done;
> +	}
> +

Yet more special cases, the code is getting unsupportable.

>  	while (size) {
>  		/* Pin a contiguous chunk of memory */
> -		npage = vfio_pin_pages(vaddr + dma->size,
> +		npage = vfio_pin_pages_internal(domain, vaddr + dma->size,
>  				       size >> PAGE_SHIFT, prot, &pfn);
>  		if (npage <= 0) {
>  			WARN_ON(!npage);
> @@ -623,8 +916,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  
>  		/* Map it! */
>  		ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, prot);
> -		if (ret) {
> -			vfio_unpin_pages(pfn, npage, prot, true);
> +		if (ret){
> +			vfio_unpin_pages_internal(domain, pfn, npage, prot, true);
>  			break;
>  		}
>  
> @@ -635,6 +928,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	if (ret)
>  		vfio_remove_dma(iommu, dma);
>  
> +map_done:
>  	mutex_unlock(&iommu->lock);
>  	return ret;
>  }
> @@ -654,12 +948,15 @@ static int vfio_bus_type(struct device *dev, void *data)
>  static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  			     struct vfio_domain *domain)
>  {
> -	struct vfio_domain *d;
> +	struct vfio_domain *d = NULL, *d_vgpu = NULL;
>  	struct rb_node *n;
>  	int ret;
>  
> +	if (domain->vfio_iommu_api_only)
> +		return -EINVAL;

Huh?  This only does iommu API stuff, shouldn't we skip it w/o error
for api-only?
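
ie. just:

	if (domain->vfio_iommu_api_only)
		return 0;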

> +
>  	/* Arbitrarily pick the first domain in the list for lookups */
> -	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
> +	get_first_domains(iommu, &d, &d_vgpu);

Gag, probably should have used a separate list.

>  	n = rb_first(&iommu->dma_list);
>  
>  	/* If there's not a domain, there better not be any mappings */
> @@ -716,6 +1013,9 @@ static void vfio_test_domain_fgsp(struct vfio_domain *domain)
>  	struct page *pages;
>  	int ret, order = get_order(PAGE_SIZE * 2);
>  
> +	if (domain->vfio_iommu_api_only)
> +		return;
> +
>  	pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
>  	if (!pages)
>  		return;
> @@ -769,6 +1069,23 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_free;
>  
> +	if (!iommu_present(bus) && (bus == &vgpu_bus_type)) {
> +		struct vgpu_device *vgpu_dev = NULL;
> +
> +		vgpu_dev = get_vgpu_device_from_group(iommu_group);
> +		if (!vgpu_dev)
> +			goto out_free;
> +
> +		vgpu_dev->iommu_data = iommu;

Probably better to have a vgpu_set_iommu_data() function rather than
manipulate it ourselves.  We also have no guarantees about races since
vgpus have no reference counting.
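
Something like this (name made up) in the vgpu core would at least keep
the layering:

void vgpu_set_iommu_data(struct vgpu_device *vgpu_dev, void *iommu_data)
{
	vgpu_dev->iommu_data = iommu_data;
}
EXPORT_SYMBOL(vgpu_set_iommu_data);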

> +		domain->vfio_iommu_api_only = true;
> +		domain->vmm_mm = current->mm;
> +		INIT_LIST_HEAD(&domain->group_list);
> +		list_add(&group->next, &domain->group_list);
> +		domain->pfn_list = RB_ROOT;
> +		mutex_init(&domain->lock);
> +		goto out_success;
> +	}

Very little sharing going on here.

> +
>  	domain->domain = iommu_domain_alloc(bus);
>  	if (!domain->domain) {
>  		ret = -EIO;
> @@ -834,6 +1151,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
>  	if (ret)
>  		goto out_detach;
>  
> +out_success:
>  	list_add(&domain->next, &iommu->domain_list);
>  
>  	mutex_unlock(&iommu->lock);
> @@ -854,11 +1172,36 @@ out_free:
>  static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
>  {
>  	struct rb_node *node;
> +	struct vfio_domain *domain = NULL, *domain_vgpu = NULL;
> +
> +	get_first_domains(iommu, &domain, &domain_vgpu);
> +
> +	if (domain_vgpu) {
> +		int unlocked;
> +		mutex_lock(&domain_vgpu->lock);
> +		while ((node = rb_first(&domain_vgpu->pfn_list))) {
> +			unlocked = vfio_vgpu_unpin_pfn(domain_vgpu,
> +					rb_entry(node, struct vfio_vgpu_pfn, node));

Why bother to store the return?
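
ie. simply:

		while ((node = rb_first(&domain_vgpu->pfn_list)))
			vfio_vgpu_unpin_pfn(domain_vgpu,
					    rb_entry(node,
						     struct vfio_vgpu_pfn,
						     node));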

> +		}
> +		mutex_unlock(&domain_vgpu->lock);
> +	}
>  
>  	while ((node = rb_first(&iommu->dma_list)))
>  		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
>  }
>  
> +static bool list_is_singular_iommu_domain(struct vfio_iommu *iommu)
> +{
> +	struct vfio_domain *domain;
> +	int domain_iommu = 0;
> +
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		if (!domain->vfio_iommu_api_only)
> +			domain_iommu++;
> +	}
> +	return (domain_iommu == 1);
> +}
> +
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
>  					  struct iommu_group *iommu_group)
>  {
> @@ -872,19 +1215,28 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
>  		list_for_each_entry(group, &domain->group_list, next) {
>  			if (group->iommu_group != iommu_group)
>  				continue;
> +			if (!domain->vfio_iommu_api_only)
> +				iommu_detach_group(domain->domain, iommu_group);
> +			else {
> +				struct vgpu_device *vgpu_dev = NULL;
>  
> -			iommu_detach_group(domain->domain, iommu_group);
> +				vgpu_dev = get_vgpu_device_from_group(iommu_group);
> +				if (vgpu_dev)
> +					vgpu_dev->iommu_data = NULL;
> +
> +			}
>  			list_del(&group->next);
>  			kfree(group);
>  			/*
>  			 * Group ownership provides privilege, if the group
>  			 * list is empty, the domain goes away.  If it's the
> -			 * last domain, then all the mappings go away too.
> +			 * last domain with iommu, then all the mappings go away too.
>  			 */
>  			if (list_empty(&domain->group_list)) {
> -				if (list_is_singular(&iommu->domain_list))
> +				if (list_is_singular_iommu_domain(iommu))
>  					vfio_iommu_unmap_unpin_all(iommu);
> -				iommu_domain_free(domain->domain);
> +				if (!domain->vfio_iommu_api_only)
> +					iommu_domain_free(domain->domain);
>  				list_del(&domain->next);
>  				kfree(domain);
>  			}
> @@ -936,11 +1288,22 @@ static void vfio_iommu_type1_release(void *iommu_data)
>  				 &iommu->domain_list, next) {
>  		list_for_each_entry_safe(group, group_tmp,
>  					 &domain->group_list, next) {
> -			iommu_detach_group(domain->domain, group->iommu_group);
> +			if (!domain->vfio_iommu_api_only)
> +				iommu_detach_group(domain->domain, group->iommu_group);
> +			else {
> +				struct vgpu_device *vgpu_dev = NULL;
> +
> +				vgpu_dev = get_vgpu_device_from_group(group->iommu_group);
> +				if (vgpu_dev)
> +					vgpu_dev->iommu_data = NULL;
> +
> +			}
> +
>  			list_del(&group->next);
>  			kfree(group);
>  		}
> -		iommu_domain_free(domain->domain);
> +		if (!domain->vfio_iommu_api_only)
> +			iommu_domain_free(domain->domain);
>  		list_del(&domain->next);
>  		kfree(domain);
>  	}

I'm really not convinced that pushing this into the type1 code is the
right approach vs pulling out shareable code chunks where it makes
sense and creating a separate iommu backend.  We're not getting
anything but code complexity out of this approach it seems.  Thanks,

Alex

> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 0ecae0b..d280868 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -127,6 +127,12 @@ static inline long vfio_spapr_iommu_eeh_ioctl(struct iommu_group *group,
>  }
>  #endif /* CONFIG_EEH */
>  
> +extern long vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
> +		           int prot, dma_addr_t *pfn_base);
> +
> +extern long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn, long npage,
> +			     int prot);
> +
>  /*
>   * IRQfd - generic
>   */
> diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
> index 03a77cf..cc18353 100644
> --- a/include/linux/vgpu.h
> +++ b/include/linux/vgpu.h
> @@ -36,6 +36,7 @@ struct vgpu_device {
>  	struct device		dev;
>  	struct gpu_device	*gpu_dev;
>  	struct iommu_group	*group;
> +	void			*iommu_data;
>  #define DEVICE_NAME_LEN		(64)
>  	char			dev_name[DEVICE_NAME_LEN];
>  	uuid_le			uuid;
> @@ -209,8 +210,7 @@ extern void vgpu_unregister_driver(struct vgpu_driver *drv);
>  
>  extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
>  				uint32_t len, uint32_t flags);
> -extern int vgpu_dma_do_translate(dma_addr_t * gfn_buffer, uint32_t count);
>  
> -struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
> +extern struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
>  
>  #endif /* VGPU_H */


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 2/3] VFIO driver for vGPU device
  2016-05-02 18:40   ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-03 22:43     ` Alex Williamson
  -1 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-03 22:43 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv

On Tue, 3 May 2016 00:10:40 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> VFIO driver registers with vGPU core driver. vGPU core driver creates vGPU
> device and calls probe routine of vGPU VFIO driver. This vGPU VFIO driver adds
> vGPU device to VFIO core module.
> Main aim of this module is to manage all VFIO APIs for each vGPU device.
> Those are:
> - get region information from GPU driver.
> - trap and emulate PCI config space and BAR region.
> - Send interrupt configuration information to GPU driver.
> - mmap mappable region with invalidate mapping and fault on access to remap pfn.
> 
> Thanks,
> Kirti.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I949a6b499d2e98d9c3352ae579535a608729b223
> ---
>  drivers/vgpu/Makefile    |    1 +
>  drivers/vgpu/vgpu_vfio.c |  671 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 672 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vgpu/vgpu_vfio.c
> 
> diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
> index f5be980..a0a2655 100644
> --- a/drivers/vgpu/Makefile
> +++ b/drivers/vgpu/Makefile
> @@ -2,3 +2,4 @@
>  vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
>  
>  obj-$(CONFIG_VGPU)			+= vgpu.o
> +obj-$(CONFIG_VGPU_VFIO)                 += vgpu_vfio.o

This is where we should add a new Kconfig entry for VGPU_VFIO, nothing
in patch 1 has any vfio dependency.  Perhaps it should also depend on
VFIO_PCI rather than VFIO since you are getting very PCI specific below.

> diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c
> new file mode 100644
> index 0000000..460a4dc
> --- /dev/null
> +++ b/drivers/vgpu/vgpu_vfio.c
> @@ -0,0 +1,671 @@
> +/*
> + * VGPU VFIO device
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/cdev.h>
> +#include <linux/sched.h>
> +#include <linux/wait.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/vgpu.h>
> +
> +#include "vgpu_private.h"
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "NVIDIA Corporation"
> +#define DRIVER_DESC     "VGPU VFIO Driver"
> +
> +#define VFIO_PCI_OFFSET_SHIFT   40
> +
> +#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> +#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)

Change the name of these from vfio-pci please or shift code around to
use them directly.  You're certainly free to redefine these, but using
the same name is confusing.

> +
> +struct vfio_vgpu_device {
> +	struct iommu_group *group;
> +	struct vgpu_device *vgpu_dev;
> +	int		    refcnt;
> +	struct pci_bar_info bar_info[VFIO_PCI_NUM_REGIONS];
> +	u8		    *vconfig;
> +};
> +
> +static DEFINE_MUTEX(vfio_vgpu_lock);
> +
> +static int get_virtual_bar_info(struct vgpu_device *vgpu_dev,
> +				struct pci_bar_info *bar_info,
> +				int index)
> +{
> +	int ret = -1;

Use a real errno.

> +	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
> +
> +	if (gpu_dev->ops->vgpu_bar_info)
> +		ret = gpu_dev->ops->vgpu_bar_info(vgpu_dev, index, bar_info);

vgpu_bar_info is already optional, further validating that the vgpu
core is not PCI specific.

> +	return ret;
> +}
> +
> +static int vdev_read_base(struct vfio_vgpu_device *vdev)
> +{
> +	int index, pos;
> +	u32 start_lo, start_hi;
> +	u32 mem_type;
> +
> +	pos = PCI_BASE_ADDRESS_0;
> +
> +	for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
> +
> +		if (!vdev->bar_info[index].size)
> +			continue;
> +
> +		start_lo = (*(u32 *)(vdev->vconfig + pos)) &
> +					PCI_BASE_ADDRESS_MEM_MASK;
> +		mem_type = (*(u32 *)(vdev->vconfig + pos)) &
> +					PCI_BASE_ADDRESS_MEM_TYPE_MASK;
> +
> +		switch (mem_type) {
> +		case PCI_BASE_ADDRESS_MEM_TYPE_64:
> +			start_hi = (*(u32 *)(vdev->vconfig + pos + 4));
> +			pos += 4;
> +			break;
> +		case PCI_BASE_ADDRESS_MEM_TYPE_32:
> +		case PCI_BASE_ADDRESS_MEM_TYPE_1M:
> +			/* 1M mem BAR treated as 32-bit BAR */
> +		default:
> +			/* mem unknown type treated as 32-bit BAR */
> +			start_hi = 0;
> +			break;
> +		}

Let's not neglect ioport BARs here, IO_MASK is different.
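
Untested, but roughly, before masking with PCI_BASE_ADDRESS_MEM_MASK:

		if ((*(u32 *)(vdev->vconfig + pos)) & PCI_BASE_ADDRESS_SPACE_IO) {
			vdev->bar_info[index].start =
				(*(u32 *)(vdev->vconfig + pos)) &
				PCI_BASE_ADDRESS_IO_MASK;
			pos += 4;
			continue;
		}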

> +		pos += 4;
> +		vdev->bar_info[index].start = ((u64)start_hi << 32) | start_lo;
> +	}
> +	return 0;
> +}
> +
> +static int vgpu_dev_open(void *device_data)
> +{
> +	int ret = 0;
> +	struct vfio_vgpu_device *vdev = device_data;
> +
> +	if (!try_module_get(THIS_MODULE))
> +		return -ENODEV;
> +
> +	mutex_lock(&vfio_vgpu_lock);
> +
> +	if (!vdev->refcnt) {
> +		u8 *vconfig;
> +		int vconfig_size, index;
> +
> +		for (index = 0; index < VFIO_PCI_NUM_REGIONS; index++) {

nit, region indexes are not all BARs.

> +			ret = get_virtual_bar_info(vdev->vgpu_dev,
> +						   &vdev->bar_info[index],
> +						   index);
> +			if (ret)
> +				goto open_error;
> +		}
> +		vconfig_size = vdev->bar_info[VFIO_PCI_CONFIG_REGION_INDEX].size;

nit, config space is not a BAR.

> +		if (!vconfig_size)
> +			goto open_error;
> +
> +		vconfig = kzalloc(vconfig_size, GFP_KERNEL);
> +		if (!vconfig) {
> +			ret = -ENOMEM;
> +			goto open_error;
> +		}
> +
> +		vdev->vconfig = vconfig;
> +	}
> +
> +	vdev->refcnt++;
> +open_error:
> +
> +	mutex_unlock(&vfio_vgpu_lock);
> +
> +	if (ret)
> +		module_put(THIS_MODULE);
> +
> +	return ret;
> +}
> +
> +static void vgpu_dev_close(void *device_data)
> +{
> +	struct vfio_vgpu_device *vdev = device_data;
> +
> +	mutex_lock(&vfio_vgpu_lock);
> +
> +	vdev->refcnt--;
> +	if (!vdev->refcnt) {
> +		memset(&vdev->bar_info, 0, sizeof(vdev->bar_info));

Why?

> +		if (vdev->vconfig)

How would we ever achieve that?

> +			kfree(vdev->vconfig);
> +	}
> +
> +	mutex_unlock(&vfio_vgpu_lock);
> +
> +	module_put(THIS_MODULE);
> +}
> +
> +static int vgpu_get_irq_count(struct vfio_vgpu_device *vdev, int irq_type)
> +{
> +	// Don't support MSIX for now
> +	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
> +		return -1;

How are we going to expand the API later for it?  Shouldn't this just
be a passthrough to a gpu_devices_ops.vgpu_vfio_get_irq_info callback?
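
ie. something like this, with a new (hypothetical) callback added to
gpu_device_ops:

static int vgpu_get_irq_count(struct vfio_vgpu_device *vdev, int irq_type)
{
	struct gpu_device *gpu_dev = vdev->vgpu_dev->gpu_dev;

	/* vgpu_vfio_get_irq_info doesn't exist yet, just sketching */
	if (gpu_dev->ops->vgpu_vfio_get_irq_info)
		return gpu_dev->ops->vgpu_vfio_get_irq_info(vdev->vgpu_dev,
							    irq_type);

	return -EINVAL;
}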

> +
> +	return 1;
> +}
> +
> +static long vgpu_dev_unlocked_ioctl(void *device_data,
> +		unsigned int cmd, unsigned long arg)
> +{
> +	int ret = 0;
> +	struct vfio_vgpu_device *vdev = device_data;
> +	unsigned long minsz;
> +
> +	switch (cmd)
> +	{
> +	case VFIO_DEVICE_GET_INFO:
> +	{
> +		struct vfio_device_info info;
> +		printk(KERN_INFO "%s VFIO_DEVICE_GET_INFO cmd index ", __FUNCTION__);
> +		minsz = offsetofend(struct vfio_device_info, num_irqs);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		info.flags = VFIO_DEVICE_FLAGS_PCI;
> +		info.num_regions = VFIO_PCI_NUM_REGIONS;
> +		info.num_irqs = VFIO_PCI_NUM_IRQS;
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +
> +	case VFIO_DEVICE_GET_REGION_INFO:
> +	{
> +		struct vfio_region_info info;
> +
> +		minsz = offsetofend(struct vfio_region_info, offset);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		printk(KERN_INFO "%s VFIO_DEVICE_GET_REGION_INFO cmd for region_index %d", __FUNCTION__, info.index);
> +		switch (info.index) {
> +		case VFIO_PCI_CONFIG_REGION_INDEX:
> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> +			info.size = vdev->bar_info[info.index].size;
> +			if (!info.size) {
> +				info.flags = 0;
> +				break;
> +			}
> +
> +			info.flags = vdev->bar_info[info.index].flags;

Ah, so bar_info.flags are vfio region info flags, that's not documented
anywhere in the API.

> +			break;
> +		case VFIO_PCI_VGA_REGION_INDEX:
> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> +			info.size = 0xc0000;
> +			info.flags = VFIO_REGION_INFO_FLAG_READ |
> +				     VFIO_REGION_INFO_FLAG_WRITE;
> +				break;

I think VGA support needs to be at the discretion of the vendor
driver.  There are certainly use cases that don't require VGA.

> +
> +		case VFIO_PCI_ROM_REGION_INDEX:

So should ROM support.  What's the assumption here, that QEMU will
provide a ROM, much like is required for SR-IOV VFs?

> +		default:
> +			return -EINVAL;
> +		}
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +
> +	}
> +	case VFIO_DEVICE_GET_IRQ_INFO:
> +	{
> +		struct vfio_irq_info info;
> +
> +		printk(KERN_INFO "%s VFIO_DEVICE_GET_IRQ_INFO cmd", __FUNCTION__);

Clearly lots of debug remaining in these functions.

> +		minsz = offsetofend(struct vfio_irq_info, count);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS)
> +			return -EINVAL;
> +
> +		switch (info.index) {
> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
> +		case VFIO_PCI_REQ_IRQ_INDEX:
> +			break;
> +			/* pass thru to return error */
> +		case VFIO_PCI_MSIX_IRQ_INDEX:

Lots of assumptions about what the vendor driver is going to support.

> +		default:
> +			return -EINVAL;
> +		}
> +
> +		info.count = VFIO_PCI_NUM_IRQS;
> +
> +		info.flags = VFIO_IRQ_INFO_EVENTFD;
> +		info.count = vgpu_get_irq_count(vdev, info.index);
> +
> +		if (info.count == -1)
> +			return -EINVAL;
> +
> +		if (info.index == VFIO_PCI_INTX_IRQ_INDEX)
> +			info.flags |= (VFIO_IRQ_INFO_MASKABLE |
> +					VFIO_IRQ_INFO_AUTOMASKED);
> +		else
> +			info.flags |= VFIO_IRQ_INFO_NORESIZE;
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +
> +	case VFIO_DEVICE_SET_IRQS:
> +	{
> +		struct vfio_irq_set hdr;
> +		struct gpu_device *gpu_dev = vdev->vgpu_dev->gpu_dev;
> +		u8 *data = NULL;
> +		int ret = 0;
> +		minsz = offsetofend(struct vfio_irq_set, count);
> +
> +		if (copy_from_user(&hdr, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (hdr.argsz < minsz || hdr.index >= VFIO_PCI_NUM_IRQS ||
> +		    hdr.flags & ~(VFIO_IRQ_SET_DATA_TYPE_MASK |
> +		    VFIO_IRQ_SET_ACTION_TYPE_MASK))
> +			return -EINVAL;
> +
> +		if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
> +			size_t size;
> +			int max = vgpu_get_irq_count(vdev, hdr.index);
> +
> +			if (hdr.flags & VFIO_IRQ_SET_DATA_BOOL)
> +				size = sizeof(uint8_t);
> +			else if (hdr.flags & VFIO_IRQ_SET_DATA_EVENTFD)
> +				size = sizeof(int32_t);
> +			else
> +				return -EINVAL;
> +
> +			if (hdr.argsz - minsz < hdr.count * size ||
> +			    hdr.start >= max || hdr.start + hdr.count > max)
> +				return -EINVAL;
> +
> +			data = memdup_user((void __user *)(arg + minsz),
> +						hdr.count * size);
> +				if (IS_ERR(data))
> +					return PTR_ERR(data);
> +
> +			}
> +
> +			if (gpu_dev->ops->vgpu_set_irqs) {
> +				ret = gpu_dev->ops->vgpu_set_irqs(vdev->vgpu_dev,
> +								  hdr.flags,
> +								  hdr.index, hdr.start,
> +								  hdr.count, data);
> +			}
> +			kfree(data);
> +			return ret;
> +		}
> +
> +		default:
> +			return -EINVAL;
> +	}
> +	return ret;
> +}
> +
> +ssize_t vgpu_dev_config_rw(struct vfio_vgpu_device *vdev, char __user *buf,
> +		size_t count, loff_t *ppos, bool iswrite)
> +{
> +	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
> +	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
> +	int cfg_size = vdev->bar_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
> +	int ret = 0;
> +	uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
> +
> +	if (pos < 0 || pos >= cfg_size ||
> +	    pos + count > cfg_size) {
> +		printk(KERN_ERR "%s pos 0x%llx out of range\n", __FUNCTION__, pos);
> +		ret = -EFAULT;
> +		goto config_rw_exit;
> +	}
> +
> +	if (iswrite) {
> +		char *user_data = kmalloc(count, GFP_KERNEL);
> +
> +		if (user_data == NULL) {
> +			ret = -ENOMEM;
> +			goto config_rw_exit;
> +		}
> +
> +		if (copy_from_user(user_data, buf, count)) {
> +			ret = -EFAULT;
> +			kfree(user_data);
> +			goto config_rw_exit;
> +		}

memdup_user()?

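i.e. collapse the alloc plus copy_from_user() into one call and
propagate its errno (untested):

        char *user_data = memdup_user(buf, count);

        if (IS_ERR(user_data)) {
                ret = PTR_ERR(user_data);
                goto config_rw_exit;
        }
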
> +
> +		if (gpu_dev->ops->write) {
> +			ret = gpu_dev->ops->write(vgpu_dev,
> +						  user_data,
> +						  count,
> +						  vgpu_emul_space_config,
> +						  pos);
> +		}
> +
> +		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);

So write is expected to filter user_data so that only the writable bits
get changed?  What's really being saved in the vconfig here vs the vendor
vgpu driver?  It seems like we're only using it to cache the BAR
values, but we're not providing the BAR emulation here, which seems
like one of the few things we could provide so it's not duplicated in
every vendor driver.  But then we only need a few u32s to do that, not
all of config space.

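FWIW, a very rough sketch of what I mean (it assumes a new u32 vbar[6]
field in vfio_vgpu_device, power-of-two sizes, and punts on 64-bit BARs
and sub-dword writes):

        if (count == 4 && !(pos & 3) &&
            pos >= PCI_BASE_ADDRESS_0 && pos <= PCI_BASE_ADDRESS_5) {
                int bar = (pos - PCI_BASE_ADDRESS_0) >> 2;
                u32 size = vdev->bar_info[bar].size;
                u32 val = *(u32 *)user_data;

                /* guest writes the address bits, the low read-only
                 * bits keep whatever we initialized them to */
                vdev->vbar[bar] = (val & ~(size - 1)) |
                                  (vdev->vbar[bar] & (size - 1));
        }
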
> +		kfree(user_data);
> +	}
> +	else
> +	{
> +		char *ret_data = kzalloc(count, GFP_KERNEL);
> +
> +		if (ret_data == NULL) {
> +			ret = -ENOMEM;
> +			goto config_rw_exit;
> +		}
> +
> +		if (gpu_dev->ops->read) {
> +			ret = gpu_dev->ops->read(vgpu_dev,
> +						 ret_data,
> +						 count,
> +						 vgpu_emul_space_config,
> +						 pos);
> +		}
> +
> +		if (ret > 0 ) {
> +			if (copy_to_user(buf, ret_data, ret)) {
> +				ret = -EFAULT;
> +				kfree(ret_data);
> +				goto config_rw_exit;
> +			}
> +
> +			memcpy((void *)(vdev->vconfig + pos), (void *)ret_data, count);
> +		}
> +		kfree(ret_data);
> +	}
> +config_rw_exit:
> +	return ret;
> +}
> +
> +ssize_t vgpu_dev_bar_rw(struct vfio_vgpu_device *vdev, char __user *buf,
> +		size_t count, loff_t *ppos, bool iswrite)
> +{
> +	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
> +	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
> +	loff_t offset = *ppos & VFIO_PCI_OFFSET_MASK;
> +	loff_t pos;
> +	int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	int ret = 0;
> +
> +	if (!vdev->bar_info[bar_index].start) {
> +		ret = vdev_read_base(vdev);
> +		if (ret)
> +			goto bar_rw_exit;
> +	}
> +
> +	if (offset >= vdev->bar_info[bar_index].size) {
> +		ret = -EINVAL;
> +		goto bar_rw_exit;
> +	}
> +
> +	pos = vdev->bar_info[bar_index].start + offset;
> +	if (iswrite) {
> +		char *user_data = kmalloc(count, GFP_KERNEL);
> +
> +		if (user_data == NULL) {
> +			ret = -ENOMEM;
> +			goto bar_rw_exit;
> +		}
> +
> +		if (copy_from_user(user_data, buf, count)) {
> +			ret = -EFAULT;
> +			kfree(user_data);
> +			goto bar_rw_exit;
> +		}

memdup_user() again.

> +
> +		if (gpu_dev->ops->write) {
> +			ret = gpu_dev->ops->write(vgpu_dev,
> +						  user_data,
> +						  count,
> +						  vgpu_emul_space_mmio,
> +						  pos);
> +		}

What's the usefulness in a vendor driver that doesn't provide
read/write?

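If there isn't one, these could simply be made mandatory at registration
time, e.g. in vgpu_register_device():

        if (!ops || !ops->read || !ops->write)
                return -EINVAL;
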
> +
> +		kfree(user_data);
> +	}
> +	else
> +	{
> +		char *ret_data = kmalloc(count, GFP_KERNEL);
> +
> +		if (ret_data == NULL) {
> +			ret = -ENOMEM;
> +			goto bar_rw_exit;
> +		}
> +
> +		memset(ret_data, 0, count);
> +
> +		if (gpu_dev->ops->read) {
> +			ret = gpu_dev->ops->read(vgpu_dev,
> +						 ret_data,
> +						 count,
> +						 vgpu_emul_space_mmio,
> +						 pos);
> +		}
> +
> +		if (ret > 0 ) {
> +			if (copy_to_user(buf, ret_data, ret)) {
> +				ret = -EFAULT;
> +			}
> +		}
> +		kfree(ret_data);
> +	}
> +
> +bar_rw_exit:
> +	return ret;

No freeing, no lock releasing, no cleanup, just return from the point
of error.

> +}
> +
> +
> +static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
> +		size_t count, loff_t *ppos, bool iswrite)
> +{
> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	struct vfio_vgpu_device *vdev = device_data;
> +
> +	if (index >= VFIO_PCI_NUM_REGIONS)
> +		return -EINVAL;
> +
> +	switch (index) {
> +	case VFIO_PCI_CONFIG_REGION_INDEX:
> +		return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
> +
> +	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> +		return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
> +
> +	case VFIO_PCI_ROM_REGION_INDEX:
> +	case VFIO_PCI_VGA_REGION_INDEX:

Wait a sec, who's doing the VGA emulation?  We can't be claiming to
support a VGA region and then fail to provide read/write access to it
like we said it has.

> +		break;
> +	}
> +
> +	return -EINVAL;
> +}
> +
> +
> +static ssize_t vgpu_dev_read(void *device_data, char __user *buf,
> +			     size_t count, loff_t *ppos)
> +{
> +	int ret = 0;
> +
> +	if (count)
> +		ret = vgpu_dev_rw(device_data, buf, count, ppos, false);
> +
> +	return ret;
> +}
> +
> +static ssize_t vgpu_dev_write(void *device_data, const char __user *buf,
> +			      size_t count, loff_t *ppos)
> +{
> +	int ret = 0;
> +
> +	if (count)
> +		ret = vgpu_dev_rw(device_data, (char *)buf, count, ppos, true);
> +
> +	return ret;
> +}
> +
> +static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	int ret = 0;
> +	struct vfio_vgpu_device *vdev = vma->vm_private_data;
> +	struct vgpu_device *vgpu_dev;
> +	struct gpu_device *gpu_dev;
> +	u64 virtaddr = (u64)vmf->virtual_address;
> +	u64 offset, phyaddr;
> +	unsigned long req_size, pgoff;
> +	pgprot_t pg_prot;
> +
> +	if (!vdev && !vdev->vgpu_dev)
> +		return -EINVAL;
> +
> +	vgpu_dev = vdev->vgpu_dev;
> +	gpu_dev  = vgpu_dev->gpu_dev;
> +
> +	offset   = vma->vm_pgoff << PAGE_SHIFT;
> +	phyaddr  = virtaddr - vma->vm_start + offset;
> +	pgoff    = phyaddr >> PAGE_SHIFT;
> +	req_size = vma->vm_end - virtaddr;
> +	pg_prot  = vma->vm_page_prot;
> +
> +	if (gpu_dev->ops->validate_map_request) {
> +		ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr, &pgoff,
> +							 &req_size, &pg_prot);
> +		if (ret)
> +			return ret;
> +
> +		if (!req_size)
> +			return -EINVAL;
> +	}
> +
> +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);

So not supporting validate_map_request() means that the user can
directly mmap BARs of the host GPU and as shown below, we assume a 1:1
mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
scenario or should this callback be required?  It's not clear to me how
the vendor driver determines what this maps to, do they compare it to
the physical device's own BAR addresses?

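If it should be required, the fault handler could refuse to fall through
to the 1:1 mapping, e.g.:

        if (!gpu_dev->ops->validate_map_request)
                return VM_FAULT_SIGBUS;
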
> +
> +	return ret | VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct vgpu_dev_mmio_ops = {
> +	.fault = vgpu_dev_mmio_fault,
> +};
> +
> +
> +static int vgpu_dev_mmap(void *device_data, struct vm_area_struct *vma)
> +{
> +	unsigned int index;
> +	struct vfio_vgpu_device *vdev = device_data;
> +	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
> +	struct pci_dev *pdev = vgpu_dev->gpu_dev->dev;
> +	unsigned long pgoff;
> +
> +	loff_t offset = vma->vm_pgoff << PAGE_SHIFT;
> +
> +	index = VFIO_PCI_OFFSET_TO_INDEX(offset);
> +
> +	if (index >= VFIO_PCI_ROM_REGION_INDEX)
> +		return -EINVAL;

ioport BARs?

> +
> +	pgoff = vma->vm_pgoff &
> +		((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> +
> +	vma->vm_pgoff = (pci_resource_start(pdev, index) >> PAGE_SHIFT) + pgoff;
> +
> +	vma->vm_private_data = vdev;
> +	vma->vm_ops = &vgpu_dev_mmio_ops;
> +
> +	return 0;
> +}
> +
> +static const struct vfio_device_ops vgpu_vfio_dev_ops = {
> +	.name		= "vfio-vgpu",

Should all of this be vfio-pci-vgpu?  We've certainly gotten PCI
specific here.

> +	.open		= vgpu_dev_open,
> +	.release	= vgpu_dev_close,
> +	.ioctl		= vgpu_dev_unlocked_ioctl,
> +	.read		= vgpu_dev_read,
> +	.write		= vgpu_dev_write,
> +	.mmap		= vgpu_dev_mmap,
> +};
> +
> +int vgpu_vfio_probe(struct device *dev)
> +{
> +	struct vfio_vgpu_device *vdev;
> +	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
> +	int ret = 0;
> +
> +	if (vgpu_dev == NULL)
> +		return -EINVAL;
> +
> +	vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
> +	if (!vdev) {
> +		return -ENOMEM;
> +	}
> +
> +	vdev->vgpu_dev = vgpu_dev;
> +	vdev->group = vgpu_dev->group;
> +
> +	ret = vfio_add_group_dev(dev, &vgpu_vfio_dev_ops, vdev);
> +	if (ret)
> +		kfree(vdev);
> +
> +	printk(KERN_INFO "%s ret = %d\n", __FUNCTION__, ret);
> +	return ret;
> +}
> +
> +void vgpu_vfio_remove(struct device *dev)
> +{
> +	struct vfio_vgpu_device *vdev;
> +
> +	printk(KERN_INFO "%s \n", __FUNCTION__);
> +	vdev = vfio_del_group_dev(dev);
> +	if (vdev) {
> +		printk(KERN_INFO "%s vdev being freed\n", __FUNCTION__);
> +		kfree(vdev);
> +	}
> +}
> +
> +struct vgpu_driver vgpu_vfio_driver = {
> +        .name	= "vgpu-vfio",
> +        .probe	= vgpu_vfio_probe,
> +        .remove	= vgpu_vfio_remove,
> +};
> +
> +static int __init vgpu_vfio_init(void)
> +{
> +	printk(KERN_INFO "%s \n", __FUNCTION__);
> +	return vgpu_register_driver(&vgpu_vfio_driver, THIS_MODULE);
> +}
> +
> +static void __exit vgpu_vfio_exit(void)
> +{
> +	printk(KERN_INFO "%s \n", __FUNCTION__);
> +	vgpu_unregister_driver(&vgpu_vfio_driver);
> +}
> +
> +module_init(vgpu_vfio_init)
> +module_exit(vgpu_vfio_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-02 18:40   ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-03 22:43     ` Alex Williamson
  -1 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-03 22:43 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv

On Tue, 3 May 2016 00:10:39 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Design for vGPU Driver:
> Main purpose of vGPU driver is to provide a common interface for vGPU
> management that can be used by different GPU drivers.
> 
> This module would provide a generic interface to create the device, add
> it to vGPU bus, add device to IOMMU group and then add it to vfio group.
> 
> High Level block diagram:
> 
> +--------------+    vgpu_register_driver()+---------------+
> |     __init() +------------------------->+               |
> |              |                          |               |
> |              +<-------------------------+    vgpu.ko    |
> | vgpu_vfio.ko |   probe()/remove()       |               |
> |              |                +---------+               +---------+
> +--------------+                |         +-------+-------+         |
>                                 |                 ^                 |
>                                 | callback        |                 |
>                                 |         +-------+--------+        |
>                                 |         |vgpu_register_device()   |
>                                 |         |                |        |
>                                 +---^-----+-----+    +-----+------+-+
>                                     | nvidia.ko |    |  i915.ko   |
>                                     |           |    |            |
>                                     +-----------+    +------------+
> 
> vGPU driver provides two types of registration interfaces:
> 1. Registration interface for vGPU bus driver:
> 
> /**
>   * struct vgpu_driver - vGPU device driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove: called when device removed
>   * @driver: device driver structure
>   *
>   **/
> struct vgpu_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
> void vgpu_unregister_driver(struct vgpu_driver *drv);
> 
> VFIO bus driver for vgpu should use this interface to register with
> vGPU driver. With this, VFIO bus driver for vGPU devices is responsible
> to add vGPU device to VFIO group.
> 
> 2. GPU driver interface
> GPU driver interface provides GPU driver the set APIs to manage GPU driver
> related work in their own driver. APIs are to:
> - vgpu_supported_config: provide supported configuration list by the GPU.
> - vgpu_create: to allocate basic resources in GPU driver for a vGPU device.
> - vgpu_destroy: to free resources in GPU driver during vGPU device destroy.
> - vgpu_start: to initiate vGPU initialization process from GPU driver when VM
>   boots and before QEMU starts.
> - vgpu_shutdown: to teardown vGPU resources during VM teardown.
> - read : read emulation callback.
> - write: write emulation callback.
> - vgpu_set_irqs: send interrupt configuration information that QEMU sets.
> - vgpu_bar_info: to provide BAR size and its flags for the vGPU device.
> - validate_map_request: to validate remap pfn request.
> 
> This registration interface should be used by GPU drivers to register
> each physical device to vGPU driver.
> 
> Updated this patch with couple of more functions in GPU driver interface
> which were discussed during v1 version of this RFC.
> 
> Thanks,
> Kirti.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I1c13c411f61b7b2e750e85adfe1b097f9fd218b9
> ---
>  drivers/Kconfig             |    2 +
>  drivers/Makefile            |    1 +
>  drivers/vgpu/Kconfig        |   21 ++
>  drivers/vgpu/Makefile       |    4 +
>  drivers/vgpu/vgpu-core.c    |  424 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/vgpu/vgpu-driver.c  |  136 ++++++++++++++
>  drivers/vgpu/vgpu-sysfs.c   |  365 +++++++++++++++++++++++++++++++++++++
>  drivers/vgpu/vgpu_private.h |   36 ++++
>  include/linux/vgpu.h        |  216 ++++++++++++++++++++++
>  9 files changed, 1205 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vgpu/Kconfig
>  create mode 100644 drivers/vgpu/Makefile
>  create mode 100644 drivers/vgpu/vgpu-core.c
>  create mode 100644 drivers/vgpu/vgpu-driver.c
>  create mode 100644 drivers/vgpu/vgpu-sysfs.c
>  create mode 100644 drivers/vgpu/vgpu_private.h
>  create mode 100644 include/linux/vgpu.h
> 
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index d2ac339..5fd9eae 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
>  
>  source "drivers/vfio/Kconfig"
>  
> +source "drivers/vgpu/Kconfig"
> +
>  source "drivers/vlynq/Kconfig"
>  
>  source "drivers/virt/Kconfig"
> diff --git a/drivers/Makefile b/drivers/Makefile
> index 8f5d076..36f1110 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)		+= message/
>  obj-y				+= firewire/
>  obj-$(CONFIG_UIO)		+= uio/
>  obj-$(CONFIG_VFIO)		+= vfio/
> +obj-$(CONFIG_VFIO)		+= vgpu/
>  obj-y				+= cdrom/
>  obj-y				+= auxdisplay/
>  obj-$(CONFIG_PCCARD)		+= pcmcia/
> diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
> new file mode 100644
> index 0000000..792eb48
> --- /dev/null
> +++ b/drivers/vgpu/Kconfig
> @@ -0,0 +1,21 @@
> +
> +menuconfig VGPU
> +    tristate "VGPU driver framework"
> +    depends on VFIO
> +    select VGPU_VFIO
> +    help
> +        VGPU provides a framework to virtualize GPU without SR-IOV cap
> +        See Documentation/vgpu.txt for more details.
> +
> +        If you don't know what do here, say N.
> +
> +config VGPU
> +    tristate
> +    depends on VFIO
> +    default n
> +
> +config VGPU_VFIO
> +    tristate
> +    depends on VGPU
> +    default n
> +

This is a little bit convoluted; everything added in this patch seems to
be vfio agnostic, and it doesn't necessarily care what the consumer
is.  That makes me think we should only be adding CONFIG_VGPU here and
it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
The middle config entry is also redundant to the first, just move the
default line up to the first and remove the rest.

> diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
> new file mode 100644
> index 0000000..f5be980
> --- /dev/null
> +++ b/drivers/vgpu/Makefile
> @@ -0,0 +1,4 @@
> +
> +vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
> +
> +obj-$(CONFIG_VGPU)			+= vgpu.o
> diff --git a/drivers/vgpu/vgpu-core.c b/drivers/vgpu/vgpu-core.c
> new file mode 100644
> index 0000000..1a7d274
> --- /dev/null
> +++ b/drivers/vgpu/vgpu-core.c
> @@ -0,0 +1,424 @@
> +/*
> + * VGPU Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/cdev.h>
> +#include <linux/sched.h>
> +#include <linux/wait.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/vgpu.h>
> +
> +#include "vgpu_private.h"
> +
> +#define DRIVER_VERSION	"0.1"
> +#define DRIVER_AUTHOR	"NVIDIA Corporation"
> +#define DRIVER_DESC	"VGPU Core Driver"
> +
> +/*
> + * #defines
> + */
> +
> +#define VGPU_CLASS_NAME		"vgpu"
> +
> +/*
> + * Global Structures
> + */
> +
> +static struct vgpu {
> +	struct list_head    vgpu_devices_list;
> +	struct mutex        vgpu_devices_lock;
> +	struct list_head    gpu_devices_list;
> +	struct mutex        gpu_devices_lock;
> +} vgpu;
> +
> +static struct class vgpu_class;
> +
> +/*
> + * Functions
> + */
> +
> +struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group)
> +{
> +	struct vgpu_device *vdev = NULL;
> +
> +	mutex_lock(&vgpu.vgpu_devices_lock);
> +	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
> +		if (vdev->group) {
> +			if (iommu_group_id(vdev->group) == iommu_group_id(group)) {
> +				mutex_unlock(&vgpu.vgpu_devices_lock);
> +				return vdev;
> +			}
> +		}
> +	}
> +	mutex_unlock(&vgpu.vgpu_devices_lock);
> +	return NULL;
> +}
> +
> +EXPORT_SYMBOL_GPL(get_vgpu_device_from_group);
> +
> +static int vgpu_add_attribute_group(struct device *dev,
> +			            const struct attribute_group **groups)
> +{
> +        return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void vgpu_remove_attribute_group(struct device *dev,
> +			                const struct attribute_group **groups)
> +{
> +        sysfs_remove_groups(&dev->kobj, groups);
> +}
> +
> +int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)

To make the API abundantly clear, how about vgpu_register_gpu_device()
to avoid confusion with a vgpu device?

Why do we care that it's a pci_dev?  It seems like there's only a very
small portion of the API that cares about pci_devs in order to describe
BARs, which could be switched based on the device type.  Otherwise we
could operate on a struct device here.

> +{
> +	int ret = 0;
> +	struct gpu_device *gpu_dev, *tmp;
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +        gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL);
> +        if (!gpu_dev)
> +                return -ENOMEM;
> +
> +	gpu_dev->dev = dev;
> +        gpu_dev->ops = ops;
> +
> +        mutex_lock(&vgpu.gpu_devices_lock);
> +
> +        /* Check for duplicates */
> +        list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) {
> +                if (tmp->dev == dev) {
> +			ret = -EINVAL;

Maybe -EEXIST here to get a different error value.

> +			goto add_error;
> +                }
> +        }
> +
> +	ret = vgpu_create_pci_device_files(dev);

I don't actually see anything pci specific in that function.

> +	if (ret)
> +		goto add_error;
> +
> +	ret = vgpu_add_attribute_group(&dev->dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;
> +
> +        list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list);

Whitespace issues, please run scripts/checkpatch.pl on patches before
posting.

> +
> +	printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n",
> +			 dev->vendor, dev->device, dev->class);

This is a place where we're using pci_dev specific fields, but it's not
very useful.  We're registering a specific device, not everything that
matches this set of vendor/device/class, so what is the user supposed
to learn from this?  A dev_info here would give us the name of the
specific device we're registering and be device type agnostic.

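i.e. something along the lines of:

        dev_info(&dev->dev, "VGPU: registered GPU device\n");
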
> +        mutex_unlock(&vgpu.gpu_devices_lock);
> +
> +        return 0;
> +
> +add_group_error:
> +	vgpu_remove_pci_device_files(dev);
> +add_error:
> +	mutex_unlock(&vgpu.gpu_devices_lock);
> +	kfree(gpu_dev);
> +	return ret;
> +
> +}
> +EXPORT_SYMBOL(vgpu_register_device);
> +
> +void vgpu_unregister_device(struct pci_dev *dev)
> +{
> +        struct gpu_device *gpu_dev;
> +
> +        mutex_lock(&vgpu.gpu_devices_lock);
> +        list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
> +		struct vgpu_device *vdev = NULL;
> +
> +                if (gpu_dev->dev != dev)
> +			continue;
> +
> +		printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, class 0x%x\n",
> +				dev->vendor, dev->device, dev->class);

Same comments as above for function name, device type, and this print.

> +
> +		list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {

How can we walk this list without holding vgpu_devices_lock?

> +			if (vdev->gpu_dev != gpu_dev)
> +				continue;
> +			destroy_vgpu_device(vdev);
> +		}
> +		vgpu_remove_attribute_group(&dev->dev, gpu_dev->ops->dev_attr_groups);
> +		vgpu_remove_pci_device_files(dev);
> +		list_del(&gpu_dev->gpu_next);
> +		mutex_unlock(&vgpu.gpu_devices_lock);

It's often desirable to avoid multiple exit points, especially when
locking is involved, to simplify the code flow.  It would be very easy
to accomplish that here.

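e.g. (hand-waving where the child vgpu and sysfs teardown lands, but the
locking reads much more easily):

void vgpu_unregister_device(struct pci_dev *dev)
{
        struct gpu_device *gpu_dev, *found = NULL;

        mutex_lock(&vgpu.gpu_devices_lock);
        list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
                if (gpu_dev->dev == dev) {
                        list_del(&gpu_dev->gpu_next);
                        found = gpu_dev;
                        break;
                }
        }
        mutex_unlock(&vgpu.gpu_devices_lock);

        if (!found)
                return;

        /* destroy child vgpu devices, remove sysfs groups/files here */
        kfree(found);
}
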
> +		kfree(gpu_dev);
> +		return;
> +        }
> +        mutex_unlock(&vgpu.gpu_devices_lock);
> +}
> +EXPORT_SYMBOL(vgpu_unregister_device);
> +
> +/*
> + * Helper Functions
> + */
> +
> +static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char *name)
> +{
> +	struct vgpu_device *vgpu_dev = NULL;
> +
> +	vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL);
> +	if (!vgpu_dev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	kref_init(&vgpu_dev->kref);
> +	memcpy(&vgpu_dev->uuid, &uuid, sizeof(uuid_le));
> +	vgpu_dev->vgpu_instance = instance;
> +	strcpy(vgpu_dev->dev_name, name);
> +
> +	mutex_lock(&vgpu.vgpu_devices_lock);
> +	list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list);
> +	mutex_unlock(&vgpu.vgpu_devices_lock);
> +
> +	return vgpu_dev;
> +}
> +
> +static void vgpu_device_free(struct vgpu_device *vgpu_dev)
> +{
> +	if (vgpu_dev) {
> +		mutex_lock(&vgpu.vgpu_devices_lock);
> +		list_del(&vgpu_dev->list);
> +		mutex_unlock(&vgpu.vgpu_devices_lock);
> +		kfree(vgpu_dev);
> +	}

Why aren't we using the kref to remove and free the vgpu when the last
reference is released?

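e.g. a release callback like this, with everyone dropping their
reference through kref_put() rather than calling vgpu_device_free()
directly (sketch only):

static void vgpu_device_last_put(struct kref *kref)
{
        struct vgpu_device *vgpu_dev =
                        container_of(kref, struct vgpu_device, kref);

        /* caller holds vgpu.vgpu_devices_lock */
        list_del(&vgpu_dev->list);
        kfree(vgpu_dev);
}

Then vgpu_drv_get_vgpu_device() below can take a kref_get() before
dropping the lock.
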
> +	return;

Unnecessary

> +}
> +
> +struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance)
> +{
> +	struct vgpu_device *vdev = NULL;
> +
> +	mutex_lock(&vgpu.vgpu_devices_lock);
> +	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
> +		if ((uuid_le_cmp(vdev->uuid, uuid) == 0) &&
> +		    (vdev->vgpu_instance == instance)) {
> +			mutex_unlock(&vgpu.vgpu_devices_lock);
> +			return vdev;

We're not taking any sort of reference to the vgpu, so what prevents
races with it being removed?  A common exit path would be easy to
achieve here too.

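With a kref as above, roughly:

        struct vgpu_device *vdev, *found = NULL;

        mutex_lock(&vgpu.vgpu_devices_lock);
        list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
                if (!uuid_le_cmp(vdev->uuid, uuid) &&
                    vdev->vgpu_instance == instance) {
                        kref_get(&vdev->kref);
                        found = vdev;
                        break;
                }
        }
        mutex_unlock(&vgpu.vgpu_devices_lock);

        return found;

and the caller drops the reference with kref_put() when it's done with
the device.
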
> +		}
> +	}
> +	mutex_unlock(&vgpu.vgpu_devices_lock);
> +	return NULL;
> +}
> +
> +static void vgpu_device_release(struct device *dev)
> +{
> +	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
> +	vgpu_device_free(vgpu_dev);
> +}
> +
> +int create_vgpu_device(struct pci_dev *pdev, uuid_le uuid, uint32_t instance, char *vgpu_params)
> +{

I'm not seeing anything here that really cares if the host gpu is a
struct device vs pci_dev either.

> +	char name[64];
> +	int numChar = 0;
> +	int retval = 0;
> +	struct vgpu_device *vgpu_dev = NULL;
> +	struct gpu_device *gpu_dev;
> +
> +	printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__);

pr_info() would be preferred, but this seems like leftover debug and
should be removed.

> +
> +	numChar = sprintf(name, "%pUb-%d", uuid.b, instance);

Use snprintf even though this shouldn't be able to overflow.

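i.e. (and snprintf() always nul-terminates, so the explicit terminator
below can go away too):

        numChar = snprintf(name, sizeof(name), "%pUb-%d", uuid.b, instance);
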
> +	name[numChar] = '\0';
> +
> +	vgpu_dev = vgpu_device_alloc(uuid, instance, name);
> +	if (IS_ERR(vgpu_dev)) {
> +		return PTR_ERR(vgpu_dev);
> +	}
> +
> +	vgpu_dev->dev.parent  = &pdev->dev;
> +	vgpu_dev->dev.bus     = &vgpu_bus_type;
> +	vgpu_dev->dev.release = vgpu_device_release;
> +	dev_set_name(&vgpu_dev->dev, "%s", name);
> +
> +	retval = device_register(&vgpu_dev->dev);
> +	if (retval)
> +		goto create_failed1;
> +
> +	printk(KERN_INFO "UUID %pUb \n", vgpu_dev->uuid.b);

This also looks like debug.

> +
> +	mutex_lock(&vgpu.gpu_devices_lock);
> +	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
> +		if (gpu_dev->dev != pdev)
> +			continue;
> +
> +		vgpu_dev->gpu_dev = gpu_dev;
> +		if (gpu_dev->ops->vgpu_create) {
> +			retval = gpu_dev->ops->vgpu_create(pdev, vgpu_dev->uuid,
> +							   instance, vgpu_params);
> +			if (retval) {
> +				mutex_unlock(&vgpu.gpu_devices_lock);
> +				goto create_failed2;
> +			}
> +		}
> +		break;
> +	}
> +	if (!vgpu_dev->gpu_dev) {
> +		retval = -EINVAL;
> +		mutex_unlock(&vgpu.gpu_devices_lock);
> +		goto create_failed2;
> +	}
> +
> +	mutex_unlock(&vgpu.gpu_devices_lock);
> +
> +	retval = vgpu_add_attribute_group(&vgpu_dev->dev, gpu_dev->ops->vgpu_attr_groups);
> +	if (retval)
> +		goto create_attr_error;
> +
> +	return retval;
> +
> +create_attr_error:
> +	if (gpu_dev->ops->vgpu_destroy) {
> +		int ret = 0;
> +		ret = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
> +						 vgpu_dev->uuid,
> +						 vgpu_dev->vgpu_instance);

Unnecessary initialization and we don't do anything with the result.
The comment below says that the lack of vgpu_destroy means the vendor
doesn't support unplug, but doesn't that break our error cleanup path here?

> +	}
> +
> +create_failed2:
> +	device_unregister(&vgpu_dev->dev);
> +
> +create_failed1:
> +	vgpu_device_free(vgpu_dev);
> +
> +	return retval;
> +}
> +
> +void destroy_vgpu_device(struct vgpu_device *vgpu_dev)
> +{
> +	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
> +
> +	printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name);

dev_info()

> +	if (gpu_dev->ops->vgpu_destroy) {
> +		int retval = 0;

Unnecessary initialization, in fact this entire variable is unnecessary.

> +		retval = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
> +						    vgpu_dev->uuid,
> +						    vgpu_dev->vgpu_instance);
> +	/* if vendor driver doesn't return success that means vendor driver doesn't
> +	 * support hot-unplug */
> +		if (retval)
> +			return;

Should we return an error code then?  Inconsistent comment style.

> +	}
> +
> +	vgpu_remove_attribute_group(&vgpu_dev->dev, gpu_dev->ops->vgpu_attr_groups);
> +	device_unregister(&vgpu_dev->dev);
> +}
> +
> +void get_vgpu_supported_types(struct device *dev, char *str)
> +{
> +	struct gpu_device *gpu_dev;
> +
> +	mutex_lock(&vgpu.gpu_devices_lock);
> +	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
> +		if (&gpu_dev->dev->dev == dev) {
> +			if (gpu_dev->ops->vgpu_supported_config)
> +				gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&vgpu.gpu_devices_lock);
> +}
> +
> +int vgpu_start_callback(struct vgpu_device *vgpu_dev)
> +{
> +	int ret = 0;
> +	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
> +
> +	mutex_lock(&vgpu.gpu_devices_lock);
> +	if (gpu_dev->ops->vgpu_start)
> +		ret = gpu_dev->ops->vgpu_start(vgpu_dev->uuid);
> +	mutex_unlock(&vgpu.gpu_devices_lock);
> +	return ret;
> +}
> +
> +int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev)
> +{
> +	int ret = 0;
> +	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
> +
> +	mutex_lock(&vgpu.gpu_devices_lock);
> +	if (gpu_dev->ops->vgpu_shutdown)
> +		ret = gpu_dev->ops->vgpu_shutdown(vgpu_dev->uuid);
> +	mutex_unlock(&vgpu.gpu_devices_lock);
> +	return ret;
> +}
> +
> +char *vgpu_devnode(struct device *dev, umode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev));
> +}
> +
> +static void release_vgpubus_dev(struct device *dev)
> +{
> +	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
> +	destroy_vgpu_device(vgpu_dev);
> +}
> +
> +static struct class vgpu_class = {
> +	.name		= VGPU_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= vgpu_class_attrs,
> +	.dev_groups	= vgpu_dev_groups,
> +	.devnode	= vgpu_devnode,
> +	.dev_release    = release_vgpubus_dev,
> +};
> +
> +static int __init vgpu_init(void)
> +{
> +	int rc = 0;
> +
> +	memset(&vgpu, 0 , sizeof(vgpu));

Unnecessary, this is declared in the bss and zero initialized.

> +
> +	mutex_init(&vgpu.vgpu_devices_lock);
> +	INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
> +	mutex_init(&vgpu.gpu_devices_lock);
> +	INIT_LIST_HEAD(&vgpu.gpu_devices_list);
> +
> +	rc = class_register(&vgpu_class);
> +	if (rc < 0) {
> +		printk(KERN_ERR "Error: failed to register vgpu class\n");

pr_err()

> +		goto failed1;
> +	}
> +
> +	rc = vgpu_bus_register();
> +	if (rc < 0) {
> +		printk(KERN_ERR "Error: failed to register vgpu bus\n");
> +		class_unregister(&vgpu_class);
> +	}
> +
> +    request_module_nowait("vgpu_vfio");
> +
> +failed1:
> +	return rc;

While common exit points are good, if there's no cleanup and no
locking, why do we need failed1?

> +}
> +
> +static void __exit vgpu_exit(void)
> +{
> +	vgpu_bus_unregister();
> +	class_unregister(&vgpu_class);
> +}
> +
> +module_init(vgpu_init)
> +module_exit(vgpu_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vgpu/vgpu-driver.c b/drivers/vgpu/vgpu-driver.c
> new file mode 100644
> index 0000000..c4c2e9f
> --- /dev/null
> +++ b/drivers/vgpu/vgpu-driver.c
> @@ -0,0 +1,136 @@
> +/*
> + * VGPU driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>

I don't see any vfio here, or fs or sysfs or ctype.

> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/vgpu.h>
> +
> +#include "vgpu_private.h"
> +
> +static int vgpu_device_attach_iommu(struct vgpu_device *vgpu_dev)
> +{
> +        int retval = 0;
> +        struct iommu_group *group = NULL;
> +
> +        group = iommu_group_alloc();
> +        if (IS_ERR(group)) {
> +                printk(KERN_ERR "VGPU: failed to allocate group!\n");
> +                return PTR_ERR(group);
> +        }
> +
> +        retval = iommu_group_add_device(group, &vgpu_dev->dev);
> +        if (retval) {
> +                printk(KERN_ERR "VGPU: failed to add dev to group!\n");

dev_err()

> +                iommu_group_put(group);

The iommu group should be put regardless of error; the device holds a
reference to the group to keep it around.

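i.e. roughly:

        retval = iommu_group_add_device(group, &vgpu_dev->dev);
        if (!retval)
                vgpu_dev->group = group;

        /* the device holds its own group reference now (or the add failed) */
        iommu_group_put(group);

        return retval;
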
> +                return retval;
> +        }
> +
> +        vgpu_dev->group = group;
> +
> +        printk(KERN_INFO "VGPU: group_id = %d \n", iommu_group_id(group));
> +        return retval;
> +}
> +
> +static void vgpu_device_detach_iommu(struct vgpu_device *vgpu_dev)
> +{
> +        iommu_group_put(vgpu_dev->dev.iommu_group);
> +        iommu_group_remove_device(&vgpu_dev->dev);

Only the iommu_group_remove_device() should be needed here; the group
reference should have been released above, otherwise we're double
incrementing and double decrementing.

> +        printk(KERN_INFO "VGPU: detaching iommu \n");

debug.

> +}
> +
> +static int vgpu_device_probe(struct device *dev)
> +{
> +	struct vgpu_driver *drv = to_vgpu_driver(dev->driver);
> +	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
> +	int status = 0;
> +
> +	status = vgpu_device_attach_iommu(vgpu_dev);
> +	if (status) {
> +		printk(KERN_ERR "Failed to attach IOMMU\n");
> +		return status;
> +	}
> +
> +	if (drv && drv->probe) {
> +		status = drv->probe(dev);
> +	}
> +
> +	return status;
> +}
> +
> +static int vgpu_device_remove(struct device *dev)
> +{
> +	struct vgpu_driver *drv = to_vgpu_driver(dev->driver);
> +	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
> +	int status = 0;
> +
> +	if (drv && drv->remove) {
> +		drv->remove(dev);
> +	}
> +
> +	vgpu_device_detach_iommu(vgpu_dev);
> +
> +	return status;

return 0;  Or make this void.  .remove functions often return void, for
better or worse.

> +}
> +
> +struct bus_type vgpu_bus_type = {
> +	.name		= "vgpu",
> +	.probe		= vgpu_device_probe,
> +	.remove		= vgpu_device_remove,
> +};
> +EXPORT_SYMBOL_GPL(vgpu_bus_type);
> +
> +/**
> + * vgpu_register_driver - register a new vGPU driver
> + * @drv: the driver to register
> + * @owner: owner module of driver ro register
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int vgpu_register_driver(struct vgpu_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &vgpu_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(vgpu_register_driver);
> +
> +/**
> + * vgpu_unregister_driver - unregister vGPU driver
> + * @drv: the driver to unregister
> + *
> + */
> +void vgpu_unregister_driver(struct vgpu_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(vgpu_unregister_driver);
> +
> +int vgpu_bus_register(void)
> +{
> +	return bus_register(&vgpu_bus_type);
> +}
> +
> +void vgpu_bus_unregister(void)
> +{
> +	bus_unregister(&vgpu_bus_type);
> +}
> diff --git a/drivers/vgpu/vgpu-sysfs.c b/drivers/vgpu/vgpu-sysfs.c
> new file mode 100644
> index 0000000..b740f9a
> --- /dev/null
> +++ b/drivers/vgpu/vgpu-sysfs.c
> @@ -0,0 +1,365 @@
> +/*
> + * File attributes for vGPU devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/fs.h>
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>

No vfio, fs, or ctype here either

> +#include <linux/vgpu.h>
> +
> +#include "vgpu_private.h"
> +
> +/* Prototypes */
> +
> +static ssize_t vgpu_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(vgpu_supported_types);
> +
> +static ssize_t vgpu_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(vgpu_create);
> +
> +static ssize_t vgpu_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(vgpu_destroy);
> +
> +
> +/* Static functions */
> +
> +static bool is_uuid_sep(char sep)
> +{
> +	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
> +		return true;
> +	return false;
> +}
> +
> +
> +static int uuid_parse(const char *str, uuid_le *uuid)
> +{
> +	int i;
> +
> +	if (strlen(str) < 36)
> +		return -1;
> +
> +	for (i = 0; i < 16; i++) {
> +		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
> +			printk(KERN_ERR "%s err", __FUNCTION__);
> +			return -EINVAL;
> +		}
> +
> +		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
> +		str += 2;
> +		if (is_uuid_sep(*str))
> +			str++;
> +	}
> +
> +	return 0;
> +}
> +
> +
> +/* Functions */
> +static ssize_t vgpu_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str;
> +	ssize_t n;
> +
> +        str = kzalloc(sizeof(*str) * 512, GFP_KERNEL);

Arbitrary size limit?  Do we even need a separate buffer?
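
For example, the callback could format straight into the page that
sysfs hands us (this assumes get_vgpu_supported_types() is changed to
take the buffer size and return the number of bytes written):

static ssize_t vgpu_supported_types_show(struct device *dev,
                                         struct device_attribute *attr,
                                         char *buf)
{
        /* sysfs gives us a full page, no temporary buffer needed */
        return get_vgpu_supported_types(dev, buf, PAGE_SIZE);
}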

> +        if (!str)
> +                return -ENOMEM;
> +
> +	get_vgpu_supported_types(dev, str);
> +
> +	n = sprintf(buf,"%s\n", str);
> +	kfree(str);
> +
> +	return n;
> +}
> +
> +static ssize_t vgpu_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *instance_str, *vgpu_params = NULL;
> +	uuid_le uuid;
> +	uint32_t instance;
> +	struct pci_dev *pdev;
> +	int ret = 0;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	if ((uuid_str = strsep(&str, ":")) == NULL) {
> +		printk(KERN_ERR "%s Empty UUID or string %s \n",
> +				 __FUNCTION__, buf);

pr_err() for all these.

> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		printk(KERN_ERR "%s vgpu instance not specified %s \n",
> +				 __FUNCTION__, buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if ((instance_str = strsep(&str, ":")) == NULL) {
> +		printk(KERN_ERR "%s Empty instance or string %s \n",
> +				 __FUNCTION__, buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	instance = (unsigned int)simple_strtoul(instance_str, NULL, 0);
> +
> +	if (!str) {
> +		printk(KERN_ERR "%s vgpu params not specified %s \n",
> +				 __FUNCTION__, buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	vgpu_params = kstrdup(str, GFP_KERNEL);
> +
> +	if (!vgpu_params) {
> +		printk(KERN_ERR "%s vgpu params allocation failed \n",
> +				 __FUNCTION__);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (dev_is_pci(dev)) {
> +		pdev = to_pci_dev(dev);
> +
> +		if (create_vgpu_device(pdev, uuid, instance, vgpu_params) < 0) {

Why do we care?  I still haven't seen anything that requires the gpu to
be a pci device.

> +			printk(KERN_ERR "%s vgpu create error \n", __FUNCTION__);
> +			ret = -EINVAL;
> +			goto create_error;
> +		}
> +		ret = count;
> +	}
> +
> +create_error:
> +	if (vgpu_params)
> +		kfree(vgpu_params);
> +
> +	if (pstr)
> +		kfree(pstr);

kfree(NULL) does the right thing.
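
The tail of the function can then simply be:

create_error:
        kfree(vgpu_params);
        kfree(pstr);

        return ret;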

> +	return ret;
> +}
> +
> +static ssize_t vgpu_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	char *uuid_str, *str;
> +	uuid_le uuid;
> +	unsigned int instance;
> +	struct vgpu_device *vgpu_dev = NULL;
> +
> +	str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	if ((uuid_str = strsep(&str, ":")) == NULL) {
> +		printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf);
> +		return -EINVAL;
> +	}
> +
> +	if (str == NULL) {
> +		printk(KERN_ERR "%s instance not specified %s \n", __FUNCTION__, buf);
> +		return -EINVAL;
> +	}
> +
> +	instance = (unsigned int)simple_strtoul(str, NULL, 0);
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
> +		return -EINVAL;
> +	}
> +
> +	printk(KERN_INFO "%s UUID %pUb - %d \n", __FUNCTION__, uuid.b, instance);
> +
> +	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, instance);

Since we have no reference counting, all we need to do to crash this is
race this destroy sysfs entry.

> +
> +	if (vgpu_dev)
> +		destroy_vgpu_device(vgpu_dev);
> +
> +	return count;

An error if not found might be nice.

> +}
> +
> +static ssize_t
> +vgpu_uuid_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +	struct vgpu_device *drv = to_vgpu_device(dev);
> +
> +	if (drv)
> +		return sprintf(buf, "%pUb \n", drv->uuid.b);
> +
> +	return sprintf(buf, " \n");
> +}
> +
> +static DEVICE_ATTR_RO(vgpu_uuid);
> +
> +static ssize_t
> +vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +	struct vgpu_device *drv = to_vgpu_device(dev);
> +
> +	if (drv && drv->group)
> +		return sprintf(buf, "%d \n", iommu_group_id(drv->group));
> +
> +	return sprintf(buf, " \n");

There should be an iommu_group link from the device to the group in
sysfs, otherwise this is inconsistent with real devices.

> +}
> +
> +static DEVICE_ATTR_RO(vgpu_group_id);
> +
> +
> +static struct attribute *vgpu_dev_attrs[] = {
> +	&dev_attr_vgpu_uuid.attr,
> +	&dev_attr_vgpu_group_id.attr,
> +	NULL,
> +};
> +
> +static const struct attribute_group vgpu_dev_group = {
> +	.attrs = vgpu_dev_attrs,
> +};
> +
> +const struct attribute_group *vgpu_dev_groups[] = {
> +	&vgpu_dev_group,
> +	NULL,
> +};
> +
> +
> +ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr,
> +			 const char *buf, size_t count)
> +{
> +	char *uuid_str;
> +	uuid_le uuid;
> +	struct vgpu_device *vgpu_dev = NULL;
> +	int ret;
> +
> +	uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
> +		return -EINVAL;
> +	}
> +
> +	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, 0);

No reference counting, so we hope nobody destroys the device while we
have it...

> +
> +	if (vgpu_dev && dev_is_vgpu(&vgpu_dev->dev)) {
> +		kobject_uevent(&vgpu_dev->dev.kobj, KOBJ_ONLINE);
> +
> +		ret = vgpu_start_callback(vgpu_dev);
> +		if (ret < 0) {
> +			printk(KERN_ERR "%s vgpu_start callback failed  %d \n",
> +					 __FUNCTION__, ret);
> +			return ret;
> +		}
> +	}
> +
> +	return count;
> +}
> +
> +ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *uuid_str;
> +	uuid_le uuid;
> +	struct vgpu_device *vgpu_dev = NULL;
> +	int ret;
> +
> +	uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
> +		return -EINVAL;
> +	}
> +	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, 0);
> +
> +	if (vgpu_dev && dev_is_vgpu(&vgpu_dev->dev)) {
> +		kobject_uevent(&vgpu_dev->dev.kobj, KOBJ_OFFLINE);
> +
> +		ret = vgpu_shutdown_callback(vgpu_dev);
> +		if (ret < 0) {
> +			printk(KERN_ERR "%s vgpu_shutdown callback failed  %d \n",
> +					 __FUNCTION__, ret);
> +			return ret;
> +		}
> +	}
> +
> +	return count;
> +}
> +
> +struct class_attribute vgpu_class_attrs[] = {
> +	__ATTR_WO(vgpu_start),
> +	__ATTR_WO(vgpu_shutdown),
> +	__ATTR_NULL
> +};
> +
> +int vgpu_create_pci_device_files(struct pci_dev *dev)

What's pci specific about this?

> +{
> +	int retval;
> +
> +	retval = sysfs_create_file(&dev->dev.kobj,
> +				   &dev_attr_vgpu_supported_types.attr);
> +	if (retval) {
> +		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_supported_types sysfs entry\n");
> +		return retval;
> +	}
> +
> +	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
> +	if (retval) {
> +		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs entry\n");
> +		return retval;
> +	}
> +
> +	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
> +	if (retval) {
> +		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs entry\n");
> +		return retval;
> +	}
> +
> +	return 0;
> +}
> +
> +
> +void vgpu_remove_pci_device_files(struct pci_dev *dev)

Or this?

> +{
> +	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
> +	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
> +	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
> +}
> diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
> new file mode 100644
> index 0000000..35158ef
> --- /dev/null
> +++ b/drivers/vgpu/vgpu_private.h
> @@ -0,0 +1,36 @@
> +/*
> + * VGPU internal definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author:
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef VGPU_PRIVATE_H
> +#define VGPU_PRIVATE_H
> +
> +struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance);
> +
> +int  create_vgpu_device(struct pci_dev *pdev, uuid_le uuid, uint32_t instance,
> +		       char *vgpu_params);
> +void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
> +
> +int  vgpu_bus_register(void);
> +void vgpu_bus_unregister(void);
> +
> +/* Function prototypes for vgpu_sysfs */
> +
> +extern struct class_attribute vgpu_class_attrs[];
> +extern const struct attribute_group *vgpu_dev_groups[];
> +
> +int  vgpu_create_pci_device_files(struct pci_dev *dev);
> +void vgpu_remove_pci_device_files(struct pci_dev *dev);
> +
> +void get_vgpu_supported_types(struct device *dev, char *str);
> +int  vgpu_start_callback(struct vgpu_device *vgpu_dev);
> +int  vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
> +
> +#endif /* VGPU_PRIVATE_H */
> diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
> new file mode 100644
> index 0000000..03a77cf
> --- /dev/null
> +++ b/include/linux/vgpu.h
> @@ -0,0 +1,216 @@
> +/*
> + * VGPU definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author:
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef VGPU_H
> +#define VGPU_H
> +
> +// Common Data structures
> +
> +struct pci_bar_info {
> +	uint64_t start;
> +	uint64_t size;
> +	uint32_t flags;
> +};
> +
> +enum vgpu_emul_space_e {
> +	vgpu_emul_space_config = 0, /*!< PCI configuration space */
> +	vgpu_emul_space_io = 1,     /*!< I/O register space */
> +	vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
> +};

Actual PCI specific stuff, but should it be in the vgpu core or where
it's actually used?

> +
> +struct gpu_device;
> +
> +/*
> + * VGPU device
> + */
> +struct vgpu_device {
> +	struct kref		kref;

This should really be used more for reference counting.

> +	struct device		dev;
> +	struct gpu_device	*gpu_dev;

And the vgpu_device really should hold a reference to the gpu_device.

> +	struct iommu_group	*group;

Like it does for the iommu_group.

> +#define DEVICE_NAME_LEN		(64)
> +	char			dev_name[DEVICE_NAME_LEN];
> +	uuid_le			uuid;
> +	uint32_t		vgpu_instance;

prefixing vgpu_ on vgpu_device fields seems redundant, and inconsistent
since it's not vgpu_uuid.

> +	struct device_attribute	*dev_attr_vgpu_status;
> +	int			vgpu_device_status;
> +
> +	void			*driver_data;
> +
> +	struct list_head	list;
> +};
> +
> +
> +/**
> + * struct gpu_device_ops - Structure to be registered for each physical GPU to
> + * register the device to vgpu module.
> + *
> + * @owner:			The module owner.
> + * @dev_attr_groups:		Default attributes of the physical device.
> + * @vgpu_attr_groups:		Default attributes of the vGPU device.
> + * @vgpu_supported_config:	Called to get information about supported vgpu types.
> + *				@dev : pci device structure of physical GPU.
> + *				@config: should return string listing supported config
> + *				Returns integer: success (0) or error (< 0)
> + * @vgpu_create:		Called to allocate basic resources in graphics
> + *				driver for a particular vgpu.
> + *				@dev: physical pci device structure on which vgpu
> + *				      should be created
> + *				@uuid: VM's uuid for which VM it is intended to
> + *				@instance: vgpu instance in that VM
> + *				@vgpu_params: extra parameters required by GPU driver.
> + *				Returns integer: success (0) or error (< 0)
> + * @vgpu_destroy:		Called to free resources in graphics driver for
> + *				a vgpu instance of that VM.
> + *				@dev: physical pci device structure to which
> + *				this vgpu points to.
> + *				@uuid: VM's uuid for which the vgpu belongs to.
> + *				@instance: vgpu instance in that VM
> + *				Returns integer: success (0) or error (< 0)
> + *				If VM is running and vgpu_destroy is called that
> + *				means the vGPU is being hot-unplugged. Return error
> + *				if VM is running and graphics driver doesn't
> + *				support vgpu hotplug.
> + * @vgpu_start:			Called to do initiate vGPU initialization
> + *				process in graphics driver when VM boots before
> + *				qemu starts.
> + *				@uuid: VM's UUID which is booting.
> + *				Returns integer: success (0) or error (< 0)
> + * @vgpu_shutdown:		Called to teardown vGPU related resources for
> + *				the VM
> + *				@uuid: VM's UUID which is shutting down .
> + *				Returns integer: success (0) or error (< 0)
> + * @read:			Read emulation callback
> + *				@vdev: vgpu device structure
> + *				@buf: read buffer
> + *				@count: number bytes to read
> + *				@address_space: specifies for which address space
> + *				the request is: pci_config_space, IO register
> + *				space or MMIO space.
> + *				@pos: offset from base address.
> + *				Returns number of bytes read on success or error.
> + * @write:			Write emulation callback
> + *				@vdev: vgpu device structure
> + *				@buf: write buffer
> + *				@count: number bytes to be written
> + *				@address_space: specifies for which address space
> + *				the request is: pci_config_space, IO register
> + *				space or MMIO space.
> + *				@pos: offset from base address.
> + *				Returns number of bytes written on success or error.

How do these support multiple MMIO spaces or IO port spaces?  GPUs, and
therefore I assume vGPUs, often have more than one MMIO space; how does
the enum above tell us which one?  We could simply make this be a
region index.
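
For instance (the names below are only illustrative, nothing in this
patch defines them):

/*
 * region_index could follow the vfio-pci layout, one index per BAR
 * plus one for config space, instead of the 3-value enum above.
 */
ssize_t (*read)(struct vgpu_device *vdev, char *buf, size_t count,
                unsigned int region_index, loff_t pos);
ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
                 unsigned int region_index, loff_t pos);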

> + * @vgpu_set_irqs:		Called to send about interrupts configuration
> + *				information that qemu set.
> + *				@vdev: vgpu device structure
> + *				@flags, index, start, count and *data : same as
> + *				that of struct vfio_irq_set of
> + *				VFIO_DEVICE_SET_IRQS API.

How do we learn about the supported interrupt types?  Should this be
called vgpu_vfio_set_irqs if it's following the vfio API?

> + * @vgpu_bar_info:		Called to get BAR size and flags of vGPU device.
> + *				@vdev: vgpu device structure
> + *				@bar_index: BAR index
> + *				@bar_info: output, returns size and flags of
> + *				requested BAR
> + *				Returns integer: success (0) or error (< 0)

This is called bar_info, but the bar_index is actually the vfio region
index, and things like the config region info are being overloaded
through it.  We already have a structure defined for getting a generic
region index, why not use it?  Maybe this should just be
vgpu_vfio_get_region_info.
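
Roughly, as a sketch reusing the existing vfio uapi structure:

int (*vgpu_vfio_get_region_info)(struct vgpu_device *vdev,
                                 unsigned int region_index,
                                 struct vfio_region_info *info);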

> + * @validate_map_request:	Validate remap pfn request
> + *				@vdev: vgpu device structure
> + *				@virtaddr: target user address to start at
> + *				@pfn: physical address of kernel memory, GPU
> + *				driver can change if required.
> + *				@size: size of map area, GPU driver can change
> + *				the size of map area if desired.
> + *				@prot: page protection flags for this mapping,
> + *				GPU driver can change, if required.
> + *				Returns integer: success (0) or error (< 0)

It was not at all clear to me what this did until I got to patch 2;
this is actually providing the fault handling for mmap'ing a vGPU mmio
BAR.  Needs a better name or better description.

> + *
> + * Physical GPU that support vGPU should be register with vgpu module with
> + * gpu_device_ops structure.
> + */
> +
> +struct gpu_device_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **vgpu_attr_groups;
> +
> +	int	(*vgpu_supported_config)(struct pci_dev *dev, char *config);
> +	int     (*vgpu_create)(struct pci_dev *dev, uuid_le uuid,
> +			       uint32_t instance, char *vgpu_params);
> +	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le uuid,
> +			        uint32_t instance);
> +
> +	int     (*vgpu_start)(uuid_le uuid);
> +	int     (*vgpu_shutdown)(uuid_le uuid);
> +
> +	ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
> +			 uint32_t address_space, loff_t pos);
> +	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
> +			 uint32_t address_space, loff_t pos);

Aren't these really 'enum vgpu_emul_space_e', not uint32_t?

> +	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
> +				 unsigned index, unsigned start, unsigned count,
> +				 void *data);
> +	int	(*vgpu_bar_info)(struct vgpu_device *vdev, int bar_index,
> +				 struct pci_bar_info *bar_info);
> +	int	(*validate_map_request)(struct vgpu_device *vdev,
> +					unsigned long virtaddr,
> +					unsigned long *pfn, unsigned long *size,
> +					pgprot_t *prot);
> +};
> +
> +/*
> + * Physical GPU
> + */
> +struct gpu_device {
> +	struct pci_dev                  *dev;
> +	const struct gpu_device_ops     *ops;
> +	struct list_head                gpu_next;
> +};
> +
> +/**
> + * struct vgpu_driver - vGPU device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @driver: device driver structure
> + *
> + **/
> +struct vgpu_driver {
> +	const char *name;
> +	int  (*probe)  (struct device *dev);
> +	void (*remove) (struct device *dev);
> +	struct device_driver	driver;
> +};
> +
> +static inline struct vgpu_driver *to_vgpu_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct vgpu_driver, driver) : NULL;
> +}
> +
> +static inline struct vgpu_device *to_vgpu_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct vgpu_device, dev) : NULL;
> +}
> +
> +extern struct bus_type vgpu_bus_type;
> +
> +#define dev_is_vgpu(d) ((d)->bus == &vgpu_bus_type)
> +
> +extern int  vgpu_register_device(struct pci_dev *dev,
> +				 const struct gpu_device_ops *ops);
> +extern void vgpu_unregister_device(struct pci_dev *dev);
> +
> +extern int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
> +extern void vgpu_unregister_driver(struct vgpu_driver *drv);
> +
> +extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
> +				uint32_t len, uint32_t flags);
> +extern int vgpu_dma_do_translate(dma_addr_t * gfn_buffer, uint32_t count);
> +
> +struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
> +
> +#endif /* VGPU_H */


The sysfs ABI needs to be documented in
Documentation/ABI/testing/sysfs-vgpu.  This is particularly important
for things like the format used for the create/destroy interfaces.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 1/3] vGPU Core driver
@ 2016-05-03 22:43     ` Alex Williamson
  0 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-03 22:43 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv

On Tue, 3 May 2016 00:10:39 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> Design for vGPU Driver:
> Main purpose of vGPU driver is to provide a common interface for vGPU
> management that can be used by different GPU drivers.
> 
> This module would provide a generic interface to create the device, add
> it to vGPU bus, add device to IOMMU group and then add it to vfio group.
> 
> High Level block diagram:
> 
> +--------------+    vgpu_register_driver()+---------------+
> |     __init() +------------------------->+               |
> |              |                          |               |
> |              +<-------------------------+    vgpu.ko    |
> | vgpu_vfio.ko |   probe()/remove()       |               |
> |              |                +---------+               +---------+
> +--------------+                |         +-------+-------+         |
>                                 |                 ^                 |
>                                 | callback        |                 |
>                                 |         +-------+--------+        |
>                                 |         |vgpu_register_device()   |
>                                 |         |                |        |
>                                 +---^-----+-----+    +-----+------+-+
>                                     | nvidia.ko |    |  i915.ko   |
>                                     |           |    |            |
>                                     +-----------+    +------------+
> 
> vGPU driver provides two types of registration interfaces:
> 1. Registration interface for vGPU bus driver:
> 
> /**
>   * struct vgpu_driver - vGPU device driver
>   * @name: driver name
>   * @probe: called when new device created
>   * @remove: called when device removed
>   * @driver: device driver structure
>   *
>   **/
> struct vgpu_driver {
>          const char *name;
>          int  (*probe)  (struct device *dev);
>          void (*remove) (struct device *dev);
>          struct device_driver    driver;
> };
> 
> int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
> void vgpu_unregister_driver(struct vgpu_driver *drv);
> 
> VFIO bus driver for vgpu, should use this interface to register with
> vGPU driver. With this, VFIO bus driver for vGPU devices is responsible
> to add vGPU device to VFIO group.
> 
> 2. GPU driver interface
> GPU driver interface provides GPU driver the set APIs to manage GPU driver
> related work in their own driver. APIs are to:
> - vgpu_supported_config: provide supported configuration list by the GPU.
> - vgpu_create: to allocate basic resources in GPU driver for a vGPU device.
> - vgpu_destroy: to free resources in GPU driver during vGPU device destroy.
> - vgpu_start: to initiate vGPU initialization process from GPU driver when VM
>   boots and before QEMU starts.
> - vgpu_shutdown: to teardown vGPU resources during VM teardown.
> - read : read emulation callback.
> - write: write emulation callback.
> - vgpu_set_irqs: send interrupt configuration information that QEMU sets.
> - vgpu_bar_info: to provide BAR size and its flags for the vGPU device.
> - validate_map_request: to validate remap pfn request.
> 
> This registration interface should be used by GPU drivers to register
> each physical device to vGPU driver.
> 
> Updated this patch with couple of more functions in GPU driver interface
> which were discussed during v1 version of this RFC.
> 
> Thanks,
> Kirti.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Signed-off-by: Neo Jia <cjia@nvidia.com>
> Change-Id: I1c13c411f61b7b2e750e85adfe1b097f9fd218b9
> ---
>  drivers/Kconfig             |    2 +
>  drivers/Makefile            |    1 +
>  drivers/vgpu/Kconfig        |   21 ++
>  drivers/vgpu/Makefile       |    4 +
>  drivers/vgpu/vgpu-core.c    |  424 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/vgpu/vgpu-driver.c  |  136 ++++++++++++++
>  drivers/vgpu/vgpu-sysfs.c   |  365 +++++++++++++++++++++++++++++++++++++
>  drivers/vgpu/vgpu_private.h |   36 ++++
>  include/linux/vgpu.h        |  216 ++++++++++++++++++++++
>  9 files changed, 1205 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/vgpu/Kconfig
>  create mode 100644 drivers/vgpu/Makefile
>  create mode 100644 drivers/vgpu/vgpu-core.c
>  create mode 100644 drivers/vgpu/vgpu-driver.c
>  create mode 100644 drivers/vgpu/vgpu-sysfs.c
>  create mode 100644 drivers/vgpu/vgpu_private.h
>  create mode 100644 include/linux/vgpu.h
> 
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index d2ac339..5fd9eae 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
>  
>  source "drivers/vfio/Kconfig"
>  
> +source "drivers/vgpu/Kconfig"
> +
>  source "drivers/vlynq/Kconfig"
>  
>  source "drivers/virt/Kconfig"
> diff --git a/drivers/Makefile b/drivers/Makefile
> index 8f5d076..36f1110 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)		+= message/
>  obj-y				+= firewire/
>  obj-$(CONFIG_UIO)		+= uio/
>  obj-$(CONFIG_VFIO)		+= vfio/
> +obj-$(CONFIG_VFIO)		+= vgpu/
>  obj-y				+= cdrom/
>  obj-y				+= auxdisplay/
>  obj-$(CONFIG_PCCARD)		+= pcmcia/
> diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
> new file mode 100644
> index 0000000..792eb48
> --- /dev/null
> +++ b/drivers/vgpu/Kconfig
> @@ -0,0 +1,21 @@
> +
> +menuconfig VGPU
> +    tristate "VGPU driver framework"
> +    depends on VFIO
> +    select VGPU_VFIO
> +    help
> +        VGPU provides a framework to virtualize GPU without SR-IOV cap
> +        See Documentation/vgpu.txt for more details.
> +
> +        If you don't know what do here, say N.
> +
> +config VGPU
> +    tristate
> +    depends on VFIO
> +    default n
> +
> +config VGPU_VFIO
> +    tristate
> +    depends on VGPU
> +    default n
> +

This is a little bit convoluted, it seems like everything added in this
patch is vfio agnostic, it doesn't necessarily care what the consumer
is.  That makes me think we should only be adding CONFIG_VGPU here and
it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
The middle config entry is also redundant to the first, just move the
default line up to the first and remove the rest.

> diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
> new file mode 100644
> index 0000000..f5be980
> --- /dev/null
> +++ b/drivers/vgpu/Makefile
> @@ -0,0 +1,4 @@
> +
> +vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
> +
> +obj-$(CONFIG_VGPU)			+= vgpu.o
> diff --git a/drivers/vgpu/vgpu-core.c b/drivers/vgpu/vgpu-core.c
> new file mode 100644
> index 0000000..1a7d274
> --- /dev/null
> +++ b/drivers/vgpu/vgpu-core.c
> @@ -0,0 +1,424 @@
> +/*
> + * VGPU Core Driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/cdev.h>
> +#include <linux/sched.h>
> +#include <linux/wait.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/vgpu.h>
> +
> +#include "vgpu_private.h"
> +
> +#define DRIVER_VERSION	"0.1"
> +#define DRIVER_AUTHOR	"NVIDIA Corporation"
> +#define DRIVER_DESC	"VGPU Core Driver"
> +
> +/*
> + * #defines
> + */
> +
> +#define VGPU_CLASS_NAME		"vgpu"
> +
> +/*
> + * Global Structures
> + */
> +
> +static struct vgpu {
> +	struct list_head    vgpu_devices_list;
> +	struct mutex        vgpu_devices_lock;
> +	struct list_head    gpu_devices_list;
> +	struct mutex        gpu_devices_lock;
> +} vgpu;
> +
> +static struct class vgpu_class;
> +
> +/*
> + * Functions
> + */
> +
> +struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group)
> +{
> +	struct vgpu_device *vdev = NULL;
> +
> +	mutex_lock(&vgpu.vgpu_devices_lock);
> +	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
> +		if (vdev->group) {
> +			if (iommu_group_id(vdev->group) == iommu_group_id(group)) {
> +				mutex_unlock(&vgpu.vgpu_devices_lock);
> +				return vdev;
> +			}
> +		}
> +	}
> +	mutex_unlock(&vgpu.vgpu_devices_lock);
> +	return NULL;
> +}
> +
> +EXPORT_SYMBOL_GPL(get_vgpu_device_from_group);
> +
> +static int vgpu_add_attribute_group(struct device *dev,
> +			            const struct attribute_group **groups)
> +{
> +        return sysfs_create_groups(&dev->kobj, groups);
> +}
> +
> +static void vgpu_remove_attribute_group(struct device *dev,
> +			                const struct attribute_group **groups)
> +{
> +        sysfs_remove_groups(&dev->kobj, groups);
> +}
> +
> +int vgpu_register_device(struct pci_dev *dev, const struct gpu_device_ops *ops)

To make the API abundantly clear, how about vgpu_register_gpu_device()
to avoid confusion with a vgpu device.

Why do we care that it's a pci_dev?  It seems like there's only a very
small portion of the API that cares about pci_devs in order to describe
BARs, which could be switched based on the device type.  Otherwise we
could operate on a struct device here.
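
i.e. something like (prototypes only, for illustration):

int  vgpu_register_gpu_device(struct device *dev,
                              const struct gpu_device_ops *ops);
void vgpu_unregister_gpu_device(struct device *dev);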

> +{
> +	int ret = 0;
> +	struct gpu_device *gpu_dev, *tmp;
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +        gpu_dev = kzalloc(sizeof(*gpu_dev), GFP_KERNEL);
> +        if (!gpu_dev)
> +                return -ENOMEM;
> +
> +	gpu_dev->dev = dev;
> +        gpu_dev->ops = ops;
> +
> +        mutex_lock(&vgpu.gpu_devices_lock);
> +
> +        /* Check for duplicates */
> +        list_for_each_entry(tmp, &vgpu.gpu_devices_list, gpu_next) {
> +                if (tmp->dev == dev) {
> +			ret = -EINVAL;

Maybe -EEXIST here to get a different error value.

> +			goto add_error;
> +                }
> +        }
> +
> +	ret = vgpu_create_pci_device_files(dev);

I don't actually see anything pci specific in that function.

> +	if (ret)
> +		goto add_error;
> +
> +	ret = vgpu_add_attribute_group(&dev->dev, ops->dev_attr_groups);
> +	if (ret)
> +		goto add_group_error;
> +
> +        list_add(&gpu_dev->gpu_next, &vgpu.gpu_devices_list);

Whitespace issues, please run scripts/checkpatch.pl on patches before
posting.

> +
> +	printk(KERN_INFO "VGPU: Registered dev 0x%x 0x%x, class 0x%x\n",
> +			 dev->vendor, dev->device, dev->class);

This is a place where we're using pci_dev specific fields, but it's not
very useful.  We're registering a specific device, not everything that
matches this set of vendor/device/class, so what is the user supposed
to learn from this?  A dev_info here would give us the name of the
specific device we're registering and be device type agnostic.

> +        mutex_unlock(&vgpu.gpu_devices_lock);
> +
> +        return 0;
> +
> +add_group_error:
> +	vgpu_remove_pci_device_files(dev);
> +add_error:
> +	mutex_unlock(&vgpu.gpu_devices_lock);
> +	kfree(gpu_dev);
> +	return ret;
> +
> +}
> +EXPORT_SYMBOL(vgpu_register_device);
> +
> +void vgpu_unregister_device(struct pci_dev *dev)
> +{
> +        struct gpu_device *gpu_dev;
> +
> +        mutex_lock(&vgpu.gpu_devices_lock);
> +        list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
> +		struct vgpu_device *vdev = NULL;
> +
> +                if (gpu_dev->dev != dev)
> +			continue;
> +
> +		printk(KERN_INFO "VGPU: Unregistered dev 0x%x 0x%x, class 0x%x\n",
> +				dev->vendor, dev->device, dev->class);

Same comments as above for function name, device type, and this print.

> +
> +		list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {

How can we walk this list without holding vgpu_devices_lock?

> +			if (vdev->gpu_dev != gpu_dev)
> +				continue;
> +			destroy_vgpu_device(vdev);
> +		}
> +		vgpu_remove_attribute_group(&dev->dev, gpu_dev->ops->dev_attr_groups);
> +		vgpu_remove_pci_device_files(dev);
> +		list_del(&gpu_dev->gpu_next);
> +		mutex_unlock(&vgpu.gpu_devices_lock);

It's often desirable to avoid multiple exit points, especially when
locking is involved, to simplify the code flow.  It would be very easy
to accomplish that here.
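
Untested sketch with a single unlock/exit, keeping the pci_dev
parameter as it is today:

void vgpu_unregister_device(struct pci_dev *dev)
{
        struct gpu_device *gpu_dev;

        mutex_lock(&vgpu.gpu_devices_lock);
        list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
                if (gpu_dev->dev != dev)
                        continue;

                /* ... destroy child vgpu devices, remove sysfs files ... */

                list_del(&gpu_dev->gpu_next);
                kfree(gpu_dev);
                break;
        }
        mutex_unlock(&vgpu.gpu_devices_lock);
}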

> +		kfree(gpu_dev);
> +		return;
> +        }
> +        mutex_unlock(&vgpu.gpu_devices_lock);
> +}
> +EXPORT_SYMBOL(vgpu_unregister_device);
> +
> +/*
> + * Helper Functions
> + */
> +
> +static struct vgpu_device *vgpu_device_alloc(uuid_le uuid, int instance, char *name)
> +{
> +	struct vgpu_device *vgpu_dev = NULL;
> +
> +	vgpu_dev = kzalloc(sizeof(*vgpu_dev), GFP_KERNEL);
> +	if (!vgpu_dev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	kref_init(&vgpu_dev->kref);
> +	memcpy(&vgpu_dev->uuid, &uuid, sizeof(uuid_le));
> +	vgpu_dev->vgpu_instance = instance;
> +	strcpy(vgpu_dev->dev_name, name);
> +
> +	mutex_lock(&vgpu.vgpu_devices_lock);
> +	list_add(&vgpu_dev->list, &vgpu.vgpu_devices_list);
> +	mutex_unlock(&vgpu.vgpu_devices_lock);
> +
> +	return vgpu_dev;
> +}
> +
> +static void vgpu_device_free(struct vgpu_device *vgpu_dev)
> +{
> +	if (vgpu_dev) {
> +		mutex_lock(&vgpu.vgpu_devices_lock);
> +		list_del(&vgpu_dev->list);
> +		mutex_unlock(&vgpu.vgpu_devices_lock);
> +		kfree(vgpu_dev);
> +	}

Why aren't we using the kref to remove and free the vgpu when the last
reference is released?
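
For example (the helper names below are invented for illustration, they
aren't in the patch):

static void vgpu_device_last_put(struct kref *kref)
{
        struct vgpu_device *vgpu_dev =
                        container_of(kref, struct vgpu_device, kref);

        /* called with vgpu.vgpu_devices_lock held */
        list_del(&vgpu_dev->list);
        kfree(vgpu_dev);
}

void vgpu_device_put(struct vgpu_device *vgpu_dev)
{
        mutex_lock(&vgpu.vgpu_devices_lock);
        kref_put(&vgpu_dev->kref, vgpu_device_last_put);
        mutex_unlock(&vgpu.vgpu_devices_lock);
}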

> +	return;

Unnecessary

> +}
> +
> +struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance)
> +{
> +	struct vgpu_device *vdev = NULL;
> +
> +	mutex_lock(&vgpu.vgpu_devices_lock);
> +	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
> +		if ((uuid_le_cmp(vdev->uuid, uuid) == 0) &&
> +		    (vdev->vgpu_instance == instance)) {
> +			mutex_unlock(&vgpu.vgpu_devices_lock);
> +			return vdev;

We're not taking any sort of reference to the vgpu, what prevents races
with it being removed?  A common exit path would be easy to achieve
here too.
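
e.g. take the reference under the lock with a single exit (assuming a
matching vgpu_device_put() along the lines sketched above):

struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance)
{
        struct vgpu_device *vdev, *found = NULL;

        mutex_lock(&vgpu.vgpu_devices_lock);
        list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
                if (uuid_le_cmp(vdev->uuid, uuid) == 0 &&
                    vdev->vgpu_instance == instance) {
                        kref_get(&vdev->kref);  /* pin it for the caller */
                        found = vdev;
                        break;
                }
        }
        mutex_unlock(&vgpu.vgpu_devices_lock);

        return found;
}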

> +		}
> +	}
> +	mutex_unlock(&vgpu.vgpu_devices_lock);
> +	return NULL;
> +}
> +
> +static void vgpu_device_release(struct device *dev)
> +{
> +	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
> +	vgpu_device_free(vgpu_dev);
> +}
> +
> +int create_vgpu_device(struct pci_dev *pdev, uuid_le uuid, uint32_t instance, char *vgpu_params)
> +{

I'm not seeing anything here that really cares if the host gpu is a
struct device vs pci_dev either.

> +	char name[64];
> +	int numChar = 0;
> +	int retval = 0;
> +	struct vgpu_device *vgpu_dev = NULL;
> +	struct gpu_device *gpu_dev;
> +
> +	printk(KERN_INFO "VGPU: %s: device ", __FUNCTION__);

pr_info() would be preferred, but this seems like leftover debug and
should be removed.

> +
> +	numChar = sprintf(name, "%pUb-%d", uuid.b, instance);

Use snprintf even though this shouldn't be able to overflow.
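
For example:

        snprintf(name, sizeof(name), "%pUb-%d", uuid.b, instance);

snprintf() also NUL terminates, so the explicit name[numChar] = '\0'
below can go away.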

> +	name[numChar] = '\0';
> +
> +	vgpu_dev = vgpu_device_alloc(uuid, instance, name);
> +	if (IS_ERR(vgpu_dev)) {
> +		return PTR_ERR(vgpu_dev);
> +	}
> +
> +	vgpu_dev->dev.parent  = &pdev->dev;
> +	vgpu_dev->dev.bus     = &vgpu_bus_type;
> +	vgpu_dev->dev.release = vgpu_device_release;
> +	dev_set_name(&vgpu_dev->dev, "%s", name);
> +
> +	retval = device_register(&vgpu_dev->dev);
> +	if (retval)
> +		goto create_failed1;
> +
> +	printk(KERN_INFO "UUID %pUb \n", vgpu_dev->uuid.b);

This also looks like debug.

> +
> +	mutex_lock(&vgpu.gpu_devices_lock);
> +	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
> +		if (gpu_dev->dev != pdev)
> +			continue;
> +
> +		vgpu_dev->gpu_dev = gpu_dev;
> +		if (gpu_dev->ops->vgpu_create) {
> +			retval = gpu_dev->ops->vgpu_create(pdev, vgpu_dev->uuid,
> +							   instance, vgpu_params);
> +			if (retval) {
> +				mutex_unlock(&vgpu.gpu_devices_lock);
> +				goto create_failed2;
> +			}
> +		}
> +		break;
> +	}
> +	if (!vgpu_dev->gpu_dev) {
> +		retval = -EINVAL;
> +		mutex_unlock(&vgpu.gpu_devices_lock);
> +		goto create_failed2;
> +	}
> +
> +	mutex_unlock(&vgpu.gpu_devices_lock);
> +
> +	retval = vgpu_add_attribute_group(&vgpu_dev->dev, gpu_dev->ops->vgpu_attr_groups);
> +	if (retval)
> +		goto create_attr_error;
> +
> +	return retval;
> +
> +create_attr_error:
> +	if (gpu_dev->ops->vgpu_destroy) {
> +		int ret = 0;
> +		ret = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
> +						 vgpu_dev->uuid,
> +						 vgpu_dev->vgpu_instance);

Unnecessary initialization, and we don't do anything with the result.
Below indicates that lack of vgpu_destroy means the vendor doesn't
support unplug, but doesn't that break our error cleanup path here?

> +	}
> +
> +create_failed2:
> +	device_unregister(&vgpu_dev->dev);
> +
> +create_failed1:
> +	vgpu_device_free(vgpu_dev);
> +
> +	return retval;
> +}
> +
> +void destroy_vgpu_device(struct vgpu_device *vgpu_dev)
> +{
> +	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
> +
> +	printk(KERN_INFO "VGPU: destroying device %s ", vgpu_dev->dev_name);

dev_info()

> +	if (gpu_dev->ops->vgpu_destroy) {
> +		int retval = 0;

Unnecessary initialization, in fact this entire variable is unnecessary.

> +		retval = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
> +						    vgpu_dev->uuid,
> +						    vgpu_dev->vgpu_instance);
> +	/* if vendor driver doesn't return success that means vendor driver doesn't
> +	 * support hot-unplug */
> +		if (retval)
> +			return;

Should we return an error code then?  Inconsistent comment style.

> +	}
> +
> +	vgpu_remove_attribute_group(&vgpu_dev->dev, gpu_dev->ops->vgpu_attr_groups);
> +	device_unregister(&vgpu_dev->dev);
> +}
> +
> +void get_vgpu_supported_types(struct device *dev, char *str)
> +{
> +	struct gpu_device *gpu_dev;
> +
> +	mutex_lock(&vgpu.gpu_devices_lock);
> +	list_for_each_entry(gpu_dev, &vgpu.gpu_devices_list, gpu_next) {
> +		if (&gpu_dev->dev->dev == dev) {
> +			if (gpu_dev->ops->vgpu_supported_config)
> +				gpu_dev->ops->vgpu_supported_config(gpu_dev->dev, str);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&vgpu.gpu_devices_lock);
> +}
> +
> +int vgpu_start_callback(struct vgpu_device *vgpu_dev)
> +{
> +	int ret = 0;
> +	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
> +
> +	mutex_lock(&vgpu.gpu_devices_lock);
> +	if (gpu_dev->ops->vgpu_start)
> +		ret = gpu_dev->ops->vgpu_start(vgpu_dev->uuid);
> +	mutex_unlock(&vgpu.gpu_devices_lock);
> +	return ret;
> +}
> +
> +int vgpu_shutdown_callback(struct vgpu_device *vgpu_dev)
> +{
> +	int ret = 0;
> +	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
> +
> +	mutex_lock(&vgpu.gpu_devices_lock);
> +	if (gpu_dev->ops->vgpu_shutdown)
> +		ret = gpu_dev->ops->vgpu_shutdown(vgpu_dev->uuid);
> +	mutex_unlock(&vgpu.gpu_devices_lock);
> +	return ret;
> +}
> +
> +char *vgpu_devnode(struct device *dev, umode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vgpu/%s", dev_name(dev));
> +}
> +
> +static void release_vgpubus_dev(struct device *dev)
> +{
> +	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
> +	destroy_vgpu_device(vgpu_dev);
> +}
> +
> +static struct class vgpu_class = {
> +	.name		= VGPU_CLASS_NAME,
> +	.owner		= THIS_MODULE,
> +	.class_attrs	= vgpu_class_attrs,
> +	.dev_groups	= vgpu_dev_groups,
> +	.devnode	= vgpu_devnode,
> +	.dev_release    = release_vgpubus_dev,
> +};
> +
> +static int __init vgpu_init(void)
> +{
> +	int rc = 0;
> +
> +	memset(&vgpu, 0 , sizeof(vgpu));

Unnecessary, this is declared in the bss and zero initialized.

> +
> +	mutex_init(&vgpu.vgpu_devices_lock);
> +	INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
> +	mutex_init(&vgpu.gpu_devices_lock);
> +	INIT_LIST_HEAD(&vgpu.gpu_devices_list);
> +
> +	rc = class_register(&vgpu_class);
> +	if (rc < 0) {
> +		printk(KERN_ERR "Error: failed to register vgpu class\n");

pr_err()

> +		goto failed1;
> +	}
> +
> +	rc = vgpu_bus_register();
> +	if (rc < 0) {
> +		printk(KERN_ERR "Error: failed to register vgpu bus\n");
> +		class_unregister(&vgpu_class);
> +	}
> +
> +    request_module_nowait("vgpu_vfio");
> +
> +failed1:
> +	return rc;

While common exit points are good, if there's no cleanup and no
locking, why do we need failed1?
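
Rough, untested sketch without the label:

static int __init vgpu_init(void)
{
        int rc;

        mutex_init(&vgpu.vgpu_devices_lock);
        INIT_LIST_HEAD(&vgpu.vgpu_devices_list);
        mutex_init(&vgpu.gpu_devices_lock);
        INIT_LIST_HEAD(&vgpu.gpu_devices_list);

        rc = class_register(&vgpu_class);
        if (rc) {
                pr_err("Error: failed to register vgpu class\n");
                return rc;
        }

        rc = vgpu_bus_register();
        if (rc) {
                pr_err("Error: failed to register vgpu bus\n");
                class_unregister(&vgpu_class);
                return rc;
        }

        request_module_nowait("vgpu_vfio");

        return 0;
}

This also avoids requesting vgpu_vfio when the bus registration failed.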

> +}
> +
> +static void __exit vgpu_exit(void)
> +{
> +	vgpu_bus_unregister();
> +	class_unregister(&vgpu_class);
> +}
> +
> +module_init(vgpu_init)
> +module_exit(vgpu_exit)
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vgpu/vgpu-driver.c b/drivers/vgpu/vgpu-driver.c
> new file mode 100644
> index 0000000..c4c2e9f
> --- /dev/null
> +++ b/drivers/vgpu/vgpu-driver.c
> @@ -0,0 +1,136 @@
> +/*
> + * VGPU driver
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kernel.h>
> +#include <linux/fs.h>

I don't see any vfio here, or fs or sysfs or ctype.

> +#include <linux/vfio.h>
> +#include <linux/iommu.h>
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/vgpu.h>
> +
> +#include "vgpu_private.h"
> +
> +static int vgpu_device_attach_iommu(struct vgpu_device *vgpu_dev)
> +{
> +        int retval = 0;
> +        struct iommu_group *group = NULL;
> +
> +        group = iommu_group_alloc();
> +        if (IS_ERR(group)) {
> +                printk(KERN_ERR "VGPU: failed to allocate group!\n");
> +                return PTR_ERR(group);
> +        }
> +
> +        retval = iommu_group_add_device(group, &vgpu_dev->dev);
> +        if (retval) {
> +                printk(KERN_ERR "VGPU: failed to add dev to group!\n");

dev_err()

> +                iommu_group_put(group);

The iommu group should be put regardless of error; the device holds a
reference to the group to keep it around.

> +                return retval;
> +        }
> +
> +        vgpu_dev->group = group;
> +
> +        printk(KERN_INFO "VGPU: group_id = %d \n", iommu_group_id(group));
> +        return retval;
> +}
> +
> +static void vgpu_device_detach_iommu(struct vgpu_device *vgpu_dev)
> +{
> +        iommu_group_put(vgpu_dev->dev.iommu_group);
> +        iommu_group_remove_device(&vgpu_dev->dev);

Only the iommu_group_remove_device() should be needed here; the group
reference should have been released above, otherwise we're double
incrementing and double decrementing.

> +        printk(KERN_INFO "VGPU: detaching iommu \n");

debug.

> +}
> +
> +static int vgpu_device_probe(struct device *dev)
> +{
> +	struct vgpu_driver *drv = to_vgpu_driver(dev->driver);
> +	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
> +	int status = 0;
> +
> +	status = vgpu_device_attach_iommu(vgpu_dev);
> +	if (status) {
> +		printk(KERN_ERR "Failed to attach IOMMU\n");
> +		return status;
> +	}
> +
> +	if (drv && drv->probe) {
> +		status = drv->probe(dev);
> +	}
> +
> +	return status;
> +}
> +
> +static int vgpu_device_remove(struct device *dev)
> +{
> +	struct vgpu_driver *drv = to_vgpu_driver(dev->driver);
> +	struct vgpu_device *vgpu_dev = to_vgpu_device(dev);
> +	int status = 0;
> +
> +	if (drv && drv->remove) {
> +		drv->remove(dev);
> +	}
> +
> +	vgpu_device_detach_iommu(vgpu_dev);
> +
> +	return status;

return 0;  Or make this void.  .remove functions often return void, for
better or worse.

> +}
> +
> +struct bus_type vgpu_bus_type = {
> +	.name		= "vgpu",
> +	.probe		= vgpu_device_probe,
> +	.remove		= vgpu_device_remove,
> +};
> +EXPORT_SYMBOL_GPL(vgpu_bus_type);
> +
> +/**
> + * vgpu_register_driver - register a new vGPU driver
> + * @drv: the driver to register
> + * @owner: owner module of driver to register
> + *
> + * Returns a negative value on error, otherwise 0.
> + */
> +int vgpu_register_driver(struct vgpu_driver *drv, struct module *owner)
> +{
> +	/* initialize common driver fields */
> +	drv->driver.name = drv->name;
> +	drv->driver.bus = &vgpu_bus_type;
> +	drv->driver.owner = owner;
> +
> +	/* register with core */
> +	return driver_register(&drv->driver);
> +}
> +EXPORT_SYMBOL(vgpu_register_driver);
> +
> +/**
> + * vgpu_unregister_driver - unregister vGPU driver
> + * @drv: the driver to unregister
> + *
> + */
> +void vgpu_unregister_driver(struct vgpu_driver *drv)
> +{
> +	driver_unregister(&drv->driver);
> +}
> +EXPORT_SYMBOL(vgpu_unregister_driver);
> +
> +int vgpu_bus_register(void)
> +{
> +	return bus_register(&vgpu_bus_type);
> +}
> +
> +void vgpu_bus_unregister(void)
> +{
> +	bus_unregister(&vgpu_bus_type);
> +}
> diff --git a/drivers/vgpu/vgpu-sysfs.c b/drivers/vgpu/vgpu-sysfs.c
> new file mode 100644
> index 0000000..b740f9a
> --- /dev/null
> +++ b/drivers/vgpu/vgpu-sysfs.c
> @@ -0,0 +1,365 @@
> +/*
> + * File attributes for vGPU devices
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author: Neo Jia <cjia@nvidia.com>
> + *	       Kirti Wankhede <kwankhede@nvidia.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/sched.h>
> +#include <linux/fs.h>
> +#include <linux/sysfs.h>
> +#include <linux/ctype.h>
> +#include <linux/uuid.h>
> +#include <linux/vfio.h>

No vfio, fs, or ctype here either

> +#include <linux/vgpu.h>
> +
> +#include "vgpu_private.h"
> +
> +/* Prototypes */
> +
> +static ssize_t vgpu_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf);
> +static DEVICE_ATTR_RO(vgpu_supported_types);
> +
> +static ssize_t vgpu_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count);
> +static DEVICE_ATTR_WO(vgpu_create);
> +
> +static ssize_t vgpu_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count);
> +static DEVICE_ATTR_WO(vgpu_destroy);
> +
> +
> +/* Static functions */
> +
> +static bool is_uuid_sep(char sep)
> +{
> +	if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
> +		return true;
> +	return false;
> +}
> +
> +
> +static int uuid_parse(const char *str, uuid_le *uuid)
> +{
> +	int i;
> +
> +	if (strlen(str) < 36)
> +		return -1;
> +
> +	for (i = 0; i < 16; i++) {
> +		if (!isxdigit(str[0]) || !isxdigit(str[1])) {
> +			printk(KERN_ERR "%s err", __FUNCTION__);
> +			return -EINVAL;
> +		}
> +
> +		uuid->b[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
> +		str += 2;
> +		if (is_uuid_sep(*str))
> +			str++;
> +	}
> +
> +	return 0;
> +}
> +
> +
> +/* Functions */
> +static ssize_t vgpu_supported_types_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	char *str;
> +	ssize_t n;
> +
> +        str = kzalloc(sizeof(*str) * 512, GFP_KERNEL);

Arbitrary size limit?  Do we even need a separate buffer?

> +        if (!str)
> +                return -ENOMEM;
> +
> +	get_vgpu_supported_types(dev, str);
> +
> +	n = sprintf(buf,"%s\n", str);
> +	kfree(str);
> +
> +	return n;
> +}
> +
> +static ssize_t vgpu_create_store(struct device *dev,
> +				 struct device_attribute *attr,
> +				 const char *buf, size_t count)
> +{
> +	char *str, *pstr;
> +	char *uuid_str, *instance_str, *vgpu_params = NULL;
> +	uuid_le uuid;
> +	uint32_t instance;
> +	struct pci_dev *pdev;
> +	int ret = 0;
> +
> +	pstr = str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	if ((uuid_str = strsep(&str, ":")) == NULL) {
> +		printk(KERN_ERR "%s Empty UUID or string %s \n",
> +				 __FUNCTION__, buf);

pr_err() for all these.

> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (!str) {
> +		printk(KERN_ERR "%s vgpu instance not specified %s \n",
> +				 __FUNCTION__, buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if ((instance_str = strsep(&str, ":")) == NULL) {
> +		printk(KERN_ERR "%s Empty instance or string %s \n",
> +				 __FUNCTION__, buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	instance = (unsigned int)simple_strtoul(instance_str, NULL, 0);
> +
> +	if (!str) {
> +		printk(KERN_ERR "%s vgpu params not specified %s \n",
> +				 __FUNCTION__, buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	vgpu_params = kstrdup(str, GFP_KERNEL);
> +
> +	if (!vgpu_params) {
> +		printk(KERN_ERR "%s vgpu params allocation failed \n",
> +				 __FUNCTION__);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
> +		ret = -EINVAL;
> +		goto create_error;
> +	}
> +
> +	if (dev_is_pci(dev)) {
> +		pdev = to_pci_dev(dev);
> +
> +		if (create_vgpu_device(pdev, uuid, instance, vgpu_params) < 0) {

Why do we care?  I still haven't seen anything that requires the gpu to
be a pci device.

> +			printk(KERN_ERR "%s vgpu create error \n", __FUNCTION__);
> +			ret = -EINVAL;
> +			goto create_error;
> +		}
> +		ret = count;
> +	}
> +
> +create_error:
> +	if (vgpu_params)
> +		kfree(vgpu_params);
> +
> +	if (pstr)
> +		kfree(pstr);

kfree(NULL) does the right thing.

> +	return ret;
> +}
> +
> +static ssize_t vgpu_destroy_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	char *uuid_str, *str;
> +	uuid_le uuid;
> +	unsigned int instance;
> +	struct vgpu_device *vgpu_dev = NULL;
> +
> +	str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!str)
> +		return -ENOMEM;
> +
> +	if ((uuid_str = strsep(&str, ":")) == NULL) {
> +		printk(KERN_ERR "%s Empty UUID or string %s \n", __FUNCTION__, buf);
> +		return -EINVAL;
> +	}
> +
> +	if (str == NULL) {
> +		printk(KERN_ERR "%s instance not specified %s \n", __FUNCTION__, buf);
> +		return -EINVAL;
> +	}
> +
> +	instance = (unsigned int)simple_strtoul(str, NULL, 0);
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
> +		return -EINVAL;
> +	}
> +
> +	printk(KERN_INFO "%s UUID %pUb - %d \n", __FUNCTION__, uuid.b, instance);
> +
> +	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, instance);

Since we have no reference counting, all we need to do to crash this is
race this destroy sysfs entry.

> +
> +	if (vgpu_dev)
> +		destroy_vgpu_device(vgpu_dev);
> +
> +	return count;

An error if not found might be nice.

> +}
> +
> +static ssize_t
> +vgpu_uuid_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +	struct vgpu_device *drv = to_vgpu_device(dev);
> +
> +	if (drv)
> +		return sprintf(buf, "%pUb \n", drv->uuid.b);
> +
> +	return sprintf(buf, " \n");
> +}
> +
> +static DEVICE_ATTR_RO(vgpu_uuid);
> +
> +static ssize_t
> +vgpu_group_id_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +	struct vgpu_device *drv = to_vgpu_device(dev);
> +
> +	if (drv && drv->group)
> +		return sprintf(buf, "%d \n", iommu_group_id(drv->group));
> +
> +	return sprintf(buf, " \n");

There should be an iommu_group link from the device to the group in
sysfs, otherwise this is inconsistent with real devices.

> +}
> +
> +static DEVICE_ATTR_RO(vgpu_group_id);
> +
> +
> +static struct attribute *vgpu_dev_attrs[] = {
> +	&dev_attr_vgpu_uuid.attr,
> +	&dev_attr_vgpu_group_id.attr,
> +	NULL,
> +};
> +
> +static const struct attribute_group vgpu_dev_group = {
> +	.attrs = vgpu_dev_attrs,
> +};
> +
> +const struct attribute_group *vgpu_dev_groups[] = {
> +	&vgpu_dev_group,
> +	NULL,
> +};
> +
> +
> +ssize_t vgpu_start_store(struct class *class, struct class_attribute *attr,
> +			 const char *buf, size_t count)
> +{
> +	char *uuid_str;
> +	uuid_le uuid;
> +	struct vgpu_device *vgpu_dev = NULL;
> +	int ret;
> +
> +	uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
> +		return -EINVAL;
> +	}
> +
> +	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, 0);

No reference counting, so we hope nobody destroys the device while we
have it...

> +
> +	if (vgpu_dev && dev_is_vgpu(&vgpu_dev->dev)) {
> +		kobject_uevent(&vgpu_dev->dev.kobj, KOBJ_ONLINE);
> +
> +		ret = vgpu_start_callback(vgpu_dev);
> +		if (ret < 0) {
> +			printk(KERN_ERR "%s vgpu_start callback failed  %d \n",
> +					 __FUNCTION__, ret);
> +			return ret;
> +		}
> +	}
> +
> +	return count;
> +}
> +
> +ssize_t vgpu_shutdown_store(struct class *class, struct class_attribute *attr,
> +			    const char *buf, size_t count)
> +{
> +	char *uuid_str;
> +	uuid_le uuid;
> +	struct vgpu_device *vgpu_dev = NULL;
> +	int ret;
> +
> +	uuid_str = kstrndup(buf, count, GFP_KERNEL);
> +
> +	if (!uuid_str)
> +		return -ENOMEM;
> +
> +	if (uuid_parse(uuid_str, &uuid) < 0) {
> +		printk(KERN_ERR "%s UUID parse error  %s \n", __FUNCTION__, buf);
> +		return -EINVAL;
> +	}
> +	vgpu_dev = vgpu_drv_get_vgpu_device(uuid, 0);
> +
> +	if (vgpu_dev && dev_is_vgpu(&vgpu_dev->dev)) {
> +		kobject_uevent(&vgpu_dev->dev.kobj, KOBJ_OFFLINE);
> +
> +		ret = vgpu_shutdown_callback(vgpu_dev);
> +		if (ret < 0) {
> +			printk(KERN_ERR "%s vgpu_shutdown callback failed  %d \n",
> +					 __FUNCTION__, ret);
> +			return ret;
> +		}
> +	}
> +
> +	return count;
> +}
> +
> +struct class_attribute vgpu_class_attrs[] = {
> +	__ATTR_WO(vgpu_start),
> +	__ATTR_WO(vgpu_shutdown),
> +	__ATTR_NULL
> +};
> +
> +int vgpu_create_pci_device_files(struct pci_dev *dev)

What's pci specific about this?

> +{
> +	int retval;
> +
> +	retval = sysfs_create_file(&dev->dev.kobj,
> +				   &dev_attr_vgpu_supported_types.attr);
> +	if (retval) {
> +		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_supported_types sysfs entry\n");
> +		return retval;
> +	}
> +
> +	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
> +	if (retval) {
> +		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_create sysfs entry\n");
> +		return retval;
> +	}
> +
> +	retval = sysfs_create_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
> +	if (retval) {
> +		printk(KERN_ERR "VGPU-VFIO: failed to create vgpu_destroy sysfs entry\n");
> +		return retval;
> +	}
> +
> +	return 0;
> +}
> +
> +
> +void vgpu_remove_pci_device_files(struct pci_dev *dev)

Or this?

> +{
> +	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_supported_types.attr);
> +	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_create.attr);
> +	sysfs_remove_file(&dev->dev.kobj, &dev_attr_vgpu_destroy.attr);
> +}
> diff --git a/drivers/vgpu/vgpu_private.h b/drivers/vgpu/vgpu_private.h
> new file mode 100644
> index 0000000..35158ef
> --- /dev/null
> +++ b/drivers/vgpu/vgpu_private.h
> @@ -0,0 +1,36 @@
> +/*
> + * VGPU internal definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author:
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef VGPU_PRIVATE_H
> +#define VGPU_PRIVATE_H
> +
> +struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance);
> +
> +int  create_vgpu_device(struct pci_dev *pdev, uuid_le uuid, uint32_t instance,
> +		       char *vgpu_params);
> +void destroy_vgpu_device(struct vgpu_device *vgpu_dev);
> +
> +int  vgpu_bus_register(void);
> +void vgpu_bus_unregister(void);
> +
> +/* Function prototypes for vgpu_sysfs */
> +
> +extern struct class_attribute vgpu_class_attrs[];
> +extern const struct attribute_group *vgpu_dev_groups[];
> +
> +int  vgpu_create_pci_device_files(struct pci_dev *dev);
> +void vgpu_remove_pci_device_files(struct pci_dev *dev);
> +
> +void get_vgpu_supported_types(struct device *dev, char *str);
> +int  vgpu_start_callback(struct vgpu_device *vgpu_dev);
> +int  vgpu_shutdown_callback(struct vgpu_device *vgpu_dev);
> +
> +#endif /* VGPU_PRIVATE_H */
> diff --git a/include/linux/vgpu.h b/include/linux/vgpu.h
> new file mode 100644
> index 0000000..03a77cf
> --- /dev/null
> +++ b/include/linux/vgpu.h
> @@ -0,0 +1,216 @@
> +/*
> + * VGPU definition
> + *
> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
> + *     Author:
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef VGPU_H
> +#define VGPU_H
> +
> +// Common Data structures
> +
> +struct pci_bar_info {
> +	uint64_t start;
> +	uint64_t size;
> +	uint32_t flags;
> +};
> +
> +enum vgpu_emul_space_e {
> +	vgpu_emul_space_config = 0, /*!< PCI configuration space */
> +	vgpu_emul_space_io = 1,     /*!< I/O register space */
> +	vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
> +};

Actual PCI specific stuff, but should it be in the vgpu core or where
it's actually used?

> +
> +struct gpu_device;
> +
> +/*
> + * VGPU device
> + */
> +struct vgpu_device {
> +	struct kref		kref;

This should really be used more for reference counting.

> +	struct device		dev;
> +	struct gpu_device	*gpu_dev;

And the vgpu_device really should hold a reference to the gpu_device.

> +	struct iommu_group	*group;

Like it does for the iommu_group.

> +#define DEVICE_NAME_LEN		(64)
> +	char			dev_name[DEVICE_NAME_LEN];
> +	uuid_le			uuid;
> +	uint32_t		vgpu_instance;

prefixing vgpu_ on vgpu_device fields seems redundant, and inconsistent
since it's not vgpu_uuid.

> +	struct device_attribute	*dev_attr_vgpu_status;
> +	int			vgpu_device_status;
> +
> +	void			*driver_data;
> +
> +	struct list_head	list;
> +};
> +
> +
> +/**
> + * struct gpu_device_ops - Structure to be registered for each physical GPU to
> + * register the device to vgpu module.
> + *
> + * @owner:			The module owner.
> + * @dev_attr_groups:		Default attributes of the physical device.
> + * @vgpu_attr_groups:		Default attributes of the vGPU device.
> + * @vgpu_supported_config:	Called to get information about supported vgpu types.
> + *				@dev : pci device structure of physical GPU.
> + *				@config: should return string listing supported config
> + *				Returns integer: success (0) or error (< 0)
> + * @vgpu_create:		Called to allocate basic resouces in graphics
> + *				driver for a particular vgpu.
> + *				@dev: physical pci device structure on which vgpu
> + *				      should be created
> + *				@uuid: VM's uuid for which VM it is intended to
> + *				@instance: vgpu instance in that VM
> + *				@vgpu_params: extra parameters required by GPU driver.
> + *				Returns integer: success (0) or error (< 0)
> + * @vgpu_destroy:		Called to free resources in graphics driver for
> + *				a vgpu instance of that VM.
> + *				@dev: physical pci device structure to which
> + *				this vgpu points to.
> + *				@uuid: VM's uuid for which the vgpu belongs to.
> + *				@instance: vgpu instance in that VM
> + *				Returns integer: success (0) or error (< 0)
> + *				If VM is running and vgpu_destroy is called that
> + *				means the vGPU is being hotunpluged. Return error
> + *				if VM is running and graphics driver doesn't
> + *				support vgpu hotplug.
> + * @vgpu_start:			Called to do initiate vGPU initialization
> + *				process in graphics driver when VM boots before
> + *				qemu starts.
> + *				@uuid: VM's UUID which is booting.
> + *				Returns integer: success (0) or error (< 0)
> + * @vgpu_shutdown:		Called to teardown vGPU related resources for
> + *				the VM
> + *				@uuid: VM's UUID which is shutting down .
> + *				Returns integer: success (0) or error (< 0)
> + * @read:			Read emulation callback
> + *				@vdev: vgpu device structure
> + *				@buf: read buffer
> + *				@count: number bytes to read
> + *				@address_space: specifies for which address space
> + *				the request is: pci_config_space, IO register
> + *				space or MMIO space.
> + *				@pos: offset from base address.
> + *				Retuns number on bytes read on success or error.
> + * @write:			Write emulation callback
> + *				@vdev: vgpu device structure
> + *				@buf: write buffer
> + *				@count: number bytes to be written
> + *				@address_space: specifies for which address space
> + *				the request is: pci_config_space, IO register
> + *				space or MMIO space.
> + *				@pos: offset from base address.
> + *				Retuns number on bytes written on success or error.

How do these support multiple MMIO spaces or IO port spaces?  GPUs, and
therefore I assume vGPUs, often have more than one MMIO space, how
does the enum above tell us which one?  We could simply make this be a
region index.

> + * @vgpu_set_irqs:		Called to send about interrupts configuration
> + *				information that qemu set.
> + *				@vdev: vgpu device structure
> + *				@flags, index, start, count and *data : same as
> + *				that of struct vfio_irq_set of
> + *				VFIO_DEVICE_SET_IRQS API.

How do we learn about the supported interrupt types?  Should this be
called vgpu_vfio_set_irqs if it's following the vfio API?

> + * @vgpu_bar_info:		Called to get BAR size and flags of vGPU device.
> + *				@vdev: vgpu device structure
> + *				@bar_index: BAR index
> + *				@bar_info: output, returns size and flags of
> + *				requested BAR
> + *				Returns integer: success (0) or error (< 0)

This is called bar_info, but the bar_index is actually the vfio region
index and things like the config region info is being overloaded
through it.  We already have a structure defined for getting a generic
region index, why not use it?  Maybe this should just be
vgpu_vfio_get_region_info.
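Something like this, for instance (only a sketch of the idea, not a
final signature):

	int (*vgpu_vfio_get_region_info)(struct vgpu_device *vdev,
					 unsigned int index,
					 struct vfio_region_info *info);

Then the vendor driver fills in size/flags/offset per vfio region index
rather than overloading a BAR-specific call.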

> + * @validate_map_request:	Validate remap pfn request
> + *				@vdev: vgpu device structure
> + *				@virtaddr: target user address to start at
> + *				@pfn: physical address of kernel memory, GPU
> + *				driver can change if required.
> + *				@size: size of map area, GPU driver can change
> + *				the size of map area if desired.
> + *				@prot: page protection flags for this mapping,
> + *				GPU driver can change, if required.
> + *				Returns integer: success (0) or error (< 0)

Was not at all clear to me what this did until I got to patch 2, this
is actually providing the fault handling for mmap'ing a vGPU mmio BAR.
Needs a better name or better description.

> + *
> + * Physical GPU that support vGPU should be register with vgpu module with
> + * gpu_device_ops structure.
> + */
> +
> +struct gpu_device_ops {
> +	struct module   *owner;
> +	const struct attribute_group **dev_attr_groups;
> +	const struct attribute_group **vgpu_attr_groups;
> +
> +	int	(*vgpu_supported_config)(struct pci_dev *dev, char *config);
> +	int     (*vgpu_create)(struct pci_dev *dev, uuid_le uuid,
> +			       uint32_t instance, char *vgpu_params);
> +	int     (*vgpu_destroy)(struct pci_dev *dev, uuid_le uuid,
> +			        uint32_t instance);
> +
> +	int     (*vgpu_start)(uuid_le uuid);
> +	int     (*vgpu_shutdown)(uuid_le uuid);
> +
> +	ssize_t (*read) (struct vgpu_device *vdev, char *buf, size_t count,
> +			 uint32_t address_space, loff_t pos);
> +	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
> +			 uint32_t address_space, loff_t pos);

Aren't these really 'enum vgpu_emul_space_e', not uint32_t?

> +	int     (*vgpu_set_irqs)(struct vgpu_device *vdev, uint32_t flags,
> +				 unsigned index, unsigned start, unsigned count,
> +				 void *data);
> +	int	(*vgpu_bar_info)(struct vgpu_device *vdev, int bar_index,
> +				 struct pci_bar_info *bar_info);
> +	int	(*validate_map_request)(struct vgpu_device *vdev,
> +					unsigned long virtaddr,
> +					unsigned long *pfn, unsigned long *size,
> +					pgprot_t *prot);
> +};
> +
> +/*
> + * Physical GPU
> + */
> +struct gpu_device {
> +	struct pci_dev                  *dev;
> +	const struct gpu_device_ops     *ops;
> +	struct list_head                gpu_next;
> +};
> +
> +/**
> + * struct vgpu_driver - vGPU device driver
> + * @name: driver name
> + * @probe: called when new device created
> + * @remove: called when device removed
> + * @driver: device driver structure
> + *
> + **/
> +struct vgpu_driver {
> +	const char *name;
> +	int  (*probe)  (struct device *dev);
> +	void (*remove) (struct device *dev);
> +	struct device_driver	driver;
> +};
> +
> +static inline struct vgpu_driver *to_vgpu_driver(struct device_driver *drv)
> +{
> +	return drv ? container_of(drv, struct vgpu_driver, driver) : NULL;
> +}
> +
> +static inline struct vgpu_device *to_vgpu_device(struct device *dev)
> +{
> +	return dev ? container_of(dev, struct vgpu_device, dev) : NULL;
> +}
> +
> +extern struct bus_type vgpu_bus_type;
> +
> +#define dev_is_vgpu(d) ((d)->bus == &vgpu_bus_type)
> +
> +extern int  vgpu_register_device(struct pci_dev *dev,
> +				 const struct gpu_device_ops *ops);
> +extern void vgpu_unregister_device(struct pci_dev *dev);
> +
> +extern int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
> +extern void vgpu_unregister_driver(struct vgpu_driver *drv);
> +
> +extern int vgpu_map_virtual_bar(uint64_t virt_bar_addr, uint64_t phys_bar_addr,
> +				uint32_t len, uint32_t flags);
> +extern int vgpu_dma_do_translate(dma_addr_t * gfn_buffer, uint32_t count);
> +
> +struct vgpu_device *get_vgpu_device_from_group(struct iommu_group *group);
> +
> +#endif /* VGPU_H */


The sysfs ABI needs to be documented in
Documentation/ABI/testing/sysfs-vgpu.  This is particularly important
for things like the format used for the create/destroy interfaces.
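For example, roughly along these lines (purely illustrative; the exact
sysfs path and string format are whatever the series settles on):

What:		/sys/bus/pci/devices/.../vgpu_create
Date:		May 2016
Description:
		Writing a string identifying the vGPU (e.g. a VM UUID,
		instance number and type parameters) creates a vGPU
		instance on this physical GPU.
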
Thanks,

Alex

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 0/3] Add vGPU support
  2016-05-02 18:40 ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-04  1:05   ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-04  1:05 UTC (permalink / raw)
  To: Kirti Wankhede, alex.williamson, pbonzini, kraxel, cjia
  Cc: qemu-devel, kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan

> From: Kirti Wankhede
> Sent: Tuesday, May 03, 2016 2:41 AM
> 
> This series adds vGPU support to v4.6 Linux host kernel. Purpose of this series
> is to provide a common interface for vGPU management that can be used
> by different GPU drivers. This series introduces vGPU core module that create
> and manage vGPU devices, VFIO based driver for vGPU devices that are created by
> vGPU core module and update VFIO type1 IOMMU module to support vGPU devices.
> 
> What's new in v3?
> VFIO type1 IOMMU module supports devices which are IOMMU capable. This version
> of patched adds support for vGPU devices, which are not IOMMU capable, to use
> existing VFIO IOMMU module. VFIO Type1 IOMMU patch provide new set of APIs for
> guest page translation.
> 
> What's left to do?
> VFIO driver for vGPU device doesn't support devices with MSI-X enabled.
> 
> Please review.
> 

Thanks Kirti/Neo for your nice work! We are integrating this common
framework with KVMGT. Once ready it'll be released as an experimental
feature in our next community release.

One curious question. There are some additional changes on our side.
What is the best way to coordinate our efforts before this series is
accepted in the upstream kernel? Do you prefer to receive patches from
us directly, or to have the code hosted somewhere both sides can contribute?

Of course we'll conduct high-level discussions of our changes and reach
agreement first before merging with your code.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-03 22:43     ` [Qemu-devel] " Alex Williamson
@ 2016-05-04  2:45       ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-04  2:45 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan, Shuai, Song, Jike,
	Lv, Zhiyuan

> From: Alex Williamson
> Sent: Wednesday, May 04, 2016 6:44 AM
> 
> > diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
> > new file mode 100644
> > index 0000000..792eb48
> > --- /dev/null
> > +++ b/drivers/vgpu/Kconfig
> > @@ -0,0 +1,21 @@
> > +
> > +menuconfig VGPU
> > +    tristate "VGPU driver framework"
> > +    depends on VFIO
> > +    select VGPU_VFIO
> > +    help
> > +        VGPU provides a framework to virtualize GPU without SR-IOV cap
> > +        See Documentation/vgpu.txt for more details.
> > +
> > +        If you don't know what do here, say N.
> > +
> > +config VGPU
> > +    tristate
> > +    depends on VFIO
> > +    default n
> > +
> > +config VGPU_VFIO
> > +    tristate
> > +    depends on VGPU
> > +    default n
> > +
> 
> This is a little bit convoluted, it seems like everything added in this
> patch is vfio agnostic, it doesn't necessarily care what the consumer
> is.  That makes me think we should only be adding CONFIG_VGPU here and
> it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
> The middle config entry is also redundant to the first, just move the
> default line up to the first and remove the rest.

Agree. Removing such a dependency also benefits other hypervisors where
VFIO is not used.

Alex, there is one idea on which I'd like to hear your comment. Looking at
the whole series, the majority of the logic (maybe I cannot say 100%)
is GPU agnostic. The same frameworks in VFIO and the vGPU core are actually
neutral to the underlying device type, and could e.g. easily be applied to a
NIC card too if a similar technology were developed there.

Do you think we'd better make the framework non-GPU-specific now
(mostly a naming change), or continue with the current style and change it
later, only when there is a real implementation for a different device?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-03 22:43     ` [Qemu-devel] " Alex Williamson
@ 2016-05-04  2:58       ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-04  2:58 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan, Shuai, Song, Jike,
	Lv, Zhiyuan

> From: Alex Williamson
> Sent: Wednesday, May 04, 2016 6:44 AM
> > +/**
> > + * struct gpu_device_ops - Structure to be registered for each physical GPU to
> > + * register the device to vgpu module.
> > + *
> > + * @owner:			The module owner.
> > + * @dev_attr_groups:		Default attributes of the physical device.
> > + * @vgpu_attr_groups:		Default attributes of the vGPU device.
> > + * @vgpu_supported_config:	Called to get information about supported vgpu types.
> > + *				@dev : pci device structure of physical GPU.
> > + *				@config: should return string listing supported config
> > + *				Returns integer: success (0) or error (< 0)
> > + * @vgpu_create:		Called to allocate basic resouces in graphics

It's redundant to have vgpu prefixed to every op here. The same comment
applies to the later sysfs nodes.

> > + *				driver for a particular vgpu.
> > + *				@dev: physical pci device structure on which vgpu
> > + *				      should be created
> > + *				@uuid: VM's uuid for which VM it is intended to
> > + *				@instance: vgpu instance in that VM

I didn't quite get @instance here. Is it some vendor-specific identifier
used to indicate a particular vgpu?

> > + *				@vgpu_params: extra parameters required by GPU driver.
> > + *				Returns integer: success (0) or error (< 0)
> > + * @vgpu_destroy:		Called to free resources in graphics driver for
> > + *				a vgpu instance of that VM.
> > + *				@dev: physical pci device structure to which
> > + *				this vgpu points to.
> > + *				@uuid: VM's uuid for which the vgpu belongs to.
> > + *				@instance: vgpu instance in that VM
> > + *				Returns integer: success (0) or error (< 0)
> > + *				If VM is running and vgpu_destroy is called that
> > + *				means the vGPU is being hotunpluged. Return error
> > + *				if VM is running and graphics driver doesn't
> > + *				support vgpu hotplug.
> > + * @vgpu_start:			Called to do initiate vGPU initialization
> > + *				process in graphics driver when VM boots before
> > + *				qemu starts.
> > + *				@uuid: VM's UUID which is booting.
> > + *				Returns integer: success (0) or error (< 0)
> > + * @vgpu_shutdown:		Called to teardown vGPU related resources for
> > + *				the VM
> > + *				@uuid: VM's UUID which is shutting down .
> > + *				Returns integer: success (0) or error (< 0)

Can you give a specific example of the difference between destroy
and shutdown? We want to map it correctly onto our side, e.g. whether we
need to implement both or just one of them.

Another optional op would be 'stop', allowing the physical device to stop
scheduling the vGPU, including waiting for in-flight DMA to complete. It
would be useful for supporting VM live migration with a vGPU assigned.
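(I.e. roughly an extra callback along the lines of

	int (*vgpu_stop)(struct vgpu_device *vdev);

just to illustrate the idea; the exact prototype is open.)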

> > + * @read:			Read emulation callback
> > + *				@vdev: vgpu device structure
> > + *				@buf: read buffer
> > + *				@count: number bytes to read
> > + *				@address_space: specifies for which address space
> > + *				the request is: pci_config_space, IO register
> > + *				space or MMIO space.
> > + *				@pos: offset from base address.
> > + *				Retuns number on bytes read on success or error.
> > + * @write:			Write emulation callback
> > + *				@vdev: vgpu device structure
> > + *				@buf: write buffer
> > + *				@count: number bytes to be written
> > + *				@address_space: specifies for which address space
> > + *				the request is: pci_config_space, IO register
> > + *				space or MMIO space.
> > + *				@pos: offset from base address.
> > + *				Retuns number on bytes written on success or error.
> 
> How do these support multiple MMIO spaces or IO port spaces?  GPUs, and
> therefore I assume vGPUs, often have more than one MMIO space, how
> does the enum above tell us which one?  We could simply make this be a
> region index.
> 
> > + * @vgpu_set_irqs:		Called to send about interrupts configuration
> > + *				information that qemu set.
> > + *				@vdev: vgpu device structure
> > + *				@flags, index, start, count and *data : same as
> > + *				that of struct vfio_irq_set of
> > + *				VFIO_DEVICE_SET_IRQS API.
> 
> How do we learn about the supported interrupt types?  Should this be
> called vgpu_vfio_set_irqs if it's following the vfio API?
> 
> > + * @vgpu_bar_info:		Called to get BAR size and flags of vGPU device.
> > + *				@vdev: vgpu device structure
> > + *				@bar_index: BAR index
> > + *				@bar_info: output, returns size and flags of
> > + *				requested BAR
> > + *				Returns integer: success (0) or error (< 0)
> 
> This is called bar_info, but the bar_index is actually the vfio region
> index and things like the config region info is being overloaded
> through it.  We already have a structure defined for getting a generic
> region index, why not use it?  Maybe this should just be
> vgpu_vfio_get_region_info.

Or is it more extensible to allow reporting the whole vconfig space?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 2/3] VFIO driver for vGPU device
  2016-05-03 22:43     ` [Qemu-devel] " Alex Williamson
@ 2016-05-04  3:23       ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-04  3:23 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan, Shuai, Song, Jike,
	Lv, Zhiyuan

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, May 04, 2016 6:43 AM
> > +
> > +		if (gpu_dev->ops->write) {
> > +			ret = gpu_dev->ops->write(vgpu_dev,
> > +						  user_data,
> > +						  count,
> > +						  vgpu_emul_space_config,
> > +						  pos);
> > +		}
> > +
> > +		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);
> 
> So write is expected to user_data to allow only the writable bits to be
> changed?  What's really being saved in the vconfig here vs the vendor
> vgpu driver?  It seems like we're only using it to cache the BAR
> values, but we're not providing the BAR emulation here, which seems
> like one of the few things we could provide so it's not duplicated in
> every vendor driver.  But then we only need a few u32s to do that, not
> all of config space.

We can borrow the same vconfig emulation from the existing vfio-pci driver.
But doing so doesn't mean that the vendor vgpu driver cannot additionally
have its own vconfig emulation. A vGPU is not like a real device, since
there may be no physical config space implemented for each vGPU.
So the vendor vGPU driver needs to create/emulate the virtualized config
space anyway, while the way it is created may be vendor specific.
So it is better to keep an interface through which the vendor vGPU driver
can access the raw vconfig space.
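If the core did take over that common part, a rough sketch of the BAR write
handling (assuming bar_info caches the BAR size and the read-only low flag
bits; purely illustrative, not a proposal of the final code) could be:

	u32 mask   = ~((u32)vdev->bar_info[idx].size - 1);
	u32 lobits = vdev->bar_info[idx].flags & ~PCI_BASE_ADDRESS_MEM_MASK;
	*(u32 *)(vdev->vconfig + pos) = (val & mask) | lobits;

and everything beyond the BARs could stay with the vendor vGPU driver.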

> > +static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
> > +		size_t count, loff_t *ppos, bool iswrite)
> > +{
> > +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> > +	struct vfio_vgpu_device *vdev = device_data;
> > +
> > +	if (index >= VFIO_PCI_NUM_REGIONS)
> > +		return -EINVAL;
> > +
> > +	switch (index) {
> > +	case VFIO_PCI_CONFIG_REGION_INDEX:
> > +		return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
> > +
> > +	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> > +		return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
> > +
> > +	case VFIO_PCI_ROM_REGION_INDEX:
> > +	case VFIO_PCI_VGA_REGION_INDEX:
> 
> Wait a sec, who's doing the VGA emulation?  We can't be claiming to
> support a VGA region and then fail to provide read/write access to it
> like we said it has.

On the Intel side we plan not to support the VGA region when upstreaming our
KVMGT work, which means an Intel vGPU will be exposed only as a
secondary graphics card, so legacy VGA is not required. There is also no
VBIOS/ROM requirement. I guess we can remove the above two regions.

> > +
> > +static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > +	int ret = 0;
> > +	struct vfio_vgpu_device *vdev = vma->vm_private_data;
> > +	struct vgpu_device *vgpu_dev;
> > +	struct gpu_device *gpu_dev;
> > +	u64 virtaddr = (u64)vmf->virtual_address;
> > +	u64 offset, phyaddr;
> > +	unsigned long req_size, pgoff;
> > +	pgprot_t pg_prot;
> > +
> > +	if (!vdev && !vdev->vgpu_dev)
> > +		return -EINVAL;
> > +
> > +	vgpu_dev = vdev->vgpu_dev;
> > +	gpu_dev  = vgpu_dev->gpu_dev;
> > +
> > +	offset   = vma->vm_pgoff << PAGE_SHIFT;
> > +	phyaddr  = virtaddr - vma->vm_start + offset;
> > +	pgoff    = phyaddr >> PAGE_SHIFT;
> > +	req_size = vma->vm_end - virtaddr;
> > +	pg_prot  = vma->vm_page_prot;
> > +
> > +	if (gpu_dev->ops->validate_map_request) {
> > +		ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr, &pgoff,
> > +							 &req_size, &pg_prot);
> > +		if (ret)
> > +			return ret;
> > +
> > +		if (!req_size)
> > +			return -EINVAL;
> > +	}
> > +
> > +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
> 
> So not supporting validate_map_request() means that the user can
> directly mmap BARs of the host GPU and as shown below, we assume a 1:1
> mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
> scenario or should this callback be required?  It's not clear to me how
> the vendor driver determines what this maps to, do they compare it to
> the physical device's own BAR addresses?

I didn't quite understand this either. Based on the earlier discussion, do we
need something like this, or could the purpose be achieved just by leveraging
the recent sparse mmap support?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-03 22:43     ` [Qemu-devel] " Alex Williamson
@ 2016-05-04  3:39       ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-04  3:39 UTC (permalink / raw)
  To: Alex Williamson, Kirti Wankhede
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan, Shuai, Song, Jike,
	Lv, Zhiyuan

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, May 04, 2016 6:43 AM
> > +		             int prot, unsigned long *pfn_base)
> >  {
> > +	struct vfio_domain *domain = domain_data;
> >  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >  	bool lock_cap = capable(CAP_IPC_LOCK);
> >  	long ret, i;
> >  	bool rsvd;
> > +	struct mm_struct *mm;
> >
> > -	if (!current->mm)
> > +	if (!domain)
> >  		return -ENODEV;
> >
> > -	ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> > +	if (domain->vfio_iommu_api_only)
> > +		mm = domain->vmm_mm;
> > +	else
> > +		mm = current->mm;
> > +
> > +	if (!mm)
> > +		return -ENODEV;
> > +
> > +	ret = vaddr_get_pfn(mm, vaddr, prot, pfn_base);
> 
> We could pass domain->mm unconditionally to vaddr_get_pfn(), let it be
> NULL in the !api_only case and use it as a cue to vaddr_get_pfn() which
> gup variant to use.  Of course we need to deal with mmap_sem somewhere
> too without turning the code into swiss cheese.
> 
> Correct me if I'm wrong, but I assume the main benefit of interweaving
> this into type1 vs pulling out common code and making a new vfio iommu
> backend is the page accounting, ie. not over accounting locked pages.
> TBH, I don't know if it's worth it.  Any idea what the high water mark
> of pinned pages for a vgpu might be?

The baseline is the same as today's PCI device passthrough, i.e. we need to
pin all memory pages allocated to the VM, at least for current KVMGT.
Ideally we may reduce the pinned set based on fine-grained resource
tracking within the vGPU device model (then it might be in the hundreds of
MBs, based on the active graphics memory working set).
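(As a rough illustration of the difference, assuming 4KiB pages: pinning a
whole 4GB guest is about one million pinned pages, while a ~100MB working
set is on the order of 25 thousand pages.)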

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 0/3] Add vGPU support
  2016-05-04  1:05   ` [Qemu-devel] " Tian, Kevin
@ 2016-05-04  6:17     ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-04  6:17 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, alex.williamson, pbonzini, kraxel, qemu-devel,
	kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Wed, May 04, 2016 at 01:05:36AM +0000, Tian, Kevin wrote:
> > From: Kirti Wankhede
> > Sent: Tuesday, May 03, 2016 2:41 AM
> > 
> > This series adds vGPU support to v4.6 Linux host kernel. Purpose of this series
> > is to provide a common interface for vGPU management that can be used
> > by different GPU drivers. This series introduces vGPU core module that create
> > and manage vGPU devices, VFIO based driver for vGPU devices that are created by
> > vGPU core module and update VFIO type1 IOMMU module to support vGPU devices.
> > 
> > What's new in v3?
> > VFIO type1 IOMMU module supports devices which are IOMMU capable. This version
> > of patched adds support for vGPU devices, which are not IOMMU capable, to use
> > existing VFIO IOMMU module. VFIO Type1 IOMMU patch provide new set of APIs for
> > guest page translation.
> > 
> > What's left to do?
> > VFIO driver for vGPU device doesn't support devices with MSI-X enabled.
> > 
> > Please review.
> > 
> 
> Thanks Kirti/Neo for your nice work! We are integrating this common
> framework with KVMGT. Once ready it'll be released as an experimental
> feature in our next community release.
> 
> One curious question. There are some additional changes in our side.
> What is the best way to collaborate our effort before this series is
> accepted in upstream kernel? Do you prefer to receiving patches from
> us directly, or having it hosted some place so both sides can contribute?

Yes, sending it directly to Kirti and myself will work best; we can sort
out the process offline.

Thanks,
Neo

> 
> Of course we'll conduct high-level discussions of our changes and reach
> agreement first before merging with your code.
> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-03 22:43     ` [Qemu-devel] " Alex Williamson
@ 2016-05-04 13:31       ` Kirti Wankhede
  -1 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-04 13:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv

Thanks Alex.

 >> +config VGPU_VFIO
 >> +    tristate
 >> +    depends on VGPU
 >> +    default n
 >> +
 >
 > This is a little bit convoluted, it seems like everything added in this
 > patch is vfio agnostic, it doesn't necessarily care what the consumer
 > is.  That makes me think we should only be adding CONFIG_VGPU here and
 > it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
 > The middle config entry is also redundant to the first, just move the
 > default line up to the first and remove the rest.

CONFIG_VGPU doesn't directly depend on VFIO; CONFIG_VGPU_VFIO is 
directly dependent on VFIO. But the devices created by the VGPU core module 
need a driver to manage them, and CONFIG_VGPU_VFIO is the driver which 
will manage vgpu devices. So I think CONFIG_VGPU_VFIO should be enabled 
by CONFIG_VGPU.

This would look like:
menuconfig VGPU
     tristate "VGPU driver framework"
     select VGPU_VFIO
     default n
     help
         VGPU provides a framework to virtualize GPU without SR-IOV cap
         See Documentation/vgpu.txt for more details.

         If you don't know what do here, say N.

config VGPU_VFIO
     tristate
     depends on VGPU
     depends on VFIO
     default n



 >> +int vgpu_register_device(struct pci_dev *dev, const struct 
gpu_device_ops *ops)
 >
 > Why do we care that it's a pci_dev?  It seems like there's only a very
 > small portion of the API that cares about pci_devs in order to describe
 > BARs, which could be switched based on the device type.  Otherwise we
 > could operate on a struct device here.
 >

GPUs are PCI devices, hence pci_dev was used. I agree with you; I'll change 
it to operate on struct device and add checks in vgpu_vfio.c, where the 
config space and BARs are populated.
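Something like this is what I have in mind (a sketch of the intended change
only, not the final prototypes):

	int  vgpu_register_device(struct device *dev,
				  const struct gpu_device_ops *ops);
	void vgpu_unregister_device(struct device *dev);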

 >> +static void vgpu_device_free(struct vgpu_device *vgpu_dev)
 >> +{
 >> +	if (vgpu_dev) {
 >> +		mutex_lock(&vgpu.vgpu_devices_lock);
 >> +		list_del(&vgpu_dev->list);
 >> +		mutex_unlock(&vgpu.vgpu_devices_lock);
 >> +		kfree(vgpu_dev);
 >> +	}
 >
 > Why aren't we using the kref to remove and free the vgpu when the last
 > reference is released?

vgpu_device_free() is called from 2 places:
1. from create_vgpu_device(), when device_register() fails.
2. from vgpu_device_release(), which is set as the release function of the 
device registered by device_register():
vgpu_dev->dev.release = vgpu_device_release;
This release function is called from the device_unregister() kernel path, 
which uses the struct device's kref to invoke the release function. I don't 
think I need to do it again.


 >> +struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int 
instance)
 >> +{
 >> +	struct vgpu_device *vdev = NULL;
 >> +
 >> +	mutex_lock(&vgpu.vgpu_devices_lock);
 >> +	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
 >> +		if ((uuid_le_cmp(vdev->uuid, uuid) == 0) &&
 >> +		    (vdev->vgpu_instance == instance)) {
 >> +			mutex_unlock(&vgpu.vgpu_devices_lock);
 >> +			return vdev;
 >
 > We're not taking any sort of reference to the vgpu, what prevents races
 > with it being removed?  A common exit path would be easy to achieve
 > here too.
 >

I'll add reference count.


 >> +create_attr_error:
 >> +	if (gpu_dev->ops->vgpu_destroy) {
 >> +		int ret = 0;
 >> +		ret = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
 >> +						 vgpu_dev->uuid,
 >> +						 vgpu_dev->vgpu_instance);
 >
 > Unnecessary initialization and we don't do anything with the result.
 > Below indicates lack of vgpu_destroy indicates the vendor doesn't
 > support unplug, but doesn't that break our error cleanup path here?
 >

Comment about vgpu_destroy:
If VM is running and vgpu_destroy is called that 
means the vGPU is being hotunpluged. Return
error if VM is running and graphics driver
doesn't support vgpu hotplug.

It's the GPU driver's responsibility to check whether the VM is running and 
return accordingly. This is the vgpu creation path; the vgpu device would be 
hotplugged into the VM on vgpu_start.

 >> +		retval = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
 >> +						    vgpu_dev->uuid,
 >> +						    vgpu_dev->vgpu_instance);
 >> +	/* if vendor driver doesn't return success that means vendor 
driver doesn't
 >> +	 * support hot-unplug */
 >> +		if (retval)
 >> +			return;
 >
 > Should we return an error code then?  Inconsistent comment style.
 >

destroy_vgpu_device() is called from
- release_vgpubus_dev(), which is a release function of vgpu_class and 
has return type void.
- vgpu_unregister_device(), which also returns void

Even if an error code were returned from here, it would not be used.

I'll change the comment style in the next patch update.



 >> + * @write:			Write emulation callback
 >> + *				@vdev: vgpu device structure
 >> + *				@buf: write buffer
 >> + *				@count: number bytes to be written
 >> + *				@address_space: specifies for which address space
 >> + *				the request is: pci_config_space, IO register
 >> + *				space or MMIO space.
 >> + *				@pos: offset from base address.
 >> + *				Retuns number on bytes written on success or error.
 >
 > How do these support multiple MMIO spaces or IO port spaces?  GPUs, and
 > therefore I assume vGPUs, often have more than one MMIO space, how
 > does the enum above tell us which one?  We could simply make this be a
 > region index.
 >

address_space should be one of these; yes, address_space is of type
'enum vgpu_emul_space_e'. I will change uint32_t to vgpu_emul_space_e.
enum vgpu_emul_space_e {
         vgpu_emul_space_config = 0, /*!< PCI configuration space */
         vgpu_emul_space_io = 1,     /*!< I/O register space */
         vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
};

If you look at vgpu_dev_bar_rw() in the 2nd patch, in vgpu_vfio.c:
int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);

@pos, the offset from the base address, is calculated as:
pos = vdev->bar_info[bar_index].start + offset

The GPU driver can find the BAR index from 'pos'.

 >> + * @vgpu_set_irqs:		Called to send about interrupts configuration
 >> + *				information that qemu set.
 >> + *				@vdev: vgpu device structure
 >> + *				@flags, index, start, count and *data : same as
 >> + *				that of struct vfio_irq_set of
 >> + *				VFIO_DEVICE_SET_IRQS API.
 >
 > How do we learn about the supported interrupt types?  Should this be
 > called vgpu_vfio_set_irqs if it's following the vfio API?
 >

The GPU driver provides the config space of the vgpu device.
QEMU learns about the supported interrupt types by reading the vgpu device's
config space and checking its PCI capabilities.
Yes, this follows the vfio API; I will change the function name.

 >> + * @vgpu_bar_info:		Called to get BAR size and flags of vGPU device.
 >> + *				@vdev: vgpu device structure
 >> + *				@bar_index: BAR index
 >> + *				@bar_info: output, returns size and flags of
 >> + *				requested BAR
 >> + *				Returns integer: success (0) or error (< 0)
 >
 > This is called bar_info, but the bar_index is actually the vfio region
 > index and things like the config region info is being overloaded
 > through it.  We already have a structure defined for getting a generic
 > region index, why not use it?  Maybe this should just be
 > vgpu_vfio_get_region_info.
 >

Ok. Will do.


 >> + * @validate_map_request:	Validate remap pfn request
 >> + *				@vdev: vgpu device structure
 >> + *				@virtaddr: target user address to start at
 >> + *				@pfn: physical address of kernel memory, GPU
 >> + *				driver can change if required.
 >> + *				@size: size of map area, GPU driver can change
 >> + *				the size of map area if desired.
 >> + *				@prot: page protection flags for this mapping,
 >> + *				GPU driver can change, if required.
 >> + *				Returns integer: success (0) or error (< 0)
 >
 > Was not at all clear to me what this did until I got to patch 2, this
 > is actually providing the fault handling for mmap'ing a vGPU mmio BAR.
 > Needs a better name or better description.
 >

Say the VMM mmaps the whole of the GPU's BAR1, e.g. 128MB. A fault occurs 
when BAR1 is accessed, and the size is calculated as:
req_size = vma->vm_end - virtaddr
Since the GPU is shared by multiple vGPUs, the GPU driver might not want to 
remap the whole of BAR1 for only one vGPU device, and would prefer to map, 
say, one page at a time. The GPU driver then returns PAGE_SIZE, which is 
used by remap_pfn_range(). On the next access to a BAR1 page other than 
that one, we will again get a fault().
As the name says, this call lets the GPU driver validate the size and prot 
of the map area; the GPU driver can change the size and prot for this mapping.
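For example, a vendor implementation could look roughly like this (a sketch
only; my_vgpu_bar1_pfn() is a hypothetical lookup inside the vendor driver):

	static int my_validate_map_request(struct vgpu_device *vdev,
					   unsigned long virtaddr,
					   unsigned long *pfn,
					   unsigned long *size,
					   pgprot_t *prot)
	{
		/* back the faulting page with the host pfn owned by this vGPU */
		*pfn  = my_vgpu_bar1_pfn(vdev, virtaddr);
		/* map a single page per fault instead of the whole BAR */
		*size = PAGE_SIZE;
		return 0;
	}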


 >> +	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
 >> +			 uint32_t address_space, loff_t pos);
 >
 > Aren't these really 'enum vgpu_emul_space_e', not uint32_t?
 >

Yes. I'll change to enum vgpu_emul_space_e.

Thanks,
Kirti.



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 1/3] vGPU Core driver
@ 2016-05-04 13:31       ` Kirti Wankhede
  0 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-04 13:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv

Thanks Alex.

 >> +config VGPU_VFIO
 >> +    tristate
 >> +    depends on VGPU
 >> +    default n
 >> +
 >
 > This is a little bit convoluted, it seems like everything added in this
 > patch is vfio agnostic, it doesn't necessarily care what the consumer
 > is.  That makes me think we should only be adding CONFIG_VGPU here and
 > it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
 > The middle config entry is also redundant to the first, just move the
 > default line up to the first and remove the rest.

CONFIG_VGPU doesn't directly depend on VFIO. CONFIG_VGPU_VFIO is 
directly dependent on VFIO. But devices created by VGPU core module need 
a driver to manage those devices. CONFIG_VGPU_VFIO is the driver which 
will manage vgpu devices. So I think CONFIG_VGPU_VFIO should be enabled 
by CONFIG_VGPU.

This would look like:
menuconfig VGPU
     tristate "VGPU driver framework"
     select VGPU_VFIO
     default n
     help
         VGPU provides a framework to virtualize GPU without SR-IOV cap
         See Documentation/vgpu.txt for more details.

         If you don't know what do here, say N.

config VGPU_VFIO
     tristate
     depends on VGPU
     depends on VFIO
     default n



 >> +int vgpu_register_device(struct pci_dev *dev, const struct 
gpu_device_ops *ops)
 >
 > Why do we care that it's a pci_dev?  It seems like there's only a very
 > small portion of the API that cares about pci_devs in order to describe
 > BARs, which could be switched based on the device type.  Otherwise we
 > could operate on a struct device here.
 >

GPUs are PCI devices, hence pci_dev was used. I agree with you; I'll 
change it to operate on struct device and add checks in vgpu_vfio.c 
where the config space and BARs are populated.

 >> +static void vgpu_device_free(struct vgpu_device *vgpu_dev)
 >> +{
 >> +	if (vgpu_dev) {
 >> +		mutex_lock(&vgpu.vgpu_devices_lock);
 >> +		list_del(&vgpu_dev->list);
 >> +		mutex_unlock(&vgpu.vgpu_devices_lock);
 >> +		kfree(vgpu_dev);
 >> +	}
 >
 > Why aren't we using the kref to remove and free the vgpu when the last
 > reference is released?

vgpu_device_free() is called from 2 places:
1. from create_vgpu_device(), when device_register() fails.
2. from vgpu_device_release(), which is set as the release function for 
the device registered by device_register():
vgpu_dev->dev.release = vgpu_device_release;
That release function is invoked from the device_unregister() path, 
which uses the device's kref to call it, so I don't think I need to do 
it again here.
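
A condensed sketch of that flow, as described above (not a verbatim
excerpt from the patch):

	vgpu_dev->dev.release = vgpu_device_release;
	ret = device_register(&vgpu_dev->dev);
	if (ret)
		vgpu_device_free(vgpu_dev);	/* registration failed */
	...
	/* later, on destroy: */
	device_unregister(&vgpu_dev->dev);	/* drops the last kref and calls
						 * vgpu_device_release(), which
						 * frees the vgpu_device */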


 >> +struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int 
instance)
 >> +{
 >> +	struct vgpu_device *vdev = NULL;
 >> +
 >> +	mutex_lock(&vgpu.vgpu_devices_lock);
 >> +	list_for_each_entry(vdev, &vgpu.vgpu_devices_list, list) {
 >> +		if ((uuid_le_cmp(vdev->uuid, uuid) == 0) &&
 >> +		    (vdev->vgpu_instance == instance)) {
 >> +			mutex_unlock(&vgpu.vgpu_devices_lock);
 >> +			return vdev;
 >
 > We're not taking any sort of reference to the vgpu, what prevents races
 > with it being removed?  A common exit path would be easy to achieve
 > here too.
 >

I'll add a reference count.


 >> +create_attr_error:
 >> +	if (gpu_dev->ops->vgpu_destroy) {
 >> +		int ret = 0;
 >> +		ret = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
 >> +						 vgpu_dev->uuid,
 >> +						 vgpu_dev->vgpu_instance);
 >
 > Unnecessary initialization and we don't do anything with the result.
 > Below indicates lack of vgpu_destroy indicates the vendor doesn't
 > support unplug, but doesn't that break our error cleanup path here?
 >

The comment about vgpu_destroy says:
If the VM is running and vgpu_destroy is called, that
means the vGPU is being hot-unplugged. Return an
error if the VM is running and the graphics driver
doesn't support vgpu hotplug.

It's the GPU driver's responsibility to check whether a VM is running 
and return accordingly. This is the vgpu creation path; the vgpu device 
would be hot-plugged into the VM on vgpu_start.

 >> +		retval = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
 >> +						    vgpu_dev->uuid,
 >> +						    vgpu_dev->vgpu_instance);
 >> +	/* if vendor driver doesn't return success that means vendor 
driver doesn't
 >> +	 * support hot-unplug */
 >> +		if (retval)
 >> +			return;
 >
 > Should we return an error code then?  Inconsistent comment style.
 >

destroy_vgpu_device() is called from:
- release_vgpubus_dev(), which is the release function of vgpu_class and 
has return type void.
- vgpu_unregister_device(), which also returns void.

Even if an error code were returned from here, it would not be used.

I'll change the comment style in the next patch update.



 >> + * @write:			Write emulation callback
 >> + *				@vdev: vgpu device structure
 >> + *				@buf: write buffer
 >> + *				@count: number bytes to be written
 >> + *				@address_space: specifies for which address space
 >> + *				the request is: pci_config_space, IO register
 >> + *				space or MMIO space.
 >> + *				@pos: offset from base address.
 >> + *				Retuns number on bytes written on success or error.
 >
 > How do these support multiple MMIO spaces or IO port spaces?  GPUs, and
 > therefore I assume vGPUs, often have more than one MMIO space, how
 > does the enum above tell us which one?  We could simply make this be a
 > region index.
 >

address_space should be one of these; yes, address_space is of type
'enum vgpu_emul_space_e', so I will change uint32_t to vgpu_emul_space_e:
enum vgpu_emul_space_e {
         vgpu_emul_space_config = 0, /*!< PCI configuration space */
         vgpu_emul_space_io = 1,     /*!< I/O register space */
         vgpu_emul_space_mmio = 2    /*!< Memory-mapped I/O space */
};

If you look at vgpu_dev_bar_rw() in the 2nd patch, in vgpu_vfio.c:
int bar_index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);

@pos, the offset from the base address, is calculated as:
pos = vdev->bar_info[bar_index].start + offset

The GPU driver can then find the BAR index from 'pos'.
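
For illustration, the vendor driver could recover the index with
something like this (caching bar_info on the vendor side, mirroring what
it reported through vgpu_bar_info(), is my assumption here):

static int my_bar_index_from_pos(struct my_vgpu_state *s, loff_t pos)
{
	int i;

	for (i = VFIO_PCI_BAR0_REGION_INDEX; i <= VFIO_PCI_BAR5_REGION_INDEX; i++) {
		if (pos >= s->bar_info[i].start &&
		    pos < s->bar_info[i].start + s->bar_info[i].size)
			return i;
	}

	return -EINVAL;
}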

 >> + * @vgpu_set_irqs:		Called to send about interrupts configuration
 >> + *				information that qemu set.
 >> + *				@vdev: vgpu device structure
 >> + *				@flags, index, start, count and *data : same as
 >> + *				that of struct vfio_irq_set of
 >> + *				VFIO_DEVICE_SET_IRQS API.
 >
 > How do we learn about the supported interrupt types?  Should this be
 > called vgpu_vfio_set_irqs if it's following the vfio API?
 >

The GPU driver provides the config space of the vgpu device.
QEMU learns about the supported interrupt types by reading the vgpu 
device's config space and checking its PCI capabilities.
Yes, this follows the vfio API; I will change the function name.
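
As a hedged example (the 0x60 offset, the single-vector MSI layout and
the little-endian byte access are assumptions of mine, purely for
illustration), a vendor driver could advertise MSI in its emulated
config space roughly like this:

#define MY_MSI_CAP_OFFSET	0x60	/* arbitrary unused offset */

vconfig[PCI_STATUS] |= PCI_STATUS_CAP_LIST;		/* capability list present */
vconfig[PCI_CAPABILITY_LIST] = MY_MSI_CAP_OFFSET;
vconfig[MY_MSI_CAP_OFFSET + PCI_CAP_LIST_ID] = PCI_CAP_ID_MSI;
vconfig[MY_MSI_CAP_OFFSET + PCI_CAP_LIST_NEXT] = 0;	/* end of list */
*(u16 *)&vconfig[MY_MSI_CAP_OFFSET + PCI_MSI_FLAGS] = PCI_MSI_FLAGS_64BIT;

QEMU then discovers MSI support simply by walking the capability list it
reads from the emulated config space.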

 >> + * @vgpu_bar_info:		Called to get BAR size and flags of vGPU device.
 >> + *				@vdev: vgpu device structure
 >> + *				@bar_index: BAR index
 >> + *				@bar_info: output, returns size and flags of
 >> + *				requested BAR
 >> + *				Returns integer: success (0) or error (< 0)
 >
 > This is called bar_info, but the bar_index is actually the vfio region
 > index and things like the config region info is being overloaded
 > through it.  We already have a structure defined for getting a generic
 > region index, why not use it?  Maybe this should just be
 > vgpu_vfio_get_region_info.
 >

Ok. Will do.


 >> + * @validate_map_request:	Validate remap pfn request
 >> + *				@vdev: vgpu device structure
 >> + *				@virtaddr: target user address to start at
 >> + *				@pfn: physical address of kernel memory, GPU
 >> + *				driver can change if required.
 >> + *				@size: size of map area, GPU driver can change
 >> + *				the size of map area if desired.
 >> + *				@prot: page protection flags for this mapping,
 >> + *				GPU driver can change, if required.
 >> + *				Returns integer: success (0) or error (< 0)
 >
 > Was not at all clear to me what this did until I got to patch 2, this
 > is actually providing the fault handling for mmap'ing a vGPU mmio BAR.
 > Needs a better name or better description.
 >

Say the VMM mmaps the whole BAR1 of the GPU, e.g. 128MB. A fault occurs 
when BAR1 is first accessed, and the size is calculated as:
req_size = vma->vm_end - virtaddr
Since the GPU is shared by multiple vGPUs, the GPU driver might not want 
to remap the whole BAR1 for a single vGPU device and would prefer to 
map, say, one page at a time. In that case the GPU driver returns 
PAGE_SIZE, which is then used by remap_pfn_range(). On the next access 
to a BAR1 page other than that one, we will again get a fault().
As the name says, this call lets the GPU driver validate the size and 
prot of the map area; the GPU driver can change both for this mapping.


 >> +	ssize_t (*write)(struct vgpu_device *vdev, char *buf, size_t count,
 >> +			 uint32_t address_space, loff_t pos);
 >
 > Aren't these really 'enum vgpu_emul_space_e', not uint32_t?
 >

Yes. I'll change to enum vgpu_emul_space_e.

Thanks,
Kirti.

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 2/3] VFIO driver for vGPU device
  2016-05-03 22:43     ` [Qemu-devel] " Alex Williamson
@ 2016-05-04 16:25       ` Kirti Wankhede
  -1 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-04 16:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: shuai.ruan, kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel,
	pbonzini, zhiyuan.lv


On 5/4/2016 4:13 AM, Alex Williamson wrote:
 > On Tue, 3 May 2016 00:10:40 +0530

 >>  obj-$(CONFIG_VGPU)			+= vgpu.o
 >> +obj-$(CONFIG_VGPU_VFIO)                 += vgpu_vfio.o
 >
 > This is where we should add a new Kconfig entry for VGPU_VFIO, nothing
 > in patch 1 has any vfio dependency.  Perhaps it should also depend on
 > VFIO_PCI rather than VFIO since you are getting very PCI specific below.

VGPU_VFIO depends on VFIO but is independent of VFIO_PCI. VGPU_VFIO uses 
the VFIO APIs defined for PCI devices and shares common #defines, but 
that doesn't mean it depends on VFIO_PCI.
I'll move the Kconfig entry for VGPU_VFIO here in the next version of 
the patch.

 >> +#define VFIO_PCI_OFFSET_SHIFT   40
 >> +
 >> +#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
 >> +#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << 
VFIO_PCI_OFFSET_SHIFT)
 >> +#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
 >
 > Change the name of these from vfio-pci please or shift code around to
 > use them directly.  You're certainly free to redefine these, but using
 > the same name is confusing.
 >

I'll move these defines to a common location.


 >> +	if (gpu_dev->ops->vgpu_bar_info)
 >> +		ret = gpu_dev->ops->vgpu_bar_info(vgpu_dev, index, bar_info);
 >
 > vgpu_bar_info is already optional, further validating that the vgpu
 > core is not PCI specific.

It is not optional if the vgpu_vfio module is to work on the device: if 
vgpu_bar_info is not provided by the vendor driver, open() would fail. 
vgpu_vfio expects a PCI device, so PCI device validation also needs to 
be added.
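
For example, the open() path could check something like this (a sketch
only; using dev_is_pci() on gpu_dev's struct device is my assumption for
how that validation could look once the core is switched to struct
device):

	if (!gpu_dev->ops->vgpu_bar_info || !dev_is_pci(gpu_dev->dev))
		return -EINVAL;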


 >
 > Let's not neglect ioport BARs here, IO_MASK is different.
 >

vgpu_device is a virtual device; it is not going to drive VGA signals. 
Nvidia vGPU will not support an IO BAR.


 >> +	vdev->refcnt--;
 >> +	if (!vdev->refcnt) {
 >> +		memset(&vdev->bar_info, 0, sizeof(vdev->bar_info));
 >
 > Why?

vfio_vgpu_device is allocated when the vgpu device is created by the 
vgpu core; QEMU/VMM then calls open() on that device, where 
vdev->bar_info is populated and vconfig is allocated.
In the teardown path, QEMU/VMM calls close() on the device, and 
vfio_vgpu_device is destroyed when the vgpu core destroys the vgpu device.

If QEMU/VMM restarts and the vgpu device is not destroyed, 
vdev->bar_info should be cleared so that it is fetched again from the 
vendor driver; it should not keep any stale addresses.
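
In other words, the intent of the last-close path is roughly the
following (what exactly happens to vconfig is my assumption based on the
description above):

	vdev->refcnt--;
	if (!vdev->refcnt) {
		memset(&vdev->bar_info, 0, sizeof(vdev->bar_info));
		/* vconfig can likewise be freed/reset here so that the
		 * next open() repopulates everything from the vendor
		 * driver. */
	}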

 >> +	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
 >> +		return -1;
 >
 > How are we going to expand the API later for it?  Shouldn't this just
 > be a passthrough to a gpu_devices_ops.vgpu_vfio_get_irq_info callback?

The vendor driver conveys the interrupt type by defining capabilities in 
config space. I don't think we should add a new callback for it.


 >> +		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);
 >
 > So write is expected to user_data to allow only the writable bits to be
 > changed?  What's really being saved in the vconfig here vs the vendor
 > vgpu driver?  It seems like we're only using it to cache the BAR
 > values, but we're not providing the BAR emulation here, which seems
 > like one of the few things we could provide so it's not duplicated in
 > every vendor driver.  But then we only need a few u32s to do that, not
 > all of config space.
 >

The vendor driver should emulate config space. It is not just BAR 
addresses; the vendor driver should also add the capabilities supported 
by its vGPU device.


 >> +
 >> +		if (gpu_dev->ops->write) {
 >> +			ret = gpu_dev->ops->write(vgpu_dev,
 >> +						  user_data,
 >> +						  count,
 >> +						  vgpu_emul_space_mmio,
 >> +						  pos);
 >> +		}
 >
 > What's the usefulness in a vendor driver that doesn't provide
 > read/write?

The checks are to avoid a NULL pointer dereference if these callbacks 
are not provided. Whether it will work or not depends completely on the 
vendor driver stack in the host and guest.

 >> +	case VFIO_PCI_ROM_REGION_INDEX:
 >> +	case VFIO_PCI_VGA_REGION_INDEX:
 >
 > Wait a sec, who's doing the VGA emulation?  We can't be claiming to
 > support a VGA region and then fail to provide read/write access to it
 > like we said it has.
 >

Nvidia vGPU doesn't support an IO BAR or a ROM BAR, but I can move these 
cases into
case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:

so that if a vendor driver supports IO BAR or ROM BAR emulation, it is 
handled the same way as the other BARs, as sketched below.
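
Something along these lines (a sketch only):

	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
	case VFIO_PCI_ROM_REGION_INDEX:
	case VFIO_PCI_VGA_REGION_INDEX:
		return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);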


 >> +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
 >
 > So not supporting validate_map_request() means that the user can
 > directly mmap BARs of the host GPU and as shown below, we assume a 1:1
 > mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
 > scenario or should this callback be required?

Yes, if restrictions are imposed such that only one vGPU device can be 
created on one physical GPU, i.e. 1:1 vGPU to host GPU.

 >  It's not clear to me how
 > the vendor driver determines what this maps to, do they compare it to
 > the physical device's own BAR addresses?
 >

Yes.

Thanks,
Kirti



^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 2/3] VFIO driver for vGPU device
@ 2016-05-04 16:25       ` Kirti Wankhede
  0 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-04 16:25 UTC (permalink / raw)
  To: Alex Williamson
  Cc: shuai.ruan, kevin.tian, cjia, kvm, qemu-devel, jike.song, kraxel,
	pbonzini, zhiyuan.lv


On 5/4/2016 4:13 AM, Alex Williamson wrote:
 > On Tue, 3 May 2016 00:10:40 +0530

 >>  obj-$(CONFIG_VGPU)			+= vgpu.o
 >> +obj-$(CONFIG_VGPU_VFIO)                 += vgpu_vfio.o
 >
 > This is where we should add a new Kconfig entry for VGPU_VFIO, nothing
 > in patch 1 has any vfio dependency.  Perhaps it should also depend on
 > VFIO_PCI rather than VFIO since you are getting very PCI specific below.

VGPU_VFIO depends on VFIO but is independent of VFIO_PCI. VGPU_VFIO uses 
the VFIO APIs defined for PCI devices and shares common #defines, but 
that doesn't mean it depends on VFIO_PCI.
I'll move the Kconfig entry for VGPU_VFIO here in the next version of 
the patch.

 >> +#define VFIO_PCI_OFFSET_SHIFT   40
 >> +
 >> +#define VFIO_PCI_OFFSET_TO_INDEX(off)	(off >> VFIO_PCI_OFFSET_SHIFT)
 >> +#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << 
VFIO_PCI_OFFSET_SHIFT)
 >> +#define VFIO_PCI_OFFSET_MASK	(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
 >
 > Change the name of these from vfio-pci please or shift code around to
 > use them directly.  You're certainly free to redefine these, but using
 > the same name is confusing.
 >

I'll move these defines to a common location.


 >> +	if (gpu_dev->ops->vgpu_bar_info)
 >> +		ret = gpu_dev->ops->vgpu_bar_info(vgpu_dev, index, bar_info);
 >
 > vgpu_bar_info is already optional, further validating that the vgpu
 > core is not PCI specific.

It is not optional if the vgpu_vfio module is to work on the device: if 
vgpu_bar_info is not provided by the vendor driver, open() would fail. 
vgpu_vfio expects a PCI device, so PCI device validation also needs to 
be added.


 >
 > Let's not neglect ioport BARs here, IO_MASK is different.
 >

vgpu_device is a virtual device; it is not going to drive VGA signals. 
Nvidia vGPU will not support an IO BAR.


 >> +	vdev->refcnt--;
 >> +	if (!vdev->refcnt) {
 >> +		memset(&vdev->bar_info, 0, sizeof(vdev->bar_info));
 >
 > Why?

vfio_vgpu_device is allocated when the vgpu device is created by the 
vgpu core; QEMU/VMM then calls open() on that device, where 
vdev->bar_info is populated and vconfig is allocated.
In the teardown path, QEMU/VMM calls close() on the device, and 
vfio_vgpu_device is destroyed when the vgpu core destroys the vgpu device.

If QEMU/VMM restarts and the vgpu device is not destroyed, 
vdev->bar_info should be cleared so that it is fetched again from the 
vendor driver; it should not keep any stale addresses.

 >> +	if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
 >> +		return -1;
 >
 > How are we going to expand the API later for it?  Shouldn't this just
 > be a passthrough to a gpu_devices_ops.vgpu_vfio_get_irq_info callback?

The vendor driver conveys the interrupt type by defining capabilities in 
config space. I don't think we should add a new callback for it.


 >> +		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);
 >
 > So write is expected to user_data to allow only the writable bits to be
 > changed?  What's really being saved in the vconfig here vs the vendor
 > vgpu driver?  It seems like we're only using it to cache the BAR
 > values, but we're not providing the BAR emulation here, which seems
 > like one of the few things we could provide so it's not duplicated in
 > every vendor driver.  But then we only need a few u32s to do that, not
 > all of config space.
 >

The vendor driver should emulate config space. It is not just BAR 
addresses; the vendor driver should also add the capabilities supported 
by its vGPU device.


 >> +
 >> +		if (gpu_dev->ops->write) {
 >> +			ret = gpu_dev->ops->write(vgpu_dev,
 >> +						  user_data,
 >> +						  count,
 >> +						  vgpu_emul_space_mmio,
 >> +						  pos);
 >> +		}
 >
 > What's the usefulness in a vendor driver that doesn't provide
 > read/write?

The checks are to avoid a NULL pointer dereference if these callbacks 
are not provided. Whether it will work or not depends completely on the 
vendor driver stack in the host and guest.

 >> +	case VFIO_PCI_ROM_REGION_INDEX:
 >> +	case VFIO_PCI_VGA_REGION_INDEX:
 >
 > Wait a sec, who's doing the VGA emulation?  We can't be claiming to
 > support a VGA region and then fail to provide read/write access to it
 > like we said it has.
 >

Nvidia vGPU doesn't support an IO BAR or a ROM BAR, but I can move these 
cases into
case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:

so that if a vendor driver supports IO BAR or ROM BAR emulation, it is 
handled the same way as the other BARs.


 >> +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
 >
 > So not supporting validate_map_request() means that the user can
 > directly mmap BARs of the host GPU and as shown below, we assume a 1:1
 > mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
 > scenario or should this callback be required?

Yes, if restrictions are imposed such that only one vGPU device can be 
created on one physical GPU, i.e. 1:1 vGPU to host GPU.

 >  It's not clear to me how
 > the vendor driver determines what this maps to, do they compare it to
 > the physical device's own BAR addresses?
 >

Yes.

Thanks,
Kirti

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-04  2:45       ` [Qemu-devel] " Tian, Kevin
@ 2016-05-04 16:57         ` Alex Williamson
  -1 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-04 16:57 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan

On Wed, 4 May 2016 02:45:59 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson
> > Sent: Wednesday, May 04, 2016 6:44 AM
> >   
> > > diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
> > > new file mode 100644
> > > index 0000000..792eb48
> > > --- /dev/null
> > > +++ b/drivers/vgpu/Kconfig
> > > @@ -0,0 +1,21 @@
> > > +
> > > +menuconfig VGPU
> > > +    tristate "VGPU driver framework"
> > > +    depends on VFIO
> > > +    select VGPU_VFIO
> > > +    help
> > > +        VGPU provides a framework to virtualize GPU without SR-IOV cap
> > > +        See Documentation/vgpu.txt for more details.
> > > +
> > > +        If you don't know what do here, say N.
> > > +
> > > +config VGPU
> > > +    tristate
> > > +    depends on VFIO
> > > +    default n
> > > +
> > > +config VGPU_VFIO
> > > +    tristate
> > > +    depends on VGPU
> > > +    default n
> > > +  
> > 
> > This is a little bit convoluted, it seems like everything added in this
> > patch is vfio agnostic, it doesn't necessarily care what the consumer
> > is.  That makes me think we should only be adding CONFIG_VGPU here and
> > it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
> > The middle config entry is also redundant to the first, just move the
> > default line up to the first and remove the rest.  
> 
> Agree. Removing such dependency also benefits other hypervisor if
> VFIO is not used.
> 
> Alex, there is one idea which I'd like to hear your comment. When looking at
> the whole series, we can see the majority logic (maybe I cannot say 100%)
> is GPU agnostic. Same frameworks in VFIO and vGPU core are actually neutral
> to underlying device type, which e.g. can be easily applied to a NIC card too
> if a similar technology is developed there.
> 
> Do you think whether we'd better make framework not GPU specific now
> (mostly naming change), or continue current style and change later only 
> when there is a real implementation on a different device? 

Yeah, I see that too and I made a bunch of comments in patch 3 that
we're not doing anything vGPU specific and we should be careful about
assuming the user for the various interfaces.  In patch 1, we are
fairly v/GPU specific because we're dealing with how vGPUs are created
from the physical GPU.  Maybe the interface is general, maybe it's not,
it's hard to say.  Starting with patch 2 though, we really shouldn't
know or care what the device is beyond a PCI compatible device.  We're
just trying to create a vfio bus driver compatible with vfio-pci and
offload enough generic operations so that we don't need to pass
everything back to the vendor driver.  Patch 3 of course should be
completely device agnostic, we should only care that the vfio backend
provides mediation of the device, so an iommu is not required.  It may
be too much of a rathole to try to completely generalize the interface
at this point, but let's certainly try not to let vgpu specific ideas
spread beyond where we need.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 1/3] vGPU Core driver
@ 2016-05-04 16:57         ` Alex Williamson
  0 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-04 16:57 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan

On Wed, 4 May 2016 02:45:59 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson
> > Sent: Wednesday, May 04, 2016 6:44 AM
> >   
> > > diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
> > > new file mode 100644
> > > index 0000000..792eb48
> > > --- /dev/null
> > > +++ b/drivers/vgpu/Kconfig
> > > @@ -0,0 +1,21 @@
> > > +
> > > +menuconfig VGPU
> > > +    tristate "VGPU driver framework"
> > > +    depends on VFIO
> > > +    select VGPU_VFIO
> > > +    help
> > > +        VGPU provides a framework to virtualize GPU without SR-IOV cap
> > > +        See Documentation/vgpu.txt for more details.
> > > +
> > > +        If you don't know what do here, say N.
> > > +
> > > +config VGPU
> > > +    tristate
> > > +    depends on VFIO
> > > +    default n
> > > +
> > > +config VGPU_VFIO
> > > +    tristate
> > > +    depends on VGPU
> > > +    default n
> > > +  
> > 
> > This is a little bit convoluted, it seems like everything added in this
> > patch is vfio agnostic, it doesn't necessarily care what the consumer
> > is.  That makes me think we should only be adding CONFIG_VGPU here and
> > it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
> > The middle config entry is also redundant to the first, just move the
> > default line up to the first and remove the rest.  
> 
> Agree. Removing such dependency also benefits other hypervisor if
> VFIO is not used.
> 
> Alex, there is one idea which I'd like to hear your comment. When looking at
> the whole series, we can see the majority logic (maybe I cannot say 100%)
> is GPU agnostic. Same frameworks in VFIO and vGPU core are actually neutral
> to underlying device type, which e.g. can be easily applied to a NIC card too
> if a similar technology is developed there.
> 
> Do you think whether we'd better make framework not GPU specific now
> (mostly naming change), or continue current style and change later only 
> when there is a real implementation on a different device? 

Yeah, I see that too and I made a bunch of comments in patch 3 that
we're not doing anything vGPU specific and we should be careful about
assuming the user for the various interfaces.  In patch 1, we are
fairly v/GPU specific because we're dealing with how vGPUs are created
from the physical GPU.  Maybe the interface is general, maybe it's not,
it's hard to say.  Starting with patch 2 though, we really shouldn't
know or care what the device is beyond a PCI compatible device.  We're
just trying to create a vfio bus driver compatible with vfio-pci and
offload enough generic operations so that we don't need to pass
everything back to the vendor driver.  Patch 3 of course should be
completely device agnostic, we should only care that the vfio backend
provides mediation of the device, so an iommu is not required.  It may
be too much of a rathole to try to completely generalize the interface
at this point, but let's certainly try not to let vgpu specific ideas
spread beyond where we need.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 2/3] VFIO driver for vGPU device
  2016-05-04  3:23       ` [Qemu-devel] " Tian, Kevin
@ 2016-05-04 17:06         ` Alex Williamson
  -1 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-04 17:06 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan

On Wed, 4 May 2016 03:23:13 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, May 04, 2016 6:43 AM  
> > > +
> > > +		if (gpu_dev->ops->write) {
> > > +			ret = gpu_dev->ops->write(vgpu_dev,
> > > +						  user_data,
> > > +						  count,
> > > +						  vgpu_emul_space_config,
> > > +						  pos);
> > > +		}
> > > +
> > > +		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);  
> > 
> > So write is expected to user_data to allow only the writable bits to be
> > changed?  What's really being saved in the vconfig here vs the vendor
> > vgpu driver?  It seems like we're only using it to cache the BAR
> > values, but we're not providing the BAR emulation here, which seems
> > like one of the few things we could provide so it's not duplicated in
> > every vendor driver.  But then we only need a few u32s to do that, not
> > all of config space.  
> 
> We can borrow same vconfig emulation from existing vfio-pci driver.
> But doing so doesn't mean that vendor vgpu driver cannot have its
> own vconfig emulation further. vGPU is not like a real device, since
> there may be no physical config space implemented for each vGPU.
> So anyway vendor vGPU driver needs to create/emulate the virtualized 
> config space while the way how is created might be vendor specific. 
> So better to keep the interface to access raw vconfig space from
> vendor vGPU driver.

I'm hoping config space will be very simple for a vgpu, so I don't know
that it makes sense to add that complexity early on.  Neo/Kirti, what
capabilities do you expect to provide?  Who provides the MSI
capability?  Is a PCIe capability provided?  Others?
 
> > > +static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
> > > +		size_t count, loff_t *ppos, bool iswrite)
> > > +{
> > > +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> > > +	struct vfio_vgpu_device *vdev = device_data;
> > > +
> > > +	if (index >= VFIO_PCI_NUM_REGIONS)
> > > +		return -EINVAL;
> > > +
> > > +	switch (index) {
> > > +	case VFIO_PCI_CONFIG_REGION_INDEX:
> > > +		return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
> > > +
> > > +	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> > > +		return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
> > > +
> > > +	case VFIO_PCI_ROM_REGION_INDEX:
> > > +	case VFIO_PCI_VGA_REGION_INDEX:  
> > 
> > Wait a sec, who's doing the VGA emulation?  We can't be claiming to
> > support a VGA region and then fail to provide read/write access to it
> > like we said it has.  
> 
> For Intel side we plan to not support VGA region when upstreaming our
> KVMGT work, which means Intel vGPU will be exposed only as a 
> secondary graphics card then so legacy VGA is not required. Also no
> VBIOS/ROM requirement. Guess we can remove above two regions.

So this needs to be optional based on what the mediation driver
provides.  It seems like we're just making passthroughs for the vendor
mediation driver to speak vfio.

> > > +
> > > +static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > +{
> > > +	int ret = 0;
> > > +	struct vfio_vgpu_device *vdev = vma->vm_private_data;
> > > +	struct vgpu_device *vgpu_dev;
> > > +	struct gpu_device *gpu_dev;
> > > +	u64 virtaddr = (u64)vmf->virtual_address;
> > > +	u64 offset, phyaddr;
> > > +	unsigned long req_size, pgoff;
> > > +	pgprot_t pg_prot;
> > > +
> > > +	if (!vdev && !vdev->vgpu_dev)
> > > +		return -EINVAL;
> > > +
> > > +	vgpu_dev = vdev->vgpu_dev;
> > > +	gpu_dev  = vgpu_dev->gpu_dev;
> > > +
> > > +	offset   = vma->vm_pgoff << PAGE_SHIFT;
> > > +	phyaddr  = virtaddr - vma->vm_start + offset;
> > > +	pgoff    = phyaddr >> PAGE_SHIFT;
> > > +	req_size = vma->vm_end - virtaddr;
> > > +	pg_prot  = vma->vm_page_prot;
> > > +
> > > +	if (gpu_dev->ops->validate_map_request) {
> > > +		ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr, &pgoff,
> > > +							 &req_size, &pg_prot);
> > > +		if (ret)
> > > +			return ret;
> > > +
> > > +		if (!req_size)
> > > +			return -EINVAL;
> > > +	}
> > > +
> > > +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);  
> > 
> > So not supporting validate_map_request() means that the user can
> > directly mmap BARs of the host GPU and as shown below, we assume a 1:1
> > mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
> > scenario or should this callback be required?  It's not clear to me how
> > the vendor driver determines what this maps to, do they compare it to
> > the physical device's own BAR addresses?  
> 
> I didn't quite understand too. Based on earlier discussion, do we need
> something like this, or could achieve the purpose just by leveraging
> recent sparse mmap support?

The reason for faulting in the mmio space, if I recall correctly, is to
enable an ordering where the user driver (QEMU) can mmap regions of the
device prior to resources being allocated on the host GPU to handle
them.  Sparse mmap only partially handles that, it's not dynamic.  With
this faulting mechanism, the host GPU doesn't need to commit resources
until the mmap is actually accessed.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 2/3] VFIO driver for vGPU device
@ 2016-05-04 17:06         ` Alex Williamson
  0 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-04 17:06 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan

On Wed, 4 May 2016 03:23:13 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, May 04, 2016 6:43 AM  
> > > +
> > > +		if (gpu_dev->ops->write) {
> > > +			ret = gpu_dev->ops->write(vgpu_dev,
> > > +						  user_data,
> > > +						  count,
> > > +						  vgpu_emul_space_config,
> > > +						  pos);
> > > +		}
> > > +
> > > +		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);  
> > 
> > So write is expected to user_data to allow only the writable bits to be
> > changed?  What's really being saved in the vconfig here vs the vendor
> > vgpu driver?  It seems like we're only using it to cache the BAR
> > values, but we're not providing the BAR emulation here, which seems
> > like one of the few things we could provide so it's not duplicated in
> > every vendor driver.  But then we only need a few u32s to do that, not
> > all of config space.  
> 
> We can borrow same vconfig emulation from existing vfio-pci driver.
> But doing so doesn't mean that vendor vgpu driver cannot have its
> own vconfig emulation further. vGPU is not like a real device, since
> there may be no physical config space implemented for each vGPU.
> So anyway vendor vGPU driver needs to create/emulate the virtualized 
> config space while the way how is created might be vendor specific. 
> So better to keep the interface to access raw vconfig space from
> vendor vGPU driver.

I'm hoping config space will be very simple for a vgpu, so I don't know
that it makes sense to add that complexity early on.  Neo/Kirti, what
capabilities do you expect to provide?  Who provides the MSI
capability?  Is a PCIe capability provided?  Others?
 
> > > +static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
> > > +		size_t count, loff_t *ppos, bool iswrite)
> > > +{
> > > +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> > > +	struct vfio_vgpu_device *vdev = device_data;
> > > +
> > > +	if (index >= VFIO_PCI_NUM_REGIONS)
> > > +		return -EINVAL;
> > > +
> > > +	switch (index) {
> > > +	case VFIO_PCI_CONFIG_REGION_INDEX:
> > > +		return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
> > > +
> > > +	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> > > +		return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
> > > +
> > > +	case VFIO_PCI_ROM_REGION_INDEX:
> > > +	case VFIO_PCI_VGA_REGION_INDEX:  
> > 
> > Wait a sec, who's doing the VGA emulation?  We can't be claiming to
> > support a VGA region and then fail to provide read/write access to it
> > like we said it has.  
> 
> For Intel side we plan to not support VGA region when upstreaming our
> KVMGT work, which means Intel vGPU will be exposed only as a 
> secondary graphics card then so legacy VGA is not required. Also no
> VBIOS/ROM requirement. Guess we can remove above two regions.

So this needs to be optional based on what the mediation driver
provides.  It seems like we're just making passthroughs for the vendor
mediation driver to speak vfio.

> > > +
> > > +static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > +{
> > > +	int ret = 0;
> > > +	struct vfio_vgpu_device *vdev = vma->vm_private_data;
> > > +	struct vgpu_device *vgpu_dev;
> > > +	struct gpu_device *gpu_dev;
> > > +	u64 virtaddr = (u64)vmf->virtual_address;
> > > +	u64 offset, phyaddr;
> > > +	unsigned long req_size, pgoff;
> > > +	pgprot_t pg_prot;
> > > +
> > > +	if (!vdev && !vdev->vgpu_dev)
> > > +		return -EINVAL;
> > > +
> > > +	vgpu_dev = vdev->vgpu_dev;
> > > +	gpu_dev  = vgpu_dev->gpu_dev;
> > > +
> > > +	offset   = vma->vm_pgoff << PAGE_SHIFT;
> > > +	phyaddr  = virtaddr - vma->vm_start + offset;
> > > +	pgoff    = phyaddr >> PAGE_SHIFT;
> > > +	req_size = vma->vm_end - virtaddr;
> > > +	pg_prot  = vma->vm_page_prot;
> > > +
> > > +	if (gpu_dev->ops->validate_map_request) {
> > > +		ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr, &pgoff,
> > > +							 &req_size, &pg_prot);
> > > +		if (ret)
> > > +			return ret;
> > > +
> > > +		if (!req_size)
> > > +			return -EINVAL;
> > > +	}
> > > +
> > > +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);  
> > 
> > So not supporting validate_map_request() means that the user can
> > directly mmap BARs of the host GPU and as shown below, we assume a 1:1
> > mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
> > scenario or should this callback be required?  It's not clear to me how
> > the vendor driver determines what this maps to, do they compare it to
> > the physical device's own BAR addresses?  
> 
> I didn't quite understand too. Based on earlier discussion, do we need
> something like this, or could achieve the purpose just by leveraging
> recent sparse mmap support?

The reason for faulting in the mmio space, if I recall correctly, is to
enable an ordering where the user driver (QEMU) can mmap regions of the
device prior to resources being allocated on the host GPU to handle
them.  Sparse mmap only partially handles that, it's not dynamic.  With
this faulting mechanism, the host GPU doesn't need to commit resources
until the mmap is actually accessed.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 0/3] Add vGPU support
  2016-05-04  6:17     ` [Qemu-devel] " Neo Jia
@ 2016-05-04 17:07       ` Alex Williamson
  -1 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-04 17:07 UTC (permalink / raw)
  To: Neo Jia
  Cc: Tian, Kevin, Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm,
	Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Tue, 3 May 2016 23:17:24 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Wed, May 04, 2016 at 01:05:36AM +0000, Tian, Kevin wrote:
> > > From: Kirti Wankhede
> > > Sent: Tuesday, May 03, 2016 2:41 AM
> > > 
> > > This series adds vGPU support to v4.6 Linux host kernel. Purpose of this series
> > > is to provide a common interface for vGPU management that can be used
> > > by different GPU drivers. This series introduces vGPU core module that create
> > > and manage vGPU devices, VFIO based driver for vGPU devices that are created by
> > > vGPU core module and update VFIO type1 IOMMU module to support vGPU devices.
> > > 
> > > What's new in v3?
> > > VFIO type1 IOMMU module supports devices which are IOMMU capable. This version
> > > of patched adds support for vGPU devices, which are not IOMMU capable, to use
> > > existing VFIO IOMMU module. VFIO Type1 IOMMU patch provide new set of APIs for
> > > guest page translation.
> > > 
> > > What's left to do?
> > > VFIO driver for vGPU device doesn't support devices with MSI-X enabled.
> > > 
> > > Please review.
> > >   
> > 
> > Thanks Kirti/Neo for your nice work! We are integrating this common
> > framework with KVMGT. Once ready it'll be released as an experimental
> > feature in our next community release.
> > 
> > One curious question. There are some additional changes in our side.
> > What is the best way to collaborate our effort before this series is
> > accepted in upstream kernel? Do you prefer to receiving patches from
> > us directly, or having it hosted some place so both sides can contribute?  
> 
> Yes, sending it directly to Kirti and myself will work the best, we can sort
> out this process offline.

Please do it online, in the open, on public mailing lists.  We
specifically do not want to develop the interfaces in private.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 0/3] Add vGPU support
@ 2016-05-04 17:07       ` Alex Williamson
  0 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-04 17:07 UTC (permalink / raw)
  To: Neo Jia
  Cc: Tian, Kevin, Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm,
	Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Tue, 3 May 2016 23:17:24 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Wed, May 04, 2016 at 01:05:36AM +0000, Tian, Kevin wrote:
> > > From: Kirti Wankhede
> > > Sent: Tuesday, May 03, 2016 2:41 AM
> > > 
> > > This series adds vGPU support to v4.6 Linux host kernel. Purpose of this series
> > > is to provide a common interface for vGPU management that can be used
> > > by different GPU drivers. This series introduces vGPU core module that create
> > > and manage vGPU devices, VFIO based driver for vGPU devices that are created by
> > > vGPU core module and update VFIO type1 IOMMU module to support vGPU devices.
> > > 
> > > What's new in v3?
> > > VFIO type1 IOMMU module supports devices which are IOMMU capable. This version
> > > of patched adds support for vGPU devices, which are not IOMMU capable, to use
> > > existing VFIO IOMMU module. VFIO Type1 IOMMU patch provide new set of APIs for
> > > guest page translation.
> > > 
> > > What's left to do?
> > > VFIO driver for vGPU device doesn't support devices with MSI-X enabled.
> > > 
> > > Please review.
> > >   
> > 
> > Thanks Kirti/Neo for your nice work! We are integrating this common
> > framework with KVMGT. Once ready it'll be released as an experimental
> > feature in our next community release.
> > 
> > One curious question. There are some additional changes in our side.
> > What is the best way to collaborate our effort before this series is
> > accepted in upstream kernel? Do you prefer to receiving patches from
> > us directly, or having it hosted some place so both sides can contribute?  
> 
> Yes, sending it directly to Kirti and myself will work the best, we can sort
> out this process offline.

Please do it online, in the open, on public mailing lists.  We
specifically do not want to develop the interfaces in private.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 2/3] VFIO driver for vGPU device
  2016-05-04 17:06         ` [Qemu-devel] " Alex Williamson
@ 2016-05-04 21:14           ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-04 21:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm,
	Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Wed, May 04, 2016 at 11:06:19AM -0600, Alex Williamson wrote:
> On Wed, 4 May 2016 03:23:13 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Wednesday, May 04, 2016 6:43 AM  
> > > > +
> > > > +		if (gpu_dev->ops->write) {
> > > > +			ret = gpu_dev->ops->write(vgpu_dev,
> > > > +						  user_data,
> > > > +						  count,
> > > > +						  vgpu_emul_space_config,
> > > > +						  pos);
> > > > +		}
> > > > +
> > > > +		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);  
> > > 
> > > So write is expected to user_data to allow only the writable bits to be
> > > changed?  What's really being saved in the vconfig here vs the vendor
> > > vgpu driver?  It seems like we're only using it to cache the BAR
> > > values, but we're not providing the BAR emulation here, which seems
> > > like one of the few things we could provide so it's not duplicated in
> > > every vendor driver.  But then we only need a few u32s to do that, not
> > > all of config space.  
> > 
> > We can borrow same vconfig emulation from existing vfio-pci driver.
> > But doing so doesn't mean that vendor vgpu driver cannot have its
> > own vconfig emulation further. vGPU is not like a real device, since
> > there may be no physical config space implemented for each vGPU.
> > So anyway vendor vGPU driver needs to create/emulate the virtualized 
> > config space while the way how is created might be vendor specific. 
> > So better to keep the interface to access raw vconfig space from
> > vendor vGPU driver.
> 
> I'm hoping config space will be very simple for a vgpu, so I don't know
> that it makes sense to add that complexity early on.  Neo/Kirti, what
> capabilities do you expect to provide?  Who provides the MSI
> capability?  Is a PCIe capability provided?  Others?

Currently only standard PCI caps.

MSI cap is emulated by the vendor drivers via the above interface.

No PCIe caps so far.

>  
> > > > +static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
> > > > +		size_t count, loff_t *ppos, bool iswrite)
> > > > +{
> > > > +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> > > > +	struct vfio_vgpu_device *vdev = device_data;
> > > > +
> > > > +	if (index >= VFIO_PCI_NUM_REGIONS)
> > > > +		return -EINVAL;
> > > > +
> > > > +	switch (index) {
> > > > +	case VFIO_PCI_CONFIG_REGION_INDEX:
> > > > +		return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
> > > > +
> > > > +	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> > > > +		return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
> > > > +
> > > > +	case VFIO_PCI_ROM_REGION_INDEX:
> > > > +	case VFIO_PCI_VGA_REGION_INDEX:  
> > > 
> > > Wait a sec, who's doing the VGA emulation?  We can't be claiming to
> > > support a VGA region and then fail to provide read/write access to it
> > > like we said it has.  
> > 
> > For Intel side we plan to not support VGA region when upstreaming our
> > KVMGT work, which means Intel vGPU will be exposed only as a 
> > secondary graphics card then so legacy VGA is not required. Also no
> > VBIOS/ROM requirement. Guess we can remove above two regions.
> 
> So this needs to be optional based on what the mediation driver
> provides.  It seems like we're just making passthroughs for the vendor
> mediation driver to speak vfio.
> 
> > > > +
> > > > +static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > +{
> > > > +	int ret = 0;
> > > > +	struct vfio_vgpu_device *vdev = vma->vm_private_data;
> > > > +	struct vgpu_device *vgpu_dev;
> > > > +	struct gpu_device *gpu_dev;
> > > > +	u64 virtaddr = (u64)vmf->virtual_address;
> > > > +	u64 offset, phyaddr;
> > > > +	unsigned long req_size, pgoff;
> > > > +	pgprot_t pg_prot;
> > > > +
> > > > +	if (!vdev && !vdev->vgpu_dev)
> > > > +		return -EINVAL;
> > > > +
> > > > +	vgpu_dev = vdev->vgpu_dev;
> > > > +	gpu_dev  = vgpu_dev->gpu_dev;
> > > > +
> > > > +	offset   = vma->vm_pgoff << PAGE_SHIFT;
> > > > +	phyaddr  = virtaddr - vma->vm_start + offset;
> > > > +	pgoff    = phyaddr >> PAGE_SHIFT;
> > > > +	req_size = vma->vm_end - virtaddr;
> > > > +	pg_prot  = vma->vm_page_prot;
> > > > +
> > > > +	if (gpu_dev->ops->validate_map_request) {
> > > > +		ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr, &pgoff,
> > > > +							 &req_size, &pg_prot);
> > > > +		if (ret)
> > > > +			return ret;
> > > > +
> > > > +		if (!req_size)
> > > > +			return -EINVAL;
> > > > +	}
> > > > +
> > > > +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);  
> > > 
> > > So not supporting validate_map_request() means that the user can
> > > directly mmap BARs of the host GPU and as shown below, we assume a 1:1
> > > mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
> > > scenario or should this callback be required?  It's not clear to me how
> > > the vendor driver determines what this maps to, do they compare it to
> > > the physical device's own BAR addresses?  
> > 
> > I didn't quite understand too. Based on earlier discussion, do we need
> > something like this, or could achieve the purpose just by leveraging
> > recent sparse mmap support?
> 
> The reason for faulting in the mmio space, if I recall correctly, is to
> enable an ordering where the user driver (QEMU) can mmap regions of the
> device prior to resources being allocated on the host GPU to handle
> them.  Sparse mmap only partially handles that, it's not dynamic.  With
> this faulting mechanism, the host GPU doesn't need to commit resources
> until the mmap is actually accessed.  Thanks,

Correct.

Thanks,
Neo

> 
> Alex

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 2/3] VFIO driver for vGPU device
@ 2016-05-04 21:14           ` Neo Jia
  0 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-04 21:14 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm,
	Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Wed, May 04, 2016 at 11:06:19AM -0600, Alex Williamson wrote:
> On Wed, 4 May 2016 03:23:13 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Wednesday, May 04, 2016 6:43 AM  
> > > > +
> > > > +		if (gpu_dev->ops->write) {
> > > > +			ret = gpu_dev->ops->write(vgpu_dev,
> > > > +						  user_data,
> > > > +						  count,
> > > > +						  vgpu_emul_space_config,
> > > > +						  pos);
> > > > +		}
> > > > +
> > > > +		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);  
> > > 
> > > So write is expected to user_data to allow only the writable bits to be
> > > changed?  What's really being saved in the vconfig here vs the vendor
> > > vgpu driver?  It seems like we're only using it to cache the BAR
> > > values, but we're not providing the BAR emulation here, which seems
> > > like one of the few things we could provide so it's not duplicated in
> > > every vendor driver.  But then we only need a few u32s to do that, not
> > > all of config space.  
> > 
> > We can borrow same vconfig emulation from existing vfio-pci driver.
> > But doing so doesn't mean that vendor vgpu driver cannot have its
> > own vconfig emulation further. vGPU is not like a real device, since
> > there may be no physical config space implemented for each vGPU.
> > So anyway vendor vGPU driver needs to create/emulate the virtualized 
> > config space while the way how is created might be vendor specific. 
> > So better to keep the interface to access raw vconfig space from
> > vendor vGPU driver.
> 
> I'm hoping config space will be very simple for a vgpu, so I don't know
> that it makes sense to add that complexity early on.  Neo/Kirti, what
> capabilities do you expect to provide?  Who provides the MSI
> capability?  Is a PCIe capability provided?  Others?

Currently only standard PCI caps.

MSI cap is emulated by the vendor drivers via the above interface.

No PCIe caps so far.

>  
> > > > +static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
> > > > +		size_t count, loff_t *ppos, bool iswrite)
> > > > +{
> > > > +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> > > > +	struct vfio_vgpu_device *vdev = device_data;
> > > > +
> > > > +	if (index >= VFIO_PCI_NUM_REGIONS)
> > > > +		return -EINVAL;
> > > > +
> > > > +	switch (index) {
> > > > +	case VFIO_PCI_CONFIG_REGION_INDEX:
> > > > +		return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
> > > > +
> > > > +	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> > > > +		return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
> > > > +
> > > > +	case VFIO_PCI_ROM_REGION_INDEX:
> > > > +	case VFIO_PCI_VGA_REGION_INDEX:  
> > > 
> > > Wait a sec, who's doing the VGA emulation?  We can't be claiming to
> > > support a VGA region and then fail to provide read/write access to it
> > > like we said it has.  
> > 
> > For Intel side we plan to not support VGA region when upstreaming our
> > KVMGT work, which means Intel vGPU will be exposed only as a 
> > secondary graphics card then so legacy VGA is not required. Also no
> > VBIOS/ROM requirement. Guess we can remove above two regions.
> 
> So this needs to be optional based on what the mediation driver
> provides.  It seems like we're just making passthroughs for the vendor
> mediation driver to speak vfio.
> 
> > > > +
> > > > +static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > +{
> > > > +	int ret = 0;
> > > > +	struct vfio_vgpu_device *vdev = vma->vm_private_data;
> > > > +	struct vgpu_device *vgpu_dev;
> > > > +	struct gpu_device *gpu_dev;
> > > > +	u64 virtaddr = (u64)vmf->virtual_address;
> > > > +	u64 offset, phyaddr;
> > > > +	unsigned long req_size, pgoff;
> > > > +	pgprot_t pg_prot;
> > > > +
> > > > +	if (!vdev && !vdev->vgpu_dev)
> > > > +		return -EINVAL;
> > > > +
> > > > +	vgpu_dev = vdev->vgpu_dev;
> > > > +	gpu_dev  = vgpu_dev->gpu_dev;
> > > > +
> > > > +	offset   = vma->vm_pgoff << PAGE_SHIFT;
> > > > +	phyaddr  = virtaddr - vma->vm_start + offset;
> > > > +	pgoff    = phyaddr >> PAGE_SHIFT;
> > > > +	req_size = vma->vm_end - virtaddr;
> > > > +	pg_prot  = vma->vm_page_prot;
> > > > +
> > > > +	if (gpu_dev->ops->validate_map_request) {
> > > > +		ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr, &pgoff,
> > > > +							 &req_size, &pg_prot);
> > > > +		if (ret)
> > > > +			return ret;
> > > > +
> > > > +		if (!req_size)
> > > > +			return -EINVAL;
> > > > +	}
> > > > +
> > > > +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);  
> > > 
> > > So not supporting validate_map_request() means that the user can
> > > directly mmap BARs of the host GPU and as shown below, we assume a 1:1
> > > mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
> > > scenario or should this callback be required?  It's not clear to me how
> > > the vendor driver determines what this maps to, do they compare it to
> > > the physical device's own BAR addresses?  
> > 
> > I didn't quite understand too. Based on earlier discussion, do we need
> > something like this, or could achieve the purpose just by leveraging
> > recent sparse mmap support?
> 
> The reason for faulting in the mmio space, if I recall correctly, is to
> enable an ordering where the user driver (QEMU) can mmap regions of the
> device prior to resources being allocated on the host GPU to handle
> them.  Sparse mmap only partially handles that, it's not dynamic.  With
> this faulting mechanism, the host GPU doesn't need to commit resources
> until the mmap is actually accessed.  Thanks,

Correct.

Thanks,
Neo

> 
> Alex

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 2/3] VFIO driver for vGPU device
  2016-05-04 21:14           ` [Qemu-devel] " Neo Jia
@ 2016-05-05  4:42             ` Kirti Wankhede
  -1 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-05  4:42 UTC (permalink / raw)
  To: Neo Jia, Alex Williamson
  Cc: Tian, Kevin, pbonzini, kraxel, qemu-devel, kvm, Ruan, Shuai,
	Song, Jike, Lv, Zhiyuan



On 5/5/2016 2:44 AM, Neo Jia wrote:
> On Wed, May 04, 2016 at 11:06:19AM -0600, Alex Williamson wrote:
>> On Wed, 4 May 2016 03:23:13 +0000
>> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>>
>>>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>>>> Sent: Wednesday, May 04, 2016 6:43 AM
>>>>> +
>>>>> +		if (gpu_dev->ops->write) {
>>>>> +			ret = gpu_dev->ops->write(vgpu_dev,
>>>>> +						  user_data,
>>>>> +						  count,
>>>>> +						  vgpu_emul_space_config,
>>>>> +						  pos);
>>>>> +		}
>>>>> +
>>>>> +		memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);
>>>>
>>>> So write is expected to user_data to allow only the writable bits to be
>>>> changed?  What's really being saved in the vconfig here vs the vendor
>>>> vgpu driver?  It seems like we're only using it to cache the BAR
>>>> values, but we're not providing the BAR emulation here, which seems
>>>> like one of the few things we could provide so it's not duplicated in
>>>> every vendor driver.  But then we only need a few u32s to do that, not
>>>> all of config space.
>>>
>>> We can borrow same vconfig emulation from existing vfio-pci driver.
>>> But doing so doesn't mean that vendor vgpu driver cannot have its
>>> own vconfig emulation further. vGPU is not like a real device, since
>>> there may be no physical config space implemented for each vGPU.
>>> So anyway vendor vGPU driver needs to create/emulate the virtualized
>>> config space while the way how is created might be vendor specific.
>>> So better to keep the interface to access raw vconfig space from
>>> vendor vGPU driver.
>>
>> I'm hoping config space will be very simple for a vgpu, so I don't know
>> that it makes sense to add that complexity early on.  Neo/Kirti, what
>> capabilities do you expect to provide?  Who provides the MSI
>> capability?  Is a PCIe capability provided?  Others?
>

 From the VGPU_VFIO point of view, VGPU_VFIO would not provide or modify 
any capabilities. The vendor vGPU driver should provide the config 
space; the vendor driver can then provide PCI or PCIe capabilities, and 
it might also include vendor-specific information. The VGPU_VFIO driver 
would not intercept that information.

> Currently only standard PCI caps.
>
> MSI cap is emulated by the vendor drivers via the above interface.
>
> No PCIe caps so far.
>

The Nvidia vGPU device is a standard PCI device. We have tested the standard PCI caps.

Thanks,
Kirti.

>>
>>>>> +static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
>>>>> +		size_t count, loff_t *ppos, bool iswrite)
>>>>> +{
>>>>> +	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>>>>> +	struct vfio_vgpu_device *vdev = device_data;
>>>>> +
>>>>> +	if (index >= VFIO_PCI_NUM_REGIONS)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	switch (index) {
>>>>> +	case VFIO_PCI_CONFIG_REGION_INDEX:
>>>>> +		return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
>>>>> +
>>>>> +	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
>>>>> +		return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
>>>>> +
>>>>> +	case VFIO_PCI_ROM_REGION_INDEX:
>>>>> +	case VFIO_PCI_VGA_REGION_INDEX:
>>>>
>>>> Wait a sec, who's doing the VGA emulation?  We can't be claiming to
>>>> support a VGA region and then fail to provide read/write access to it
>>>> like we said it has.
>>>
>>> For Intel side we plan to not support VGA region when upstreaming our
>>> KVMGT work, which means Intel vGPU will be exposed only as a
>>> secondary graphics card then so legacy VGA is not required. Also no
>>> VBIOS/ROM requirement. Guess we can remove above two regions.
>>
>> So this needs to be optional based on what the mediation driver
>> provides.  It seems like we're just making passthroughs for the vendor
>> mediation driver to speak vfio.
>>
>>>>> +
>>>>> +static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>>>>> +{
>>>>> +	int ret = 0;
>>>>> +	struct vfio_vgpu_device *vdev = vma->vm_private_data;
>>>>> +	struct vgpu_device *vgpu_dev;
>>>>> +	struct gpu_device *gpu_dev;
>>>>> +	u64 virtaddr = (u64)vmf->virtual_address;
>>>>> +	u64 offset, phyaddr;
>>>>> +	unsigned long req_size, pgoff;
>>>>> +	pgprot_t pg_prot;
>>>>> +
>>>>> +	if (!vdev && !vdev->vgpu_dev)
>>>>> +		return -EINVAL;
>>>>> +
>>>>> +	vgpu_dev = vdev->vgpu_dev;
>>>>> +	gpu_dev  = vgpu_dev->gpu_dev;
>>>>> +
>>>>> +	offset   = vma->vm_pgoff << PAGE_SHIFT;
>>>>> +	phyaddr  = virtaddr - vma->vm_start + offset;
>>>>> +	pgoff    = phyaddr >> PAGE_SHIFT;
>>>>> +	req_size = vma->vm_end - virtaddr;
>>>>> +	pg_prot  = vma->vm_page_prot;
>>>>> +
>>>>> +	if (gpu_dev->ops->validate_map_request) {
>>>>> +		ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr, &pgoff,
>>>>> +							 &req_size, &pg_prot);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>> +
>>>>> +		if (!req_size)
>>>>> +			return -EINVAL;
>>>>> +	}
>>>>> +
>>>>> +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
>>>>
>>>> So not supporting validate_map_request() means that the user can
>>>> directly mmap BARs of the host GPU and as shown below, we assume a 1:1
>>>> mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
>>>> scenario or should this callback be required?  It's not clear to me how
>>>> the vendor driver determines what this maps to, do they compare it to
>>>> the physical device's own BAR addresses?
>>>
>>> I didn't quite understand too. Based on earlier discussion, do we need
>>> something like this, or could achieve the purpose just by leveraging
>>> recent sparse mmap support?
>>
>> The reason for faulting in the mmio space, if I recall correctly, is to
>> enable an ordering where the user driver (QEMU) can mmap regions of the
>> device prior to resources being allocated on the host GPU to handle
>> them.  Sparse mmap only partially handles that, it's not dynamic.  With
>> this faulting mechanism, the host GPU doesn't need to commit resources
>> until the mmap is actually accessed.  Thanks,
>
> Correct.
>
> Thanks,
> Neo
>
>>
>> Alex
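
The deferred-commit ordering described above can be pictured with a small
sketch of the fault path. This is a simplified illustration using the fault
signature the patch itself uses (vma + vmf with vmf->virtual_address); it
differs from the patch in returning VM_FAULT_* codes, and the single-page
narrowing inside validate_map_request() is an assumption about what a vendor
driver might choose to do.

    static int vgpu_mmio_fault_sketch(struct vm_area_struct *vma,
                                      struct vm_fault *vmf)
    {
        struct vfio_vgpu_device *vdev = vma->vm_private_data;
        struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
        struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
        u64 virtaddr = (u64)vmf->virtual_address;
        unsigned long pgoff, req_size;
        pgprot_t pg_prot = vma->vm_page_prot;

        /* default: assume a 1:1 layout of the vGPU BAR over the host BAR */
        pgoff    = vma->vm_pgoff + ((virtaddr - vma->vm_start) >> PAGE_SHIFT);
        req_size = vma->vm_end - virtaddr;

        /*
         * The vendor driver may shrink the request (e.g. to one page)
         * and point pgoff at backing it allocates only now -- this is
         * what defers the host GPU resource commit until first access.
         */
        if (gpu_dev->ops->validate_map_request &&
            (gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr, &pgoff,
                                                &req_size, &pg_prot) ||
             !req_size))
                return VM_FAULT_SIGBUS;

        if (remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot))
                return VM_FAULT_SIGBUS;

        return VM_FAULT_NOPAGE;
    }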

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-03 22:43     ` [Qemu-devel] " Alex Williamson
@ 2016-05-05  6:55       ` Jike Song
  0 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-05  6:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm,
	kevin.tian, shuai.ruan, zhiyuan.lv

On 05/04/2016 06:43 AM, Alex Williamson wrote:
> On Tue, 3 May 2016 00:10:41 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>> +
>> +/*
>> + * Pin a set of guest PFNs and return their associated host PFNs for vGPU.
>> + * @vaddr [in]: array of guest PFNs
>> + * @npage [in]: count of array elements
>> + * @prot [in] : protection flags
>> + * @pfn_base[out] : array of host PFNs
>> + */
>> +int vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
>> +		   int prot, dma_addr_t *pfn_base)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +	struct vfio_domain *domain = NULL, *domain_vgpu = NULL;
>> +	int i = 0, ret = 0;
>> +	long retpage;
>> +	dma_addr_t remote_vaddr = 0;
>> +	dma_addr_t *pfn = pfn_base;
>> +	struct vfio_dma *dma;
>> +
>> +	if (!iommu || !pfn_base)
>> +		return -EINVAL;
>> +
>> +	if (list_empty(&iommu->domain_list)) {
>> +		ret = -EINVAL;
>> +		goto pin_done;
>> +	}
>> +
>> +	get_first_domains(iommu, &domain, &domain_vgpu);
>> +
>> +	// Return error if vGPU domain doesn't exist
> 
> No c++ style comments please.
> 
>> +	if (!domain_vgpu) {
>> +		ret = -EINVAL;
>> +		goto pin_done;
>> +	}
>> +
>> +	for (i = 0; i < npage; i++) {
>> +		struct vfio_vgpu_pfn *p;
>> +		struct vfio_vgpu_pfn *lpfn;
>> +		unsigned long tpfn;
>> +		dma_addr_t iova;
>> +
>> +		mutex_lock(&iommu->lock);
>> +
>> +		iova = vaddr[i] << PAGE_SHIFT;
>> +
>> +		dma = vfio_find_dma(iommu, iova, 0 /*  size */);
>> +		if (!dma) {
>> +			mutex_unlock(&iommu->lock);
>> +			ret = -EINVAL;
>> +			goto pin_done;
>> +		}
>> +
>> +		remote_vaddr = dma->vaddr + iova - dma->iova;
>> +
>> +		retpage = vfio_pin_pages_internal(domain_vgpu, remote_vaddr,
>> +						  (long)1, prot, &tpfn);
>> +		mutex_unlock(&iommu->lock);
>> +		if (retpage <= 0) {
>> +			WARN_ON(!retpage);
>> +			ret = (int)retpage;
>> +			goto pin_done;
>> +		}
>> +
>> +		pfn[i] = tpfn;
>> +
>> +		mutex_lock(&domain_vgpu->lock);
>> +
>> +		// search if pfn exist
>> +		if ((p = vfio_find_vgpu_pfn(domain_vgpu, tpfn))) {
>> +			atomic_inc(&p->ref_count);
>> +			mutex_unlock(&domain_vgpu->lock);
>> +			continue;
>> +		}
> 
> The only reason I can come up with for why we'd want to integrate an
> api-only domain into the existing type1 code would be to avoid page
> accounting issues where we count locked pages once for a normal
> assigned device and again for a vgpu, but that's not what we're doing
> here.  We're not only locking the pages again regardless of them
> already being locked, we're counting every time we lock them through
> this new interface.  So there's really no point at all to making type1
> become this unsupportable.  In that case we should be pulling out the
> common code that we want to share from type1 and making a new type1
> compatible vfio iommu backend rather than conditionalizing everything
> here.
> 
>> +
>> +		// add to pfn_list
>> +		lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
>> +		if (!lpfn) {
>> +			ret = -ENOMEM;
>> +			mutex_unlock(&domain_vgpu->lock);
>> +			goto pin_done;
>> +		}
>> +		lpfn->vmm_va = remote_vaddr;
>> +		lpfn->iova = iova;
>> +		lpfn->pfn = pfn[i];
>> +		lpfn->npage = 1;
>> +		lpfn->prot = prot;
>> +		atomic_inc(&lpfn->ref_count);
>> +		vfio_link_vgpu_pfn(domain_vgpu, lpfn);
>> +		mutex_unlock(&domain_vgpu->lock);
>> +	}
>> +
>> +	ret = i;
>> +
>> +pin_done:
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(vfio_pin_pages);
>> +

IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
hardware. It just, as you said in another mail, "rather than
programming them into an IOMMU for a device, it simply stores the
translations for use by later requests".

That imposes a constraint on the gfx driver: the hardware IOMMU must be disabled.
Otherwise, if an IOMMU is present, the gfx driver eventually programs
the hardware IOMMU with the IOVA returned by pci_map_page or dma_map_page,
while the IOMMU backend for vgpu only maintains GPA <-> HPA
translations without any knowledge of the hardware IOMMU. How is the
device model then supposed to get an IOVA for a given GPA (and thereby for the
HPA tracked by the IOMMU backend here)?

If things work as guessed above, where vfio_pin_pages() pins and translates
a vaddr to a PFN, then it will be very difficult for the
device model to figure out:

	1, for a given GPA, how to avoid calling dma_map_page multiple times?
	2, for which page to call dma_unmap_page?

--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-03 22:43     ` [Qemu-devel] " Alex Williamson
@ 2016-05-05  7:51       ` Kirti Wankhede
  0 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-05  7:51 UTC (permalink / raw)
  To: Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, kevin.tian, shuai.ruan,
	jike.song, zhiyuan.lv


On 5/4/2016 4:13 AM, Alex Williamson wrote:
 > On Tue, 3 May 2016 00:10:41 +0530
 > Kirti Wankhede <kwankhede@nvidia.com> wrote:
 >
[..]


 >> +	if (domain->vfio_iommu_api_only)
 >> +		mm = domain->vmm_mm;
 >> +	else
 >> +		mm = current->mm;
 >> +
 >> +	if (!mm)
 >> +		return -ENODEV;
 >> +
 >> +	ret = vaddr_get_pfn(mm, vaddr, prot, pfn_base);
 >
 > We could pass domain->mm unconditionally to vaddr_get_pfn(), let it be
 > NULL in the !api_only case and use it as a cue to vaddr_get_pfn() which
 > gup variant to use.  Of course we need to deal with mmap_sem somewhere
 > too without turning the code into swiss cheese.
 >

Yes, I missed that. Thanks for pointing out. I'll fix it.
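
A rough sketch of that suggestion, for reference (hypothetical rework of the
helper; the exact get_user_pages* signatures vary between kernel versions, so
this only shows the structure of using a NULL mm as the cue):

    static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
                             int prot, unsigned long *pfn)
    {
        struct page *page[1];
        int ret;

        if (!mm) {
            /* pinning on behalf of the current task */
            ret = get_user_pages_fast(vaddr, 1,
                                      !!(prot & IOMMU_WRITE), page);
        } else {
            /* external mm (e.g. the VMM's): mmap_sem must be taken */
            down_read(&mm->mmap_sem);
            ret = get_user_pages_remote(NULL, mm, vaddr, 1,
                                        !!(prot & IOMMU_WRITE), 0,
                                        page, NULL);
            up_read(&mm->mmap_sem);
        }

        if (ret == 1) {
            *pfn = page_to_pfn(page[0]);
            return 0;
        }
        return ret < 0 ? ret : -EFAULT;
    }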

 > Correct me if I'm wrong, but I assume the main benefit of interweaving
 > this into type1 vs pulling out common code and making a new vfio iommu
 > backend is the page accounting, ie. not over accounting locked pages.
 > TBH, I don't know if it's worth it.  Any idea what the high water mark
 > of pinned pages for a vgpu might be?
 >

It depends on which guest OS (Linux/Windows) is running in the VM and what 
workload is running. On Windows, DMA pages are managed by the WDDM model. On 
Linux, each user space application can DMA to pages and there are no restrictions.


 > The only reason I can come up with for why we'd want to integrate an
 > api-only domain into the existing type1 code would be to avoid page
 > accounting issues where we count locked pages once for a normal
 > assigned device and again for a vgpu, but that's not what we're doing
 > here.  We're not only locking the pages again regardless of them
 > already being locked, we're counting every time we lock them through
 > this new interface.  So there's really no point at all to making type1
 > become this unsupportable.  In that case we should be pulling out the
 > common code that we want to share from type1 and making a new type1
 > compatible vfio iommu backend rather than conditionalizing everything
 > here.
 >

I tried to add pfn tracking logic and reuse already locked pages, but that 
didn't work somehow; I'll revisit it.
With this approach there will be additional pfn tracking logic for the case 
where a device is directly assigned and no vGPU device is present.



 >> +		// verify if pfn exist in pfn_list
 >> +		if (!(p = vfio_find_vgpu_pfn(domain_vgpu, *(pfn + i)))) {
 >> +			continue;
 >
 > How does the caller deal with this, the function returns number of
 > pages unpinned which will not match the requested number of pages to
 > unpin if there are any missing.  Also, no setting variables within a
 > test when easily avoidable please, separate to a set then test.
 >

Here we are following the existing code logic. Do you have any suggestion 
on how to deal with that?


 >> +	get_first_domains(iommu, &domain, &domain_vgpu);
 >> +
 >> +	if (!domain)
 >> +		return;
 >> +
 >> +	d = domain;
 >>  	list_for_each_entry_continue(d, &iommu->domain_list, next) {
 >> -		iommu_unmap(d->domain, dma->iova, dma->size);
 >> -		cond_resched();
 >> +		if (!d->vfio_iommu_api_only) {
 >> +			iommu_unmap(d->domain, dma->iova, dma->size);
 >> +			cond_resched();
 >> +		}
 >>  	}
 >>
 >>  	while (iova < end) {
 >
 > How do api-only domain not blowup on the iommu API code in this next
 > code block?  Are you just getting lucky that the api-only domain is
 > first in the list and the real domain is last?
 >

Control will not reach here if there is no domain with an IOMMU, due to the 
change below:

 >> +	if (!domain)
 >> +		return;

get_first_domains() returns the first domain with an IOMMU and the first 
api_only domain.


 >> +		if (d->vfio_iommu_api_only)
 >> +			continue;
 >> +
 >
 > Really disliking all these switches everywhere, too many different code
 > paths.
 >

I'll move such checks into inline functions so that the code looks much 
cleaner.


 >> +	// Skip pin and map only if domain without IOMMU is present
 >> +	if (!domain_with_iommu_present) {
 >> +		dma->size = size;
 >> +		goto map_done;
 >> +	}
 >> +
 >
 > Yet more special cases, the code is getting unsupportable.

In vfio_dma_do_map(), if no devices are passed through then we don't want 
to pin all pages upfront, and that is the reason for this check.

 >
 > I'm really not convinced that pushing this into the type1 code is the
 > right approach vs pulling out shareable code chunks where it makes
 > sense and creating a separate iommu backend.  We're not getting
 > anything but code complexity out of this approach it seems.

I find that pulling out the shared code is also not simple. I would like to 
revisit this and sort out the concerns you raised rather than 
creating a separate module.

Thanks,
Kirti.



^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-04 16:57         ` [Qemu-devel] " Alex Williamson
@ 2016-05-05  8:58           ` Tian, Kevin
  0 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-05  8:58 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, May 05, 2016 12:57 AM
> 
> On Wed, 4 May 2016 02:45:59 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson
> > > Sent: Wednesday, May 04, 2016 6:44 AM
> > >
> > > > diff --git a/drivers/vgpu/Kconfig b/drivers/vgpu/Kconfig
> > > > new file mode 100644
> > > > index 0000000..792eb48
> > > > --- /dev/null
> > > > +++ b/drivers/vgpu/Kconfig
> > > > @@ -0,0 +1,21 @@
> > > > +
> > > > +menuconfig VGPU
> > > > +    tristate "VGPU driver framework"
> > > > +    depends on VFIO
> > > > +    select VGPU_VFIO
> > > > +    help
> > > > +        VGPU provides a framework to virtualize GPU without SR-IOV cap
> > > > +        See Documentation/vgpu.txt for more details.
> > > > +
> > > > +        If you don't know what do here, say N.
> > > > +
> > > > +config VGPU
> > > > +    tristate
> > > > +    depends on VFIO
> > > > +    default n
> > > > +
> > > > +config VGPU_VFIO
> > > > +    tristate
> > > > +    depends on VGPU
> > > > +    default n
> > > > +
> > >
> > > This is a little bit convoluted, it seems like everything added in this
> > > patch is vfio agnostic, it doesn't necessarily care what the consumer
> > > is.  That makes me think we should only be adding CONFIG_VGPU here and
> > > it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
> > > The middle config entry is also redundant to the first, just move the
> > > default line up to the first and remove the rest.
> >
> > Agree. Removing such dependency also benefits other hypervisor if
> > VFIO is not used.
> >
> > Alex, there is one idea which I'd like to hear your comment. When looking at
> > the whole series, we can see the majority logic (maybe I cannot say 100%)
> > is GPU agnostic. Same frameworks in VFIO and vGPU core are actually neutral
> > to underlying device type, which e.g. can be easily applied to a NIC card too
> > if a similar technology is developed there.
> >
> > Do you think whether we'd better make framework not GPU specific now
> > (mostly naming change), or continue current style and change later only
> > when there is a real implementation on a different device?
> 
> Yeah, I see that too and I made a bunch of comments in patch 3 that
> we're not doing anything vGPU specific and we should be careful about
> assuming the user for the various interfaces.  In patch 1, we are
> fairly v/GPU specific because we're dealing with how vGPUs are created
> from the physical GPU.  Maybe the interface is general, maybe it's not,
> it's hard to say.  Starting with patch 2 though, we really shouldn't
> know or care what the device is beyond a PCI compatible device.  We're
> just trying to create a vfio bus driver compatible with vfio-pci and
> offload enough generic operations so that we don't need to pass
> everything back to the vendor driver.  Patch 3 of course should be
> completely device agnostic, we should only care that the vfio backend
> provides mediation of the device, so an iommu is not required.  It may
> be too much of a rathole to try to completely generalize the interface
> at this point, but let's certainly try not to let vgpu specific ideas
> spread beyond where we need.  Thanks,
> 

Even for patch 1, the current implementation can apply to any PCI device
if we just replace vgpu with another name. There is nothing (or very little)
that is really GPU specific. But I don't feel strongly about this point, since
I somewhat agree that without a second actual user some abstractions here may
only make sense for vgpu. So... just raising this thought to hear your comments. :-)

By the way, a curious question: I know you, Alex, can give the final call on
VFIO-specific code. What about the vGPU core framework? It creates a new
category under the drivers directory (drivers/vgpu). Who else is required to
review and ack that part?

Thanks
Kevin


^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-04 13:31       ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-05  9:06         ` Tian, Kevin
  0 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-05  9:06 UTC (permalink / raw)
  To: Kirti Wankhede, Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan, Shuai, Song, Jike,
	Lv, Zhiyuan

> From: Kirti Wankhede
> Sent: Wednesday, May 04, 2016 9:32 PM
> 
> Thanks Alex.
> 
>  >> +config VGPU_VFIO
>  >> +    tristate
>  >> +    depends on VGPU
>  >> +    default n
>  >> +
>  >
>  > This is a little bit convoluted, it seems like everything added in this
>  > patch is vfio agnostic, it doesn't necessarily care what the consumer
>  > is.  That makes me think we should only be adding CONFIG_VGPU here and
>  > it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
>  > The middle config entry is also redundant to the first, just move the
>  > default line up to the first and remove the rest.
> 
> CONFIG_VGPU doesn't directly depend on VFIO. CONFIG_VGPU_VFIO is
> directly dependent on VFIO. But devices created by VGPU core module need
> a driver to manage those devices. CONFIG_VGPU_VFIO is the driver which
> will manage vgpu devices. So I think CONFIG_VGPU_VFIO should be enabled
> by CONFIG_VGPU.
> 
> This would look like:
> menuconfig VGPU
>      tristate "VGPU driver framework"
>      select VGPU_VFIO
>      default n
>      help
>          VGPU provides a framework to virtualize GPU without SR-IOV cap
>          See Documentation/vgpu.txt for more details.
> 
>          If you don't know what do here, say N.
> 
> config VGPU_VFIO
>      tristate
>      depends on VGPU
>      depends on VFIO
>      default n
> 

There could be multiple drivers operating a vGPU. Why do we restrict
it to VFIO here?

>  >> +create_attr_error:
>  >> +	if (gpu_dev->ops->vgpu_destroy) {
>  >> +		int ret = 0;
>  >> +		ret = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
>  >> +						 vgpu_dev->uuid,
>  >> +						 vgpu_dev->vgpu_instance);
>  >
>  > Unnecessary initialization and we don't do anything with the result.
>  > Below indicates lack of vgpu_destroy indicates the vendor doesn't
>  > support unplug, but doesn't that break our error cleanup path here?
>  >
> 
> Comment about vgpu_destroy:
> If VM is running and vgpu_destroy is called that
> means the vGPU is being hotunpluged. Return
> error if VM is running and graphics driver
> doesn't support vgpu hotplug.
> 
> Its GPU drivers responsibility to check if VM is running and return
> accordingly. This is vgpu creation path. Vgpu device would be hotplug to
> VM on vgpu_start.

How does the GPU driver know whether the VM is running? The VM is managed
by KVM here.

Maybe it's clearer to say whether the vGPU is busy, meaning some work
has been loaded onto the vGPU. That's something the GPU driver can tell.

> 
>  >> + * @vgpu_bar_info:		Called to get BAR size and flags of vGPU device.
>  >> + *				@vdev: vgpu device structure
>  >> + *				@bar_index: BAR index
>  >> + *				@bar_info: output, returns size and flags of
>  >> + *				requested BAR
>  >> + *				Returns integer: success (0) or error (< 0)
>  >
>  > This is called bar_info, but the bar_index is actually the vfio region
>  > index and things like the config region info is being overloaded
>  > through it.  We already have a structure defined for getting a generic
>  > region index, why not use it?  Maybe this should just be
>  > vgpu_vfio_get_region_info.
>  >
> 
> Ok. Will do.

As you commented earlier, the GPU driver is required to provide the config
space (which I agree with), so what's the point of introducing another
BAR-specific structure? VFIO can use @write to get BAR information 
from the vgpu config space, just like it's done on a physical device today.

> 
> 
>  >> + * @validate_map_request:	Validate remap pfn request
>  >> + *				@vdev: vgpu device structure
>  >> + *				@virtaddr: target user address to start at
>  >> + *				@pfn: physical address of kernel memory, GPU
>  >> + *				driver can change if required.
>  >> + *				@size: size of map area, GPU driver can change
>  >> + *				the size of map area if desired.
>  >> + *				@prot: page protection flags for this mapping,
>  >> + *				GPU driver can change, if required.
>  >> + *				Returns integer: success (0) or error (< 0)
>  >
>  > Was not at all clear to me what this did until I got to patch 2, this
>  > is actually providing the fault handling for mmap'ing a vGPU mmio BAR.
>  > Needs a better name or better description.
>  >
> 
> If say VMM mmap whole BAR1 of GPU, say 128MB, so fault would occur when
> BAR1 is tried to access then the size is calculated as:
> req_size = vma->vm_end - virtaddr
> Since GPU is being shared by multiple vGPUs, GPU driver might not remap
> whole BAR1 for only one vGPU device, so would prefer, say map one page
> at a time. GPU driver returns PAGE_SIZE. This is used by
> remap_pfn_range(). Now on next access to BAR1 other than that page, we
> will again get a fault().
> As the name says this call is to validate from GPU driver for the size
> and prot of map area. GPU driver can change size and prot for this map area.

Currently we don't require such an interface for Intel vGPU. We need to think about
its rationale carefully (it's still not clear to me). Jike, do you have any thoughts on
this?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 2/3] VFIO driver for vGPU device
  2016-05-04 17:06         ` [Qemu-devel] " Alex Williamson
@ 2016-05-05  9:24           ` Tian, Kevin
  0 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-05  9:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan

> From: Alex Williamson
> Sent: Thursday, May 05, 2016 1:06 AM
> > > > +
> > > > +static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault
> *vmf)
> > > > +{
> > > > +	int ret = 0;
> > > > +	struct vfio_vgpu_device *vdev = vma->vm_private_data;
> > > > +	struct vgpu_device *vgpu_dev;
> > > > +	struct gpu_device *gpu_dev;
> > > > +	u64 virtaddr = (u64)vmf->virtual_address;
> > > > +	u64 offset, phyaddr;
> > > > +	unsigned long req_size, pgoff;
> > > > +	pgprot_t pg_prot;
> > > > +
> > > > +	if (!vdev && !vdev->vgpu_dev)
> > > > +		return -EINVAL;
> > > > +
> > > > +	vgpu_dev = vdev->vgpu_dev;
> > > > +	gpu_dev  = vgpu_dev->gpu_dev;
> > > > +
> > > > +	offset   = vma->vm_pgoff << PAGE_SHIFT;
> > > > +	phyaddr  = virtaddr - vma->vm_start + offset;
> > > > +	pgoff    = phyaddr >> PAGE_SHIFT;
> > > > +	req_size = vma->vm_end - virtaddr;
> > > > +	pg_prot  = vma->vm_page_prot;
> > > > +
> > > > +	if (gpu_dev->ops->validate_map_request) {
> > > > +		ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr,
> &pgoff,
> > > > +							 &req_size, &pg_prot);
> > > > +		if (ret)
> > > > +			return ret;
> > > > +
> > > > +		if (!req_size)
> > > > +			return -EINVAL;
> > > > +	}
> > > > +
> > > > +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
> > >
> > > So not supporting validate_map_request() means that the user can
> > > directly mmap BARs of the host GPU and as shown below, we assume a 1:1
> > > mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
> > > scenario or should this callback be required?  It's not clear to me how
> > > the vendor driver determines what this maps to, do they compare it to
> > > the physical device's own BAR addresses?
> >
> > I didn't quite understand too. Based on earlier discussion, do we need
> > something like this, or could achieve the purpose just by leveraging
> > recent sparse mmap support?
> 
> The reason for faulting in the mmio space, if I recall correctly, is to
> enable an ordering where the user driver (QEMU) can mmap regions of the
> device prior to resources being allocated on the host GPU to handle
> them.  Sparse mmap only partially handles that, it's not dynamic.  With
> this faulting mechanism, the host GPU doesn't need to commit resources
> until the mmap is actually accessed.  Thanks,
> 
> Alex

Neo/Kirti, do you have a specific example of how the above works exactly? I can
see the difference from sparse mmap based on Alex's explanation, but I still
cannot map the first sentence to a real scenario clearly. Our side currently
doesn't use such a faulting-based method, so I'd like to understand it
clearly and then see whether there is any value in doing the same for the Intel GPU.

Thanks
Kevin
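
One illustrative guess at how such a callback could behave in practice (this
is not NVIDIA's actual implementation; vgpu_to_host_pfn() is a made-up lookup
standing in for the vendor driver's own BAR resource management):

    static int my_validate_map_request(struct vgpu_device *vgpu, u64 virtaddr,
                                       unsigned long *pgoff,
                                       unsigned long *req_size, pgprot_t *prot)
    {
        unsigned long host_pfn;

        /* pick (or allocate right now) the host backing for this vGPU page */
        host_pfn = vgpu_to_host_pfn(vgpu, *pgoff);   /* hypothetical */
        if (!host_pfn)
            return -EINVAL;

        *pgoff    = host_pfn;
        *req_size = PAGE_SIZE;               /* map one page per fault */
        *prot     = pgprot_noncached(*prot);

        return 0;
    }

With something like this, QEMU can mmap the whole BAR up front while the host
GPU only backs the pages the guest actually touches, one fault at a time.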

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-05  6:55       ` [Qemu-devel] " Jike Song
@ 2016-05-05  9:27         ` Tian, Kevin
  0 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-05  9:27 UTC (permalink / raw)
  To: Song, Jike, Alex Williamson
  Cc: Ruan, Shuai, cjia, kvm, qemu-devel, Kirti Wankhede, kraxel,
	pbonzini, Lv, Zhiyuan

> From: Song, Jike
> Sent: Thursday, May 05, 2016 2:56 PM
> >
> > The only reason I can come up with for why we'd want to integrate an
> > api-only domain into the existing type1 code would be to avoid page
> > accounting issues where we count locked pages once for a normal
> > assigned device and again for a vgpu, but that's not what we're doing
> > here.  We're not only locking the pages again regardless of them
> > already being locked, we're counting every time we lock them through
> > this new interface.  So there's really no point at all to making type1
> > become this unsupportable.  In that case we should be pulling out the
> > common code that we want to share from type1 and making a new type1
> > compatible vfio iommu backend rather than conditionalizing everything
> > here.
> >
> >> +
> >> +		// add to pfn_list
> >> +		lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
> >> +		if (!lpfn) {
> >> +			ret = -ENOMEM;
> >> +			mutex_unlock(&domain_vgpu->lock);
> >> +			goto pin_done;
> >> +		}
> >> +		lpfn->vmm_va = remote_vaddr;
> >> +		lpfn->iova = iova;
> >> +		lpfn->pfn = pfn[i];
> >> +		lpfn->npage = 1;
> >> +		lpfn->prot = prot;
> >> +		atomic_inc(&lpfn->ref_count);
> >> +		vfio_link_vgpu_pfn(domain_vgpu, lpfn);
> >> +		mutex_unlock(&domain_vgpu->lock);
> >> +	}
> >> +
> >> +	ret = i;
> >> +
> >> +pin_done:
> >> +	return ret;
> >> +}
> >> +EXPORT_SYMBOL(vfio_pin_pages);
> >> +
> 
> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> hardware. It just, as you said in another mail, "rather than
> programming them into an IOMMU for a device, it simply stores the
> translations for use by later requests".
> 
> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> Otherwise, if IOMMU is present, the gfx driver eventually programs
> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> translations without any knowledge about hardware IOMMU, how is the
> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> by the IOMMU backend here)?
> 
> If things go as guessed above, as vfio_pin_pages() indicates, it
> pin & translate vaddr to PFN, then it will be very difficult for the
> device model to figure out:
> 
> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> 	2, for which page to call dma_unmap_page?
> 
> --

We have to support both the w/ iommu and w/o iommu cases, since
that fact is out of the GPU driver's control. A simple way is to use
dma_map_page, which internally copes with the w/ and w/o iommu
cases gracefully, i.e. returns HPA w/o iommu and IOVA w/ iommu.
Then in this file we only need to cache the GPA against whatever dma_addr_t
is returned by dma_map_page.
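
A minimal sketch of that caching idea (the lookup/insert helpers and the
wrapper name below are hypothetical, not part of the posted patches; it
assumes <linux/dma-mapping.h> and a per-device cache keyed by GPA):

struct gpa_dma_node {
	u64		gpa;
	dma_addr_t	dma;
};

/*
 * Map a guest page once and remember GPA -> dma_addr_t; dma_map_page()
 * returns an IOVA with an IOMMU present and an HPA without one, so the
 * caller never needs to care which case it is.
 */
static int vgpu_dm_map_gpa(struct device *dev, u64 gpa, struct page *page,
			   dma_addr_t *dma)
{
	struct gpa_dma_node *n = gpa_dma_lookup(gpa);	/* hypothetical cache lookup */

	if (n) {			/* already mapped: reuse, no second dma_map_page() */
		*dma = n->dma;
		return 0;
	}

	n = kzalloc(sizeof(*n), GFP_KERNEL);
	if (!n)
		return -ENOMEM;

	n->gpa = gpa;
	n->dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev, n->dma)) {
		kfree(n);
		return -EFAULT;
	}
	gpa_dma_insert(n);		/* hypothetical cache insert */
	*dma = n->dma;
	return 0;
}

Teardown would walk the same cache and call dma_unmap_page() once per
entry, which also addresses the "which page to unmap" question above.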

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-05  9:06         ` [Qemu-devel] " Tian, Kevin
@ 2016-05-05 10:44           ` Kirti Wankhede
  -1 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-05 10:44 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan, Shuai, Song, Jike,
	Lv, Zhiyuan



On 5/5/2016 2:36 PM, Tian, Kevin wrote:
>> From: Kirti Wankhede
>> Sent: Wednesday, May 04, 2016 9:32 PM
>>
>> Thanks Alex.
>>
>>  >> +config VGPU_VFIO
>>  >> +    tristate
>>  >> +    depends on VGPU
>>  >> +    default n
>>  >> +
>>  >
>>  > This is a little bit convoluted, it seems like everything added in this
>>  > patch is vfio agnostic, it doesn't necessarily care what the consumer
>>  > is.  That makes me think we should only be adding CONFIG_VGPU here and
>>  > it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
>>  > The middle config entry is also redundant to the first, just move the
>>  > default line up to the first and remove the rest.
>>
>> CONFIG_VGPU doesn't directly depend on VFIO. CONFIG_VGPU_VFIO is
>> directly dependent on VFIO. But devices created by VGPU core module need
>> a driver to manage those devices. CONFIG_VGPU_VFIO is the driver which
>> will manage vgpu devices. So I think CONFIG_VGPU_VFIO should be enabled
>> by CONFIG_VGPU.
>>
>> This would look like:
>> menuconfig VGPU
>>      tristate "VGPU driver framework"
>>      select VGPU_VFIO
>>      default n
>>      help
>>          VGPU provides a framework to virtualize GPU without SR-IOV cap
>>          See Documentation/vgpu.txt for more details.
>>
>>          If you don't know what do here, say N.
>>
>> config VGPU_VFIO
>>      tristate
>>      depends on VGPU
>>      depends on VFIO
>>      default n
>>
>
> There could be multiple drivers operating VGPU. Why do we restrict
> it to VFIO here?
>

VGPU_VFIO uses VFIO APIs, so it depends on VFIO.
I think, since there is no driver other than VGPU_VFIO for VGPU devices,
we should keep the default selection of VGPU_VFIO on VGPU. Maybe in the
future, if another driver is added to operate vGPU devices, the default
selection can be removed.

>>  >> +create_attr_error:
>>  >> +	if (gpu_dev->ops->vgpu_destroy) {
>>  >> +		int ret = 0;
>>  >> +		ret = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
>>  >> +						 vgpu_dev->uuid,
>>  >> +						 vgpu_dev->vgpu_instance);
>>  >
>>  > Unnecessary initialization and we don't do anything with the result.
>>  > Below indicates lack of vgpu_destroy indicates the vendor doesn't
>>  > support unplug, but doesn't that break our error cleanup path here?
>>  >
>>
>> Comment about vgpu_destroy:
>> If VM is running and vgpu_destroy is called that
>> means the vGPU is being hotunpluged. Return
>> error if VM is running and graphics driver
>> doesn't support vgpu hotplug.
>>
>> Its GPU drivers responsibility to check if VM is running and return
>> accordingly. This is vgpu creation path. Vgpu device would be hotplug to
>> VM on vgpu_start.
>
> How does GPU driver know whether VM is running? VM is managed
> by KVM here.
>
> Maybe it's clearer to say whether vGPU is busy which means some work
> has been loaded to vGPU. That's something GPU driver can tell.
>

The GPU driver can detect this based on the resources allocated for the VM
from vgpu_create/vgpu_start.

>>
>>  >> + * @vgpu_bar_info:		Called to get BAR size and flags of vGPU device.
>>  >> + *				@vdev: vgpu device structure
>>  >> + *				@bar_index: BAR index
>>  >> + *				@bar_info: output, returns size and flags of
>>  >> + *				requested BAR
>>  >> + *				Returns integer: success (0) or error (< 0)
>>  >
>>  > This is called bar_info, but the bar_index is actually the vfio region
>>  > index and things like the config region info is being overloaded
>>  > through it.  We already have a structure defined for getting a generic
>>  > region index, why not use it?  Maybe this should just be
>>  > vgpu_vfio_get_region_info.
>>  >
>>
>> Ok. Will do.
>
> As you commented earlier that GPU driver is required to provide config
> space (which I agree), then what's the point of introducing another
> bar specific structure? VFIO can use @write to get bar information
> from vgpu config space, just like how it's done on physical device today.
>

It is required not only for the size, but also to fetch the flags. Region
flags could be a combination of:

#define VFIO_REGION_INFO_FLAG_READ      (1 << 0) /* Region supports read */
#define VFIO_REGION_INFO_FLAG_WRITE     (1 << 1) /* Region supports write */
#define VFIO_REGION_INFO_FLAG_MMAP      (1 << 2) /* Region supports mmap */
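
As a rough illustration (the bar_info structure layout here is only a guess
at what this RFC intends, not a final API), a vendor callback could report
both pieces like:

struct vgpu_bar_info {
	u64	size;
	u32	flags;
};

/* Sketch: a 16MB MMIO BAR that supports read, write and mmap */
static int example_vgpu_bar_info(struct vgpu_device *vdev, int bar_index,
				 struct vgpu_bar_info *info)
{
	info->size  = 16 * 1024 * 1024;
	info->flags = VFIO_REGION_INFO_FLAG_READ |
		      VFIO_REGION_INFO_FLAG_WRITE |
		      VFIO_REGION_INFO_FLAG_MMAP;
	return 0;
}

VGPU_VFIO could then copy these straight into its
VFIO_DEVICE_GET_REGION_INFO reply.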

Thanks,
Kirti.

>>
>>
>>  >> + * @validate_map_request:	Validate remap pfn request
>>  >> + *				@vdev: vgpu device structure
>>  >> + *				@virtaddr: target user address to start at
>>  >> + *				@pfn: physical address of kernel memory, GPU
>>  >> + *				driver can change if required.
>>  >> + *				@size: size of map area, GPU driver can change
>>  >> + *				the size of map area if desired.
>>  >> + *				@prot: page protection flags for this mapping,
>>  >> + *				GPU driver can change, if required.
>>  >> + *				Returns integer: success (0) or error (< 0)
>>  >
>>  > Was not at all clear to me what this did until I got to patch 2, this
>>  > is actually providing the fault handling for mmap'ing a vGPU mmio BAR.
>>  > Needs a better name or better description.
>>  >
>>
>> If say VMM mmap whole BAR1 of GPU, say 128MB, so fault would occur when
>> BAR1 is tried to access then the size is calculated as:
>> req_size = vma->vm_end - virtaddr
>> Since GPU is being shared by multiple vGPUs, GPU driver might not remap
>> whole BAR1 for only one vGPU device, so would prefer, say map one page
>> at a time. GPU driver returns PAGE_SIZE. This is used by
>> remap_pfn_range(). Now on next access to BAR1 other than that page, we
>> will again get a fault().
>> As the name says this call is to validate from GPU driver for the size
>> and prot of map area. GPU driver can change size and prot for this map area.
>
> Currently we don't require such interface for Intel vGPU. Need to think about
> its rationale carefully (still not clear to me). Jike, do you have any thought on
> this?
>
> Thanks
> Kevin
>

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-05 10:44           ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-05 12:07             ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-05 12:07 UTC (permalink / raw)
  To: Kirti Wankhede, Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan, Shuai, Song, Jike,
	Lv, Zhiyuan

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Thursday, May 05, 2016 6:45 PM
> 
> 
> On 5/5/2016 2:36 PM, Tian, Kevin wrote:
> >> From: Kirti Wankhede
> >> Sent: Wednesday, May 04, 2016 9:32 PM
> >>
> >> Thanks Alex.
> >>
> >>  >> +config VGPU_VFIO
> >>  >> +    tristate
> >>  >> +    depends on VGPU
> >>  >> +    default n
> >>  >> +
> >>  >
> >>  > This is a little bit convoluted, it seems like everything added in this
> >>  > patch is vfio agnostic, it doesn't necessarily care what the consumer
> >>  > is.  That makes me think we should only be adding CONFIG_VGPU here and
> >>  > it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
> >>  > The middle config entry is also redundant to the first, just move the
> >>  > default line up to the first and remove the rest.
> >>
> >> CONFIG_VGPU doesn't directly depend on VFIO. CONFIG_VGPU_VFIO is
> >> directly dependent on VFIO. But devices created by VGPU core module need
> >> a driver to manage those devices. CONFIG_VGPU_VFIO is the driver which
> >> will manage vgpu devices. So I think CONFIG_VGPU_VFIO should be enabled
> >> by CONFIG_VGPU.
> >>
> >> This would look like:
> >> menuconfig VGPU
> >>      tristate "VGPU driver framework"
> >>      select VGPU_VFIO
> >>      default n
> >>      help
> >>          VGPU provides a framework to virtualize GPU without SR-IOV cap
> >>          See Documentation/vgpu.txt for more details.
> >>
> >>          If you don't know what do here, say N.
> >>
> >> config VGPU_VFIO
> >>      tristate
> >>      depends on VGPU
> >>      depends on VFIO
> >>      default n
> >>
> >
> > There could be multiple drivers operating VGPU. Why do we restrict
> > it to VFIO here?
> >
> 
> VGPU_VFIO uses VFIO APIs, it depends on VFIO.
> I think since there is no other driver than VGPU_VFIO for VGPU devices,
> we should keep default selection of VGPU_VFIO on VGPU. May be in future
> if other driver is add ti operate vGPU devices, then default selection
> can be removed.

What's your plan to support Xen here?

> 
> >>  >> +create_attr_error:
> >>  >> +	if (gpu_dev->ops->vgpu_destroy) {
> >>  >> +		int ret = 0;
> >>  >> +		ret = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
> >>  >> +						 vgpu_dev->uuid,
> >>  >> +						 vgpu_dev->vgpu_instance);
> >>  >
> >>  > Unnecessary initialization and we don't do anything with the result.
> >>  > Below indicates lack of vgpu_destroy indicates the vendor doesn't
> >>  > support unplug, but doesn't that break our error cleanup path here?
> >>  >
> >>
> >> Comment about vgpu_destroy:
> >> If VM is running and vgpu_destroy is called that
> >> means the vGPU is being hotunpluged. Return
> >> error if VM is running and graphics driver
> >> doesn't support vgpu hotplug.
> >>
> >> Its GPU drivers responsibility to check if VM is running and return
> >> accordingly. This is vgpu creation path. Vgpu device would be hotplug to
> >> VM on vgpu_start.
> >
> > How does GPU driver know whether VM is running? VM is managed
> > by KVM here.
> >
> > Maybe it's clearer to say whether vGPU is busy which means some work
> > has been loaded to vGPU. That's something GPU driver can tell.
> >
> 
> GPU driver can detect based on resources allocated for the VM from
> vgpu_create/vgpu_start.

Yes, in that case the GPU driver only knows a vGPU is in use, not who is
using the vGPU (now it is a VM, in the future it could be a container). Anyway,
my point is just not to assume a VM and add a limitation when it's not necessary. :-)

> 
> >>
> >>  >> + * @vgpu_bar_info:		Called to get BAR size and flags of vGPU device.
> >>  >> + *				@vdev: vgpu device structure
> >>  >> + *				@bar_index: BAR index
> >>  >> + *				@bar_info: output, returns size and flags of
> >>  >> + *				requested BAR
> >>  >> + *				Returns integer: success (0) or error (< 0)
> >>  >
> >>  > This is called bar_info, but the bar_index is actually the vfio region
> >>  > index and things like the config region info is being overloaded
> >>  > through it.  We already have a structure defined for getting a generic
> >>  > region index, why not use it?  Maybe this should just be
> >>  > vgpu_vfio_get_region_info.
> >>  >
> >>
> >> Ok. Will do.
> >
> > As you commented earlier that GPU driver is required to provide config
> > space (which I agree), then what's the point of introducing another
> > bar specific structure? VFIO can use @write to get bar information
> > from vgpu config space, just like how it's done on physical device today.
> >
> 
> It is required not only for size, but also to fetch flags. Region flags
> could be combination of:
> 
> #define VFIO_REGION_INFO_FLAG_READ      (1 << 0) /* Region supports read */
> #define VFIO_REGION_INFO_FLAG_WRITE     (1 << 1) /* Region supports write */
> #define VFIO_REGION_INFO_FLAG_MMAP      (1 << 2) /* Region supports mmap */
> 
> Thanks,
> Kirti.

That makes sense.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-05 12:07             ` [Qemu-devel] " Tian, Kevin
@ 2016-05-05 12:57               ` Kirti Wankhede
  -1 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-05 12:57 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan, Shuai, Song, Jike,
	Lv, Zhiyuan



On 5/5/2016 5:37 PM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
>> Sent: Thursday, May 05, 2016 6:45 PM
>>
>>
>> On 5/5/2016 2:36 PM, Tian, Kevin wrote:
>>>> From: Kirti Wankhede
>>>> Sent: Wednesday, May 04, 2016 9:32 PM
>>>>
>>>> Thanks Alex.
>>>>
>>>>  >> +config VGPU_VFIO
>>>>  >> +    tristate
>>>>  >> +    depends on VGPU
>>>>  >> +    default n
>>>>  >> +
>>>>  >
>>>>  > This is a little bit convoluted, it seems like everything added in this
>>>>  > patch is vfio agnostic, it doesn't necessarily care what the consumer
>>>>  > is.  That makes me think we should only be adding CONFIG_VGPU here and
>>>>  > it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
>>>>  > The middle config entry is also redundant to the first, just move the
>>>>  > default line up to the first and remove the rest.
>>>>
>>>> CONFIG_VGPU doesn't directly depend on VFIO. CONFIG_VGPU_VFIO is
>>>> directly dependent on VFIO. But devices created by VGPU core module need
>>>> a driver to manage those devices. CONFIG_VGPU_VFIO is the driver which
>>>> will manage vgpu devices. So I think CONFIG_VGPU_VFIO should be enabled
>>>> by CONFIG_VGPU.
>>>>
>>>> This would look like:
>>>> menuconfig VGPU
>>>>      tristate "VGPU driver framework"
>>>>      select VGPU_VFIO
>>>>      default n
>>>>      help
>>>>          VGPU provides a framework to virtualize GPU without SR-IOV cap
>>>>          See Documentation/vgpu.txt for more details.
>>>>
>>>>          If you don't know what do here, say N.
>>>>
>>>> config VGPU_VFIO
>>>>      tristate
>>>>      depends on VGPU
>>>>      depends on VFIO
>>>>      default n
>>>>
>>>
>>> There could be multiple drivers operating VGPU. Why do we restrict
>>> it to VFIO here?
>>>
>>
>> VGPU_VFIO uses VFIO APIs, it depends on VFIO.
>> I think since there is no other driver than VGPU_VFIO for VGPU devices,
>> we should keep default selection of VGPU_VFIO on VGPU. May be in future
>> if other driver is add ti operate vGPU devices, then default selection
>> can be removed.
>
> What's your plan to support Xen here?
>

No plans to support Xen.

Thanks,
Kirti


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 2/3] VFIO driver for vGPU device
  2016-05-05  9:24           ` [Qemu-devel] " Tian, Kevin
@ 2016-05-05 20:27             ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-05 20:27 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson, Kirti Wankhede, pbonzini, kraxel, qemu-devel,
	kvm, Ruan, Shuai, Song, Jike, Lv, Zhiyuan

On Thu, May 05, 2016 at 09:24:26AM +0000, Tian, Kevin wrote:
> > From: Alex Williamson
> > Sent: Thursday, May 05, 2016 1:06 AM
> > > > > +
> > > > > +static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault
> > *vmf)
> > > > > +{
> > > > > +	int ret = 0;
> > > > > +	struct vfio_vgpu_device *vdev = vma->vm_private_data;
> > > > > +	struct vgpu_device *vgpu_dev;
> > > > > +	struct gpu_device *gpu_dev;
> > > > > +	u64 virtaddr = (u64)vmf->virtual_address;
> > > > > +	u64 offset, phyaddr;
> > > > > +	unsigned long req_size, pgoff;
> > > > > +	pgprot_t pg_prot;
> > > > > +
> > > > > +	if (!vdev && !vdev->vgpu_dev)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	vgpu_dev = vdev->vgpu_dev;
> > > > > +	gpu_dev  = vgpu_dev->gpu_dev;
> > > > > +
> > > > > +	offset   = vma->vm_pgoff << PAGE_SHIFT;
> > > > > +	phyaddr  = virtaddr - vma->vm_start + offset;
> > > > > +	pgoff    = phyaddr >> PAGE_SHIFT;
> > > > > +	req_size = vma->vm_end - virtaddr;
> > > > > +	pg_prot  = vma->vm_page_prot;
> > > > > +
> > > > > +	if (gpu_dev->ops->validate_map_request) {
> > > > > +		ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr,
> > &pgoff,
> > > > > +							 &req_size, &pg_prot);
> > > > > +		if (ret)
> > > > > +			return ret;
> > > > > +
> > > > > +		if (!req_size)
> > > > > +			return -EINVAL;
> > > > > +	}
> > > > > +
> > > > > +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
> > > >
> > > > So not supporting validate_map_request() means that the user can
> > > > directly mmap BARs of the host GPU and as shown below, we assume a 1:1
> > > > mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
> > > > scenario or should this callback be required?  It's not clear to me how
> > > > the vendor driver determines what this maps to, do they compare it to
> > > > the physical device's own BAR addresses?
> > >
> > > I didn't quite understand too. Based on earlier discussion, do we need
> > > something like this, or could achieve the purpose just by leveraging
> > > recent sparse mmap support?
> > 
> > The reason for faulting in the mmio space, if I recall correctly, is to
> > enable an ordering where the user driver (QEMU) can mmap regions of the
> > device prior to resources being allocated on the host GPU to handle
> > them.  Sparse mmap only partially handles that, it's not dynamic.  With
> > this faulting mechanism, the host GPU doesn't need to commit resources
> > until the mmap is actually accessed.  Thanks,
> > 
> > Alex
> 
> Neo/Kirti, any specific example how above exactly works? I can see
> difference from sparse mmap based on Alex's explanation, but still
> cannot map the 1st sentence to a real scenario clearly. Now our side
> doesn't use such faulting-based method. So I'd like to understand it
> clearly and then see any value to do same thing for Intel GPU.

Hi Kevin,

The short answer is CPU access to GPU resources via MMIO region.

Thanks,
Neo

> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-05  9:06         ` [Qemu-devel] " Tian, Kevin
@ 2016-05-06 12:14           ` Jike Song
  -1 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-06 12:14 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, Alex Williamson, pbonzini, kraxel, cjia,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On 05/05/2016 05:06 PM, Tian, Kevin wrote:
>> From: Kirti Wankhede
>>
>>  >> + * @validate_map_request:	Validate remap pfn request
>>  >> + *				@vdev: vgpu device structure
>>  >> + *				@virtaddr: target user address to start at
>>  >> + *				@pfn: physical address of kernel memory, GPU
>>  >> + *				driver can change if required.
>>  >> + *				@size: size of map area, GPU driver can change
>>  >> + *				the size of map area if desired.
>>  >> + *				@prot: page protection flags for this mapping,
>>  >> + *				GPU driver can change, if required.
>>  >> + *				Returns integer: success (0) or error (< 0)
>>  >
>>  > Was not at all clear to me what this did until I got to patch 2, this
>>  > is actually providing the fault handling for mmap'ing a vGPU mmio BAR.
>>  > Needs a better name or better description.
>>  >
>>
>> If say VMM mmap whole BAR1 of GPU, say 128MB, so fault would occur when
>> BAR1 is tried to access then the size is calculated as:
>> req_size = vma->vm_end - virtaddr
Hi Kirti,

virtaddr is the faulted address, and vma->vm_end is the end of the mmap-ed 128MB BAR1?

Would you elaborate on why (vm_end - fault_addr) yields the requested size?


>> Since GPU is being shared by multiple vGPUs, GPU driver might not remap
>> whole BAR1 for only one vGPU device, so would prefer, say map one page
>> at a time. GPU driver returns PAGE_SIZE. This is used by
>> remap_pfn_range(). Now on next access to BAR1 other than that page, we
>> will again get a fault().
>> As the name says this call is to validate from GPU driver for the size
>> and prot of map area. GPU driver can change size and prot for this map area.

If I understand correctly, you are trying to share a physical BAR among
multiple vGPUs, by mapping a single pfn each time a fault happens?

> 
> Currently we don't require such interface for Intel vGPU. Need to think about
> its rationale carefully (still not clear to me). Jike, do you have any thought on
> this?

We need the mmap method of vgpu_device to be implemented, but I was
expecting something else, like calling remap_pfn_range() directly from
the mmap.

>
> Thanks
> Kevin
>

--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-06 12:14           ` [Qemu-devel] " Jike Song
@ 2016-05-06 16:16             ` Kirti Wankhede
  -1 siblings, 0 replies; 154+ messages in thread
From: Kirti Wankhede @ 2016-05-06 16:16 UTC (permalink / raw)
  To: Jike Song, Tian, Kevin
  Cc: Alex Williamson, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Lv, Zhiyuan



On 5/6/2016 5:44 PM, Jike Song wrote:
> On 05/05/2016 05:06 PM, Tian, Kevin wrote:
>>> From: Kirti Wankhede
>>>
>>>  >> + * @validate_map_request:	Validate remap pfn request
>>>  >> + *				@vdev: vgpu device structure
>>>  >> + *				@virtaddr: target user address to start at
>>>  >> + *				@pfn: physical address of kernel memory, GPU
>>>  >> + *				driver can change if required.
>>>  >> + *				@size: size of map area, GPU driver can change
>>>  >> + *				the size of map area if desired.
>>>  >> + *				@prot: page protection flags for this mapping,
>>>  >> + *				GPU driver can change, if required.
>>>  >> + *				Returns integer: success (0) or error (< 0)
>>>  >
>>>  > Was not at all clear to me what this did until I got to patch 2, this
>>>  > is actually providing the fault handling for mmap'ing a vGPU mmio BAR.
>>>  > Needs a better name or better description.
>>>  >
>>>
>>> If say VMM mmap whole BAR1 of GPU, say 128MB, so fault would occur when
>>> BAR1 is tried to access then the size is calculated as:
>>> req_size = vma->vm_end - virtaddr
> Hi Kirti,
> 
> virtaddr is the faulted one, vma->vm_end the vaddr of the mmap-ed 128MB BAR1?
> 
> Would you elaborate why (vm_end - fault_addr) results the requested size? 
> 
> 

If the first access is at the start address of the mmapped region, fault_addr is
vma->vm_start. Then (vm_end - vm_start) is the size of the mmapped region.

req_size should not exceed (vm_end - vm_start).


>>> Since GPU is being shared by multiple vGPUs, GPU driver might not remap
>>> whole BAR1 for only one vGPU device, so would prefer, say map one page
>>> at a time. GPU driver returns PAGE_SIZE. This is used by
>>> remap_pfn_range(). Now on next access to BAR1 other than that page, we
>>> will again get a fault().
>>> As the name says this call is to validate from GPU driver for the size
>>> and prot of map area. GPU driver can change size and prot for this map area.
> 
> If I understand correctly, you are trying to share a physical BAR among
> multiple vGPUs, by mapping a single pfn each time, when fault happens?
> 

Yes.

>>
>> Currently we don't require such interface for Intel vGPU. Need to think about
>> its rationale carefully (still not clear to me). Jike, do you have any thought on
>> this?
> 
> We need the mmap method of vgpu_device to be implemented, but I was
> expecting something else, like calling remap_pfn_range() directly from
> the mmap.
>

Calling remap_pfn_range directly from mmap means you would like to remap
pfns for the whole BAR1 during mmap, right?

In that case, don't set validate_map_request(), and access the start of the
mmapped address, so that on the first access it will do remap_pfn_range() for
(vm_end - vm_start).
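
For the opposite, on-demand case, a rough sketch of the callback (the
signature follows the fault handler quoted earlier in this thread;
vgpu_alloc_backing_pfn() is a made-up vendor helper, not part of the patch):

static int example_validate_map_request(struct vgpu_device *vdev, u64 virtaddr,
					unsigned long *pfn, unsigned long *size,
					pgprot_t *prot)
{
	unsigned long backing_pfn;

	/* hypothetical: allocate/locate this vGPU's backing page in BAR1 */
	backing_pfn = vgpu_alloc_backing_pfn(vdev, *pfn);
	if (!backing_pfn)
		return -ENOMEM;

	*pfn  = backing_pfn;	/* redirect to this vGPU's slice of the BAR */
	*size = PAGE_SIZE;	/* commit one page per fault, as described above */
	return 0;		/* *prot left as the caller passed it */
}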

Thanks,
Kirti


>>
>> Thanks
>> Kevin
>>
> 
> --
> Thanks,
> Jike
> 

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-06 16:16             ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-09 12:12               ` Jike Song
  -1 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-09 12:12 UTC (permalink / raw)
  To: Kirti Wankhede
  Cc: Tian, Kevin, Alex Williamson, pbonzini, kraxel, cjia, qemu-devel,
	kvm, Ruan, Shuai, Lv, Zhiyuan

On 05/07/2016 12:16 AM, Kirti Wankhede wrote:
> 
> 
> On 5/6/2016 5:44 PM, Jike Song wrote:
>> On 05/05/2016 05:06 PM, Tian, Kevin wrote:
>>>> From: Kirti Wankhede
>>>>
>>>>  >> + * @validate_map_request:	Validate remap pfn request
>>>>  >> + *				@vdev: vgpu device structure
>>>>  >> + *				@virtaddr: target user address to start at
>>>>  >> + *				@pfn: physical address of kernel memory, GPU
>>>>  >> + *				driver can change if required.
>>>>  >> + *				@size: size of map area, GPU driver can change
>>>>  >> + *				the size of map area if desired.
>>>>  >> + *				@prot: page protection flags for this mapping,
>>>>  >> + *				GPU driver can change, if required.
>>>>  >> + *				Returns integer: success (0) or error (< 0)
>>>>  >
>>>>  > Was not at all clear to me what this did until I got to patch 2, this
>>>>  > is actually providing the fault handling for mmap'ing a vGPU mmio BAR.
>>>>  > Needs a better name or better description.
>>>>  >
>>>>
>>>> If say VMM mmap whole BAR1 of GPU, say 128MB, so fault would occur when
>>>> BAR1 is tried to access then the size is calculated as:
>>>> req_size = vma->vm_end - virtaddr
>> Hi Kirti,
>>
>> virtaddr is the faulted one, vma->vm_end the vaddr of the mmap-ed 128MB BAR1?
>>
>> Would you elaborate why (vm_end - fault_addr) results the requested size? 
>>
>>
> 
> If first access is at start address of mmaped address, fault_addr is
> vma->vm_start. Then (vm_end - vm_start) is the size mmapped region.
> 
> req_size should not exceed (vm_end - vm_start).
> 

[Thanks for the kind explanation, I spent some time to dig & recall the details]


So this consists of two checks:

	1) vm_end >= vm_start
	2) fault_addr >= vm_start && fault_addr <= vm_end
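
In fault-handler terms that amounts to roughly (sketch only; vm_end >=
vm_start already holds for any valid vma):

	/* the faulting address must lie inside the mmapped region */
	if (virtaddr < vma->vm_start || virtaddr >= vma->vm_end)
		return -EINVAL;
	req_size = vma->vm_end - virtaddr;	/* never exceeds vm_end - vm_start */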

>>>> Since GPU is being shared by multiple vGPUs, GPU driver might not remap
>>>> whole BAR1 for only one vGPU device, so would prefer, say map one page
>>>> at a time. GPU driver returns PAGE_SIZE. This is used by
>>>> remap_pfn_range(). Now on next access to BAR1 other than that page, we
>>>> will again get a fault().
>>>> As the name says this call is to validate from GPU driver for the size
>>>> and prot of map area. GPU driver can change size and prot for this map area.
>>
>> If I understand correctly, you are trying to share a physical BAR among
>> multiple vGPUs, by mapping a single pfn each time, when fault happens?
>>
> 
> Yes.
> 

Thanks.

For a vma with vm_ops, proceeding only one pfn at a time, can
we replace remap_pfn_range with vm_insert_pfn? I had a quick check on the
kernel repo; it seems that remap_pfn_range is only called from fops.mmap,
not from vma->vm_ops.fault.
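
A quick sketch of what that alternative could look like, assuming the same
fault() entry point as in the posted patch and that the vma was set up as a
pfn mapping at mmap time; error handling is abbreviated:

static int example_vgpu_mmio_fault(struct vm_area_struct *vma,
				   struct vm_fault *vmf)
{
	unsigned long vaddr = (unsigned long)vmf->virtual_address;
	unsigned long pfn = vma->vm_pgoff +
			    ((vaddr - vma->vm_start) >> PAGE_SHIFT);
	int ret;

	/* a vendor hook could still validate/translate pfn here */
	ret = vm_insert_pfn(vma, vaddr, pfn);	/* inserts exactly one pfn */
	if (ret == 0 || ret == -EBUSY)		/* -EBUSY: raced, already mapped */
		return VM_FAULT_NOPAGE;
	return (ret == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
}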

>>>
>>> Currently we don't require such interface for Intel vGPU. Need to think about
>>> its rationale carefully (still not clear to me). Jike, do you have any thought on
>>> this?
>>
>> We need the mmap method of vgpu_device to be implemented, but I was
>> expecting something else, like calling remap_pfn_range() directly from
>> the mmap.
>>
> 
> Calling remap_pfn_range directly from mmap means you would like to remap
> pfn for whole BAR1 during mmap, right?
> 
> In that case, don't set validate_map_request() and access start of mmap
> address, so that on first access it will do remap_pfn_range() for
> (vm_end - vm_start).

No. I'd like QEMU to be aware that only a *portion* of the physical BAR1
is available for the vGPU, like:

	pGPU	: 1GB size BAR1
	vGPU	: 128MB size BAR1

QEMU has the information of the available size for a particular vGPU and
calls mmap() with that.

I'd say that your implementation is nice and flexible, but in order to
ensure whatever level of resource QoS, you have to account for it in the
device-model (where validate_map_request is implemented), right?

How about making QEMU aware that only a portion of the MMIO is available?
I would appreciate hearing your opinion on this. Thanks!
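
One existing way to express that is the sparse mmap region capability; as a
sketch (using what I believe are the uapi structures from linux/vfio.h), the
region info for BAR1 could advertise a single mmap-able window:

	struct vfio_region_sparse_mmap_area area = {
		.offset = 0,			/* start of the vGPU's window */
		.size   = 128 * 1024 * 1024,	/* 128MB visible to this vGPU */
	};

QEMU would then mmap() only the advertised area and trap accesses to the rest.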


> Thanks,
> Kirti
>

--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-05  9:27         ` [Qemu-devel] " Tian, Kevin
@ 2016-05-10  7:52           ` Jike Song
  -1 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-10  7:52 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Ruan, Shuai, Neo Jia, kvm, qemu-devel, Kirti Wankhede,
	Alex Williamson, kraxel, pbonzini, Lv, Zhiyuan

On 05/05/2016 05:27 PM, Tian, Kevin wrote:
>> From: Song, Jike
>>
>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
>> hardware. It just, as you said in another mail, "rather than
>> programming them into an IOMMU for a device, it simply stores the
>> translations for use by later requests".
>>
>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
>> Otherwise, if IOMMU is present, the gfx driver eventually programs
>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
>> translations without any knowledge about hardware IOMMU, how is the
>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
>> by the IOMMU backend here)?
>>
>> If things go as guessed above, as vfio_pin_pages() indicates, it
>> pin & translate vaddr to PFN, then it will be very difficult for the
>> device model to figure out:
>>
>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
>> 	2, for which page to call dma_unmap_page?
>>
>> --
> 
> We have to support both w/ iommu and w/o iommu case, since
> that fact is out of GPU driver control. A simple way is to use
> dma_map_page which internally will cope with w/ and w/o iommu
> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> Then in this file we only need to cache GPA to whatever dmadr_t
> returned by dma_map_page.
> 

Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
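
For what it's worth, the dma_map_page() approach quoted above would look
roughly like this (a sketch under v4.6-era APIs; vgpu_cache_insert() is a
made-up helper for the GPA -> dma_addr_t cache):

static int vgpu_dma_map_gpa(struct device *dev, unsigned long gpa,
                            unsigned long vaddr)
{
        struct page *page;
        dma_addr_t daddr;
        int ret;

        ret = get_user_pages_fast(vaddr, 1, 1, &page);  /* pin backing page */
        if (ret != 1)
                return ret < 0 ? ret : -EFAULT;

        /* returns HPA w/o an IOMMU, IOVA w/ an IOMMU */
        daddr = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
        if (dma_mapping_error(dev, daddr)) {
                put_page(page);
                return -EFAULT;
        }

        return vgpu_cache_insert(gpa, daddr, page);     /* cache GPA -> dma_addr_t */
}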

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-10  7:52           ` [Qemu-devel] " Jike Song
@ 2016-05-10 16:02             ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-10 16:02 UTC (permalink / raw)
  To: Jike Song
  Cc: Tian, Kevin, Alex Williamson, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> >> From: Song, Jike
> >>
> >> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> >> hardware. It just, as you said in another mail, "rather than
> >> programming them into an IOMMU for a device, it simply stores the
> >> translations for use by later requests".
> >>
> >> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> >> Otherwise, if IOMMU is present, the gfx driver eventually programs
> >> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> >> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> >> translations without any knowledge about hardware IOMMU, how is the
> >> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> >> by the IOMMU backend here)?
> >>
> >> If things go as guessed above, as vfio_pin_pages() indicates, it
> >> pin & translate vaddr to PFN, then it will be very difficult for the
> >> device model to figure out:
> >>
> >> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> >> 	2, for which page to call dma_unmap_page?
> >>
> >> --
> > 
> > We have to support both w/ iommu and w/o iommu case, since
> > that fact is out of GPU driver control. A simple way is to use
> > dma_map_page which internally will cope with w/ and w/o iommu
> > case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > Then in this file we only need to cache GPA to whatever dmadr_t
> > returned by dma_map_page.
> > 
> 
> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?

Hi Jike,

With mediated passthrough you can still use a hardware IOMMU, but more
importantly that part is orthogonal to what we are discussing here: we will
only cache the mapping between <gfn (iova if the guest has an iommu), (qemu) va>.
Once the pages have been pinned later with the help of the above info, you can
map them into the proper iommu domain if the system is configured that way.
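
As a rough illustration of that caching (types invented here, not taken from
the patches), the IOMMU backend only needs to remember something like:

struct vgpu_dma_entry {
        struct rb_node  node;
        unsigned long   gfn;    /* iova, if the guest runs with a vIOMMU */
        unsigned long   vaddr;  /* QEMU virtual address backing that gfn */
        unsigned long   pfn;    /* filled in only once the page is pinned */
        bool            pinned;
};

The pfn gets filled in only when the vendor driver actually asks for the pin,
and the pinned page can then be mapped into whatever IOMMU domain the host has
configured.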

Thanks,
Neo

> 
> --
> Thanks,
> Jike
> 

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-05 12:57               ` [Qemu-devel] " Kirti Wankhede
@ 2016-05-11  6:37                 ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-11  6:37 UTC (permalink / raw)
  To: Kirti Wankhede, Alex Williamson
  Cc: pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan, Shuai, Song, Jike,
	Lv, Zhiyuan

> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> Sent: Thursday, May 05, 2016 8:57 PM
> 
> 
> On 5/5/2016 5:37 PM, Tian, Kevin wrote:
> >> From: Kirti Wankhede [mailto:kwankhede@nvidia.com]
> >> Sent: Thursday, May 05, 2016 6:45 PM
> >>
> >>
> >> On 5/5/2016 2:36 PM, Tian, Kevin wrote:
> >>>> From: Kirti Wankhede
> >>>> Sent: Wednesday, May 04, 2016 9:32 PM
> >>>>
> >>>> Thanks Alex.
> >>>>
> >>>>  >> +config VGPU_VFIO
> >>>>  >> +    tristate
> >>>>  >> +    depends on VGPU
> >>>>  >> +    default n
> >>>>  >> +
> >>>>  >
> >>>>  > This is a little bit convoluted, it seems like everything added in this
> >>>>  > patch is vfio agnostic, it doesn't necessarily care what the consumer
> >>>>  > is.  That makes me think we should only be adding CONFIG_VGPU here and
> >>>>  > it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
> >>>>  > The middle config entry is also redundant to the first, just move the
> >>>>  > default line up to the first and remove the rest.
> >>>>
> >>>> CONFIG_VGPU doesn't directly depend on VFIO. CONFIG_VGPU_VFIO is
> >>>> directly dependent on VFIO. But devices created by VGPU core module need
> >>>> a driver to manage those devices. CONFIG_VGPU_VFIO is the driver which
> >>>> will manage vgpu devices. So I think CONFIG_VGPU_VFIO should be enabled
> >>>> by CONFIG_VGPU.
> >>>>
> >>>> This would look like:
> >>>> menuconfig VGPU
> >>>>      tristate "VGPU driver framework"
> >>>>      select VGPU_VFIO
> >>>>      default n
> >>>>      help
> >>>>          VGPU provides a framework to virtualize GPU without SR-IOV cap
> >>>>          See Documentation/vgpu.txt for more details.
> >>>>
> >>>>          If you don't know what do here, say N.
> >>>>
> >>>> config VGPU_VFIO
> >>>>      tristate
> >>>>      depends on VGPU
> >>>>      depends on VFIO
> >>>>      default n
> >>>>
> >>>
> >>> There could be multiple drivers operating VGPU. Why do we restrict
> >>> it to VFIO here?
> >>>
> >>
> >> VGPU_VFIO uses VFIO APIs, it depends on VFIO.
> >> I think since there is no other driver than VGPU_VFIO for VGPU devices,
> >> we should keep the default selection of VGPU_VFIO on VGPU. Maybe in future,
> >> if another driver is added to operate vGPU devices, then the default selection
> >> can be removed.
> >
> > What's your plan to support Xen here?
> >
> 
> No plans to support Xen.
> 

Intel will support both KVM and Xen based on this framework.

Also, such a hard binding between components is better avoided if this framework
is designed for multiple drivers (that's why you introduced vgpu_register_driver).

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 2/3] VFIO driver for vGPU device
  2016-05-04 17:06         ` [Qemu-devel] " Alex Williamson
@ 2016-05-11  6:45           ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-11  6:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan

> From: Alex Williamson
> Sent: Thursday, May 05, 2016 1:06 AM
> > > > +
> > > > +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
> > >
> > > So not supporting validate_map_request() means that the user can
> > > directly mmap BARs of the host GPU and as shown below, we assume a 1:1
> > > mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
> > > scenario or should this callback be required?  It's not clear to me how
> > > the vendor driver determines what this maps to, do they compare it to
> > > the physical device's own BAR addresses?
> >
> > I didn't quite understand too. Based on earlier discussion, do we need
> > something like this, or could achieve the purpose just by leveraging
> > recent sparse mmap support?
> 
> The reason for faulting in the mmio space, if I recall correctly, is to
> enable an ordering where the user driver (QEMU) can mmap regions of the
> device prior to resources being allocated on the host GPU to handle
> them.  Sparse mmap only partially handles that, it's not dynamic.  With
> this faulting mechanism, the host GPU doesn't need to commit resources
> until the mmap is actually accessed.  Thanks,
> 
> Alex

Just to double-confirm: I assume this faulting mechanism can work with
sparse mmap, right? Regardless of whether it's a full or partial region,
this faulting mechanism would commit resources only when the accessed
page has the MMAP flag set...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-10 16:02             ` [Qemu-devel] " Neo Jia
@ 2016-05-11  9:15               ` Jike Song
  -1 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-11  9:15 UTC (permalink / raw)
  To: Neo Jia
  Cc: Tian, Kevin, Alex Williamson, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On 05/11/2016 12:02 AM, Neo Jia wrote:
> On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
>> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
>>>> From: Song, Jike
>>>>
>>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
>>>> hardware. It just, as you said in another mail, "rather than
>>>> programming them into an IOMMU for a device, it simply stores the
>>>> translations for use by later requests".
>>>>
>>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
>>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
>>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
>>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
>>>> translations without any knowledge about hardware IOMMU, how is the
>>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
>>>> by the IOMMU backend here)?
>>>>
>>>> If things go as guessed above, as vfio_pin_pages() indicates, it
>>>> pin & translate vaddr to PFN, then it will be very difficult for the
>>>> device model to figure out:
>>>>
>>>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
>>>> 	2, for which page to call dma_unmap_page?
>>>>
>>>> --
>>>
>>> We have to support both w/ iommu and w/o iommu case, since
>>> that fact is out of GPU driver control. A simple way is to use
>>> dma_map_page which internally will cope with w/ and w/o iommu
>>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
>>> Then in this file we only need to cache GPA to whatever dmadr_t
>>> returned by dma_map_page.
>>>
>>
>> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
> 
> Hi Jike,
> 
> With mediated passthru, you still can use hardware iommu, but more important
> that part is actually orthogonal to what we are discussing here as we will only
> cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we 
> have pinned pages later with the help of above info, you can map it into the
> proper iommu domain if the system has configured so.
>

Hi Neo,

Technically yes, you can map a pfn into the proper IOMMU domain elsewhere,
but to find out whether a pfn was previously mapped or not, you have to
track it with another rbtree-like data structure (the IOMMU driver simply
doesn't bother with tracking). That seems to duplicate the vGPU
IOMMU backend we are discussing here.

And isn't it also semantically correct for an IOMMU backend to handle both w/
and w/o IOMMU hardware? :)


--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 2/3] VFIO driver for vGPU device
  2016-05-11  6:45           ` [Qemu-devel] " Tian, Kevin
@ 2016-05-11 20:10             ` Alex Williamson
  -1 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-11 20:10 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Kirti Wankhede, pbonzini, kraxel, cjia, qemu-devel, kvm, Ruan,
	Shuai, Song, Jike, Lv, Zhiyuan

On Wed, 11 May 2016 06:45:41 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson
> > Sent: Thursday, May 05, 2016 1:06 AM  
> > > > > +
> > > > > +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);  
> > > >
> > > > So not supporting validate_map_request() means that the user can
> > > > directly mmap BARs of the host GPU and as shown below, we assume a 1:1
> > > > mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
> > > > scenario or should this callback be required?  It's not clear to me how
> > > > the vendor driver determines what this maps to, do they compare it to
> > > > the physical device's own BAR addresses?  
> > >
> > > I didn't quite understand too. Based on earlier discussion, do we need
> > > something like this, or could achieve the purpose just by leveraging
> > > recent sparse mmap support?  
> > 
> > The reason for faulting in the mmio space, if I recall correctly, is to
> > enable an ordering where the user driver (QEMU) can mmap regions of the
> > device prior to resources being allocated on the host GPU to handle
> > them.  Sparse mmap only partially handles that, it's not dynamic.  With
> > this faulting mechanism, the host GPU doesn't need to commit resources
> > until the mmap is actually accessed.  Thanks,
> > 
> > Alex  
> 
> Just double confirm. I assume this faulting mechanism can work with
> sparse mmap, right? Regardless of whether it's a full or partial region,
> this faulting mechanism would commit resource only when accessed
> page has MMAP flag set...

Yes, the vfio sparse mmap just solves the problem that a vfio region
maps to an entire device resource, for example in the case of vfio-pci,
a PCI BAR.  It turns out that specifying mmap on a whole region doesn't
give us the granularity we need.  Sparse mmap gives us a generic way to
tell userspace which areas within a region support mmap and which
should use read/write access through the vfio device file descriptor.
The latter allows us to protect specific regions or provide further
emulation/virtualization for that sub-area.  How the mmap vma is
populated for the portions that do support mmap is an orthogonal
issue.  Thanks,

Alex
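
As a concrete illustration, inside the region_info handler a vendor driver
could advertise that only the vGPU's slice of BAR1 is mmap-able, roughly like
this (assuming the sparse-mmap capability layout merged in v4.6; field names
should be double-checked against include/uapi/linux/vfio.h):

size_t size = sizeof(struct vfio_region_info_cap_sparse_mmap) +
              sizeof(struct vfio_region_sparse_mmap_area);
struct vfio_region_info_cap_sparse_mmap *sparse;

sparse = kzalloc(size, GFP_KERNEL);
if (!sparse)
        return -ENOMEM;

sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
sparse->header.version = 1;
sparse->nr_areas = 1;
sparse->areas[0].offset = 0;                    /* mmap-able sub-area ...       */
sparse->areas[0].size = 128 * 1024 * 1024;      /* ... e.g. the vGPU BAR1 slice */
/* ... then chain 'sparse' into the region's capability list; accesses
 * outside this area go through the device fd with read/write. */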

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-11  9:15               ` [Qemu-devel] " Jike Song
@ 2016-05-11 22:06                 ` Alex Williamson
  -1 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-11 22:06 UTC (permalink / raw)
  To: Jike Song
  Cc: Neo Jia, Tian, Kevin, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Wed, 11 May 2016 17:15:15 +0800
Jike Song <jike.song@intel.com> wrote:

> On 05/11/2016 12:02 AM, Neo Jia wrote:
> > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> >>>> From: Song, Jike
> >>>>
> >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> >>>> hardware. It just, as you said in another mail, "rather than
> >>>> programming them into an IOMMU for a device, it simply stores the
> >>>> translations for use by later requests".
> >>>>
> >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> >>>> translations without any knowledge about hardware IOMMU, how is the
> >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> >>>> by the IOMMU backend here)?
> >>>>
> >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> >>>> pin & translate vaddr to PFN, then it will be very difficult for the
> >>>> device model to figure out:
> >>>>
> >>>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> >>>> 	2, for which page to call dma_unmap_page?
> >>>>
> >>>> --  
> >>>
> >>> We have to support both w/ iommu and w/o iommu case, since
> >>> that fact is out of GPU driver control. A simple way is to use
> >>> dma_map_page which internally will cope with w/ and w/o iommu
> >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> >>> Then in this file we only need to cache GPA to whatever dmadr_t
> >>> returned by dma_map_page.
> >>>  
> >>
> >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?  
> > 
> > Hi Jike,
> > 
> > With mediated passthru, you still can use hardware iommu, but more important
> > that part is actually orthogonal to what we are discussing here as we will only
> > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we 
> > have pinned pages later with the help of above info, you can map it into the
> > proper iommu domain if the system has configured so.
> >  
> 
> Hi Neo,
> 
> Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> but to find out whether a pfn was previously mapped or not, you have to
> track it with another rbtree-alike data structure (the IOMMU driver simply
> doesn't bother with tracking), that seems somehow duplicate with the vGPU
> IOMMU backend we are discussing here.
> 
> And it is also semantically correct for an IOMMU backend to handle both w/
> and w/o an IOMMU hardware? :)

A problem with the iommu doing the dma_map_page() though is for what
device does it do this?  In the mediated case the vfio infrastructure
is dealing with a software representation of a device.  For all we
know that software model could transparently migrate from one physical
GPU to another.  There may not even be a physical device backing
the mediated device.  Those are details left to the vgpu driver itself.

Perhaps one possibility would be to allow the vgpu driver to register
map and unmap callbacks.  The unmap callback might provide the
invalidation interface that we're so far missing.  The combination of
map and unmap callbacks might simplify the Intel approach of pinning the
entire VM memory space, ie. for each map callback do a translation
(pin) and dma_map_page, for each unmap do a dma_unmap_page and release
the translation.  There's still the problem of where that dma_addr_t
from the dma_map_page is stored though.  Someone would need to keep
track of iova to dma_addr_t.  The vfio iommu might be a place to do
that since we're already tracking information based on iova, possibly
in an opaque data element provided by the vgpu driver.  However, we're
going to need to take a serious look at whether an rb-tree is the right
data structure for the job.  It works well for the current type1
functionality where we typically have tens of entries.  I think the
NVIDIA model of sparse pinning the VM is pushing that up to tens of
thousands.  If Intel intends to pin the entire guest, that's
potentially tens of millions of tracked entries and I don't know that
an rb-tree is the right tool for that job.  Thanks,

Alex
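
To make the idea above concrete, the callbacks a vgpu driver registers might
look something like this; it is purely hypothetical, nothing of the sort
exists in the posted patches:

struct vgpu_dma_ops {
        /* called when userspace maps iova -> vaddr into the container;
         * the vendor driver would pin and dma_map_page() here */
        int (*map)(void *vendor_data, unsigned long iova, unsigned long vaddr,
                   size_t size, int prot, dma_addr_t *dma_addr);
        /* called on unmap: dma_unmap_page() + unpin, i.e. the missing
         * invalidation hook */
        void (*unmap)(void *vendor_data, unsigned long iova, size_t size,
                      dma_addr_t dma_addr);
};

The vfio iommu would then have to track iova -> dma_addr_t per entry, which is
exactly where the rb-tree scaling concern above comes from once a whole guest
is pinned.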

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 2/3] VFIO driver for vGPU device
  2016-05-11 20:10             ` [Qemu-devel] " Alex Williamson
@ 2016-05-12  0:59               ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-12  0:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Ruan, Shuai, Song, Jike, cjia, kvm, qemu-devel, Kirti Wankhede,
	kraxel, pbonzini, Lv, Zhiyuan

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, May 12, 2016 4:11 AM
> On Wed, 11 May 2016 06:45:41 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson
> > > Sent: Thursday, May 05, 2016 1:06 AM
> > > > > > +
> > > > > > +	ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
> > > > >
> > > > > So not supporting validate_map_request() means that the user can
> > > > > directly mmap BARs of the host GPU and as shown below, we assume a 1:1
> > > > > mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
> > > > > scenario or should this callback be required?  It's not clear to me how
> > > > > the vendor driver determines what this maps to, do they compare it to
> > > > > the physical device's own BAR addresses?
> > > >
> > > > I didn't quite understand too. Based on earlier discussion, do we need
> > > > something like this, or could achieve the purpose just by leveraging
> > > > recent sparse mmap support?
> > >
> > > The reason for faulting in the mmio space, if I recall correctly, is to
> > > enable an ordering where the user driver (QEMU) can mmap regions of the
> > > device prior to resources being allocated on the host GPU to handle
> > > them.  Sparse mmap only partially handles that, it's not dynamic.  With
> > > this faulting mechanism, the host GPU doesn't need to commit resources
> > > until the mmap is actually accessed.  Thanks,
> > >
> > > Alex
> >
> > Just double confirm. I assume this faulting mechanism can work with
> > sparse mmap, right? Regardless of whether it's a full or partial region,
> > this faulting mechanism would commit resource only when accessed
> > page has MMAP flag set...
> 
> Yes, the vfio sparse mmap just solves the problem that a vfio region
> maps to an entire device resource, for example in the case of vfio-pci,
> a PCI BAR.  It turns out that specifying mmap on a whole region doesn't
> give us the granularity we need.  Sparse mmap gives us a generic way to
> tell userspace which areas within a region support mmap and which
> should use read/write access through the vfio device file descriptor.
> The latter allows us to protect specific regions or provide further
> emulation/virtualization for that sub-area.  How the mmap vma is
> populated for the portions that do support mmap is an orthogonal
> issue.  Thanks,
> 

Exactly! Thanks for confirmation.

Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-11 22:06                 ` [Qemu-devel] " Alex Williamson
@ 2016-05-12  4:11                   ` Jike Song
  -1 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-12  4:11 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Ruan, Shuai, Jike Song, Neo Jia, kvm, Tian, Kevin, qemu-devel,
	Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan

On Thu, May 12, 2016 at 6:06 AM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Wed, 11 May 2016 17:15:15 +0800
> Jike Song <jike.song@intel.com> wrote:
>
>> On 05/11/2016 12:02 AM, Neo Jia wrote:
>> > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
>> >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
>> >>>> From: Song, Jike
>> >>>>
>> >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
>> >>>> hardware. It just, as you said in another mail, "rather than
>> >>>> programming them into an IOMMU for a device, it simply stores the
>> >>>> translations for use by later requests".
>> >>>>
>> >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
>> >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
>> >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
>> >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
>> >>>> translations without any knowledge about hardware IOMMU, how is the
>> >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
>> >>>> by the IOMMU backend here)?
>> >>>>
>> >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
>> >>>> pin & translate vaddr to PFN, then it will be very difficult for the
>> >>>> device model to figure out:
>> >>>>
>> >>>>  1, for a given GPA, how to avoid calling dma_map_page multiple times?
>> >>>>  2, for which page to call dma_unmap_page?
>> >>>>
>> >>>> --
>> >>>
>> >>> We have to support both w/ iommu and w/o iommu case, since
>> >>> that fact is out of GPU driver control. A simple way is to use
>> >>> dma_map_page which internally will cope with w/ and w/o iommu
>> >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
>> >>> Then in this file we only need to cache GPA to whatever dmadr_t
>> >>> returned by dma_map_page.
>> >>>
>> >>
>> >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
>> >
>> > Hi Jike,
>> >
>> > With mediated passthru, you still can use hardware iommu, but more important
>> > that part is actually orthogonal to what we are discussing here as we will only
>> > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
>> > have pinned pages later with the help of above info, you can map it into the
>> > proper iommu domain if the system has configured so.
>> >
>>
>> Hi Neo,
>>
>> Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
>> but to find out whether a pfn was previously mapped or not, you have to
>> track it with another rbtree-alike data structure (the IOMMU driver simply
>> doesn't bother with tracking), that seems somehow duplicate with the vGPU
>> IOMMU backend we are discussing here.
>>
>> And it is also semantically correct for an IOMMU backend to handle both w/
>> and w/o an IOMMU hardware? :)
>
> A problem with the iommu doing the dma_map_page() though is for what
> device does it do this?  In the mediated case the vfio infrastructure
> is dealing with a software representation of a device.  For all we
> know that software model could transparently migrate from one physical
> GPU to another.  There may not even be a physical device backing
> the mediated device.  Those are details left to the vgpu driver itself.
>

Great point :) Yes, I agree it's a bit intrusive to do the mapping for
a particular pdev in a vGPU IOMMU BE.

> Perhaps one possibility would be to allow the vgpu driver to register
> map and unmap callbacks.  The unmap callback might provide the
> invalidation interface that we're so far missing.  The combination of
> map and unmap callbacks might simplify the Intel approach of pinning the
> entire VM memory space, ie. for each map callback do a translation
> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> the translation.

Yes, adding map/unmap ops to the pGPU driver (I assume you are referring to
gpu_device_ops as implemented in Kirti's patch) sounds like a good idea,
satisfying both: 1) keeping the vGPU purely virtual; 2) dealing with the
Linux DMA API to achieve hardware IOMMU compatibility.

PS, this has very little to do with pinning wholly or partially. Intel KVMGT
once had the whole guest memory pinned, only because we used a spinlock,
which can't sleep at runtime. We have removed that spinlock in another of our
upstreaming efforts, not here but in the i915 driver, so probably no biggie.


> There's still the problem of where that dma_addr_t
> from the dma_map_page is stored though.  Someone would need to keep
> track of iova to dma_addr_t.  The vfio iommu might be a place to do
> that since we're already tracking information based on iova, possibly
> in an opaque data element provided by the vgpu driver.

Any reason to keep it opaque? Given that the vfio iommu is already tracking
a PFN for each iova (vaddr, in the vGPU case), it seems adding dma_addr_t as
another field is simple. But I don't have a strong opinion here; opaque
definitely works for me :)
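
Either way it is a small change to the tracking node; a sketch of the two
options (field names invented here):

struct vfio_vgpu_pfn {
        struct rb_node  node;
        unsigned long   iova;           /* gpa / vaddr key, as tracked today */
        unsigned long   pfn;
        dma_addr_t      dma_addr;       /* explicit extra field ...          */
        void            *vendor_data;   /* ... or an opaque element instead  */
};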

> However, we're
> going to need to take a serious look at whether an rb-tree is the right
> data structure for the job.  It works well for the current type1
> functionality where we typically have tens of entries.  I think the
> NVIDIA model of sparse pinning the VM is pushing that up to tens of
> thousands.  If Intel intends to pin the entire guest, that's
> potentially tens of millions of tracked entries and I don't know that
> an rb-tree is the right tool for that job.  Thanks,
>

With rbtree efficiency taken into account, that is yet another reason for us
to pin partially. Assuming that partial pinning is guaranteed, do you think an
rbtree is good enough?

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-11 22:06                 ` [Qemu-devel] " Alex Williamson
@ 2016-05-12  8:00                   ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-12  8:00 UTC (permalink / raw)
  To: Alex Williamson, Song, Jike
  Cc: Neo Jia, Kirti Wankhede, pbonzini, kraxel, qemu-devel, kvm, Ruan,
	Shuai, Lv, Zhiyuan

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, May 12, 2016 6:06 AM
> 
> On Wed, 11 May 2016 17:15:15 +0800
> Jike Song <jike.song@intel.com> wrote:
> 
> > On 05/11/2016 12:02 AM, Neo Jia wrote:
> > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> > >>>> From: Song, Jike
> > >>>>
> > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> > >>>> hardware. It just, as you said in another mail, "rather than
> > >>>> programming them into an IOMMU for a device, it simply stores the
> > >>>> translations for use by later requests".
> > >>>>
> > >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> > >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> > >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > >>>> translations without any knowledge about hardware IOMMU, how is the
> > >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> > >>>> by the IOMMU backend here)?
> > >>>>
> > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > >>>> pin & translate vaddr to PFN, then it will be very difficult for the
> > >>>> device model to figure out:
> > >>>>
> > >>>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> > >>>> 	2, for which page to call dma_unmap_page?
> > >>>>
> > >>>> --
> > >>>
> > >>> We have to support both w/ iommu and w/o iommu case, since
> > >>> that fact is out of GPU driver control. A simple way is to use
> > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > >>> returned by dma_map_page.
> > >>>
> > >>
> > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
> > >
> > > Hi Jike,
> > >
> > > With mediated passthru, you still can use hardware iommu, but more important
> > > that part is actually orthogonal to what we are discussing here as we will only
> > > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
> > > have pinned pages later with the help of above info, you can map it into the
> > > proper iommu domain if the system has configured so.
> > >
> >
> > Hi Neo,
> >
> > Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> > but to find out whether a pfn was previously mapped or not, you have to
> > track it with another rbtree-alike data structure (the IOMMU driver simply
> > doesn't bother with tracking), that seems somehow duplicate with the vGPU
> > IOMMU backend we are discussing here.
> >
> > And it is also semantically correct for an IOMMU backend to handle both w/
> > and w/o an IOMMU hardware? :)
> 
> A problem with the iommu doing the dma_map_page() though is for what
> device does it do this?  In the mediated case the vfio infrastructure
> is dealing with a software representation of a device.  For all we
> know that software model could transparently migrate from one physical
> GPU to another.  There may not even be a physical device backing
> the mediated device.  Those are details left to the vgpu driver itself.

This is a fair argument. The VFIO iommu driver simply serves user space
requests, where only the vaddr<->iova mapping (essentially gpa in the kvm
case) matters. How the iova is mapped into the real IOMMU is not VFIO's
concern.

> 
> Perhaps one possibility would be to allow the vgpu driver to register
> map and unmap callbacks.  The unmap callback might provide the
> invalidation interface that we're so far missing.  The combination of
> map and unmap callbacks might simplify the Intel approach of pinning the
> entire VM memory space, ie. for each map callback do a translation
> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> the translation.  There's still the problem of where that dma_addr_t
> from the dma_map_page is stored though.  Someone would need to keep
> track of iova to dma_addr_t.  The vfio iommu might be a place to do
> that since we're already tracking information based on iova, possibly
> in an opaque data element provided by the vgpu driver.  However, we're
> going to need to take a serious look at whether an rb-tree is the right
> data structure for the job.  It works well for the current type1
> functionality where we typically have tens of entries.  I think the
> NVIDIA model of sparse pinning the VM is pushing that up to tens of
> thousands.  If Intel intends to pin the entire guest, that's
> potentially tens of millions of tracked entries and I don't know that
> an rb-tree is the right tool for that job.  Thanks,
> 

Based on the above thoughts, I'm wondering whether the following would work
(let's use gpa for what the type1 driver currently calls iova, and reserve
iova for the address actually used in the vGPU driver; assume the 'pin-all'
scenario first, which matches the existing vfio logic):

- No change to the existing vfio_dma structure. VFIO still maintains the
gpa<->vaddr mapping, in coarse-grained regions;

- Leverage the same page accounting/pinning logic in the type1 driver, which
should be enough for 'pin-all' usage;

- Then the main divergence point for vGPU would be in vfio_unmap_unpin
and vfio_iommu_map. I'm not sure whether it's easy to fake an
iommu_domain for vGPU so that the same iommu_map/unmap can be reused.
If not, we may introduce two new map/unmap callbacks provided
specifically by the vGPU core driver, as you suggested:

        * vGPU core driver uses dma_map_page to map specified pfns:

                o When IOMMU is enabled, we'll get an iova returned that is
                  different from the pfn;
                o When IOMMU is disabled, the returned iova is the same as
                  the pfn;

        * Then the vGPU core driver just maintains its own gpa<->iova lookup
          table (e.g. called vgpu_dma)

        * Because each vfio_iommu_map invocation is about a contiguous
          region, we can expect the same number of vgpu_dma entries as are
          maintained for the vfio_dma list;

Then it's the vGPU core driver's responsibility to provide the gpa<->iova
lookup for the vendor specific GPU driver, and we don't need to worry about
tens of thousands of entries. Once we get this simple 'pin-all' model
ready, it can be further extended to support the 'pin-sparse'
scenario: we still maintain a top-level vgpu_dma list, with each entry
further linking its own sparse mapping structure. In reality I don't expect
we really need to maintain per-page translations even with sparse pinning.
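
A minimal sketch of the kind of per-region gpa<->iova bookkeeping described
above (illustrative only: the structure layout and helper below are
assumptions for this proposal, not code from the patch series, and they rely
on this proposal's 'one contiguous iova range per region' assumption):

#include <linux/rbtree.h>
#include <linux/types.h>

/*
 * Hypothetical per-region entry kept by the vGPU core driver, one per
 * vfio_iommu_map invocation, mirroring the vfio_dma list.
 */
struct vgpu_dma {
        struct rb_node  node;
        u64             gpa;    /* guest physical start of the region */
        dma_addr_t      iova;   /* from dma_map_page(); the HPA when no IOMMU */
        size_t          size;
};

/* gpa -> iova lookup offered to the vendor GPU driver (assumed helper). */
static dma_addr_t vgpu_dma_gpa_to_iova(struct rb_root *root, u64 gpa)
{
        struct rb_node *n = root->rb_node;

        while (n) {
                struct vgpu_dma *d = rb_entry(n, struct vgpu_dma, node);

                if (gpa < d->gpa)
                        n = n->rb_left;
                else if (gpa >= d->gpa + d->size)
                        n = n->rb_right;
                else
                        return d->iova + (gpa - d->gpa);
        }
        return (dma_addr_t)-1;  /* not mapped */
}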

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 1/3] vGPU Core driver
  2016-05-04  2:58       ` [Qemu-devel] " Tian, Kevin
@ 2016-05-12  8:22         ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-12  8:22 UTC (permalink / raw)
  To: 'Alex Williamson', 'Kirti Wankhede'
  Cc: Ruan, Shuai, Song, Jike, 'cjia@nvidia.com',
	'kvm@vger.kernel.org', 'qemu-devel@nongnu.org',
	'kraxel@redhat.com', 'pbonzini@redhat.com',
	Lv, Zhiyuan

Hi Kirti/Neo, any response to the comments below?

> From: Tian, Kevin
> Sent: Wednesday, May 04, 2016 10:59 AM
> 
> > From: Alex Williamson
> > Sent: Wednesday, May 04, 2016 6:44 AM
> > > +/**
> > > + * struct gpu_device_ops - Structure to be registered for each physical GPU to
> > > + * register the device to vgpu module.
> > > + *
> > > + * @owner:			The module owner.
> > > + * @dev_attr_groups:		Default attributes of the physical device.
> > > + * @vgpu_attr_groups:		Default attributes of the vGPU device.
> > > + * @vgpu_supported_config:	Called to get information about supported vgpu
> types.
> > > + *				@dev : pci device structure of physical GPU.
> > > + *				@config: should return string listing supported config
> > > + *				Returns integer: success (0) or error (< 0)
> > > + * @vgpu_create:		Called to allocate basic resouces in graphics
> 
> It's redundant to have vgpu prefixed to every op here. Same comment
> for later sysfs node.
> 
> > > + *				driver for a particular vgpu.
> > > + *				@dev: physical pci device structure on which vgpu
> > > + *				      should be created
> > > + *				@uuid: VM's uuid for which VM it is intended to
> > > + *				@instance: vgpu instance in that VM
> 
> I didn't quite get @instance here. Is it whatever vendor specific format
> to indicate a vgpu?
> 
> > > + *				@vgpu_params: extra parameters required by GPU driver.
> > > + *				Returns integer: success (0) or error (< 0)
> > > + * @vgpu_destroy:		Called to free resources in graphics driver for
> > > + *				a vgpu instance of that VM.
> > > + *				@dev: physical pci device structure to which
> > > + *				this vgpu points to.
> > > + *				@uuid: VM's uuid for which the vgpu belongs to.
> > > + *				@instance: vgpu instance in that VM
> > > + *				Returns integer: success (0) or error (< 0)
> > > + *				If VM is running and vgpu_destroy is called that
> > > + *				means the vGPU is being hotunpluged. Return error
> > > + *				if VM is running and graphics driver doesn't
> > > + *				support vgpu hotplug.
> > > + * @vgpu_start:			Called to do initiate vGPU initialization
> > > + *				process in graphics driver when VM boots before
> > > + *				qemu starts.
> > > + *				@uuid: VM's UUID which is booting.
> > > + *				Returns integer: success (0) or error (< 0)
> > > + * @vgpu_shutdown:		Called to teardown vGPU related resources for
> > > + *				the VM
> > > + *				@uuid: VM's UUID which is shutting down .
> > > + *				Returns integer: success (0) or error (< 0)
> 
> Can you give some specific example about difference between destroy
> and shutdown? Want to map it correctly into our side, e.g. whether we
> need implement both or just one of them.
> 
> Another optional op is 'stop', allowing physical device to stop scheduling
> vGPU including wait for in-flight DMA done. It would be useful to support
> VM live migration with vGPU assigned.
> 
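
For reference, a rough sketch of what the quoted gpu_device_ops could look
like with the suggested 'stop' callback added (reconstructed from the
kernel-doc quoted above; member types and the vgpu_stop name are assumptions,
not taken from the patch):

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/sysfs.h>
#include <linux/types.h>
#include <linux/uuid.h>

struct gpu_device_ops {
        struct module                   *owner;
        const struct attribute_group    **dev_attr_groups;
        const struct attribute_group    **vgpu_attr_groups;

        int (*vgpu_supported_config)(struct pci_dev *dev, char *config);
        int (*vgpu_create)(struct pci_dev *dev, uuid_le uuid,
                           u32 instance, char *vgpu_params);
        int (*vgpu_destroy)(struct pci_dev *dev, uuid_le uuid, u32 instance);
        int (*vgpu_start)(uuid_le uuid);
        int (*vgpu_shutdown)(uuid_le uuid);

        /*
         * Proposed addition: stop scheduling the vGPU and wait for its
         * in-flight DMA to complete, e.g. as a building block for live
         * migration of a VM with a vGPU assigned.
         */
        int (*vgpu_stop)(struct pci_dev *dev, uuid_le uuid, u32 instance);
};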

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-12  8:00                   ` [Qemu-devel] " Tian, Kevin
@ 2016-05-12 19:05                     ` Alex Williamson
  -1 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-12 19:05 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Song, Jike, Neo Jia, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Thu, 12 May 2016 08:00:36 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, May 12, 2016 6:06 AM
> > 
> > On Wed, 11 May 2016 17:15:15 +0800
> > Jike Song <jike.song@intel.com> wrote:
> >   
> > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > >>>> From: Song, Jike
> > > >>>>
> > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> > > >>>> hardware. It just, as you said in another mail, "rather than
> > > >>>> programming them into an IOMMU for a device, it simply stores the
> > > >>>> translations for use by later requests".
> > > >>>>
> > > >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> > > >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> > > >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> > > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > >>>> translations without any knowledge about hardware IOMMU, how is the
> > > >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> > > >>>> by the IOMMU backend here)?
> > > >>>>
> > > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > > >>>> pin & translate vaddr to PFN, then it will be very difficult for the
> > > >>>> device model to figure out:
> > > >>>>
> > > >>>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> > > >>>> 	2, for which page to call dma_unmap_page?
> > > >>>>
> > > >>>> --  
> > > >>>
> > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > >>> that fact is out of GPU driver control. A simple way is to use
> > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > >>> returned by dma_map_page.
> > > >>>  
> > > >>
> > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?  
> > > >
> > > > Hi Jike,
> > > >
> > > > With mediated passthru, you still can use hardware iommu, but more important
> > > > that part is actually orthogonal to what we are discussing here as we will only
> > > > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
> > > > have pinned pages later with the help of above info, you can map it into the
> > > > proper iommu domain if the system has configured so.
> > > >  
> > >
> > > Hi Neo,
> > >
> > > Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> > > but to find out whether a pfn was previously mapped or not, you have to
> > > track it with another rbtree-alike data structure (the IOMMU driver simply
> > > doesn't bother with tracking), that seems somehow duplicate with the vGPU
> > > IOMMU backend we are discussing here.
> > >
> > > And it is also semantically correct for an IOMMU backend to handle both w/
> > > and w/o an IOMMU hardware? :)  
> > 
> > A problem with the iommu doing the dma_map_page() though is for what
> > device does it do this?  In the mediated case the vfio infrastructure
> > is dealing with a software representation of a device.  For all we
> > know that software model could transparently migrate from one physical
> > GPU to another.  There may not even be a physical device backing
> > the mediated device.  Those are details left to the vgpu driver itself.  
> 
> This is a fair argument. VFIO iommu driver simply serves user space
> requests, where only vaddr<->iova (essentially gpa in kvm case) is
> mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> 
> > 
> > Perhaps one possibility would be to allow the vgpu driver to register
> > map and unmap callbacks.  The unmap callback might provide the
> > invalidation interface that we're so far missing.  The combination of
> > map and unmap callbacks might simplify the Intel approach of pinning the
> > entire VM memory space, ie. for each map callback do a translation
> > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > the translation.  There's still the problem of where that dma_addr_t
> > from the dma_map_page is stored though.  Someone would need to keep
> > track of iova to dma_addr_t.  The vfio iommu might be a place to do
> > that since we're already tracking information based on iova, possibly
> > in an opaque data element provided by the vgpu driver.  However, we're
> > going to need to take a serious look at whether an rb-tree is the right
> > data structure for the job.  It works well for the current type1
> > functionality where we typically have tens of entries.  I think the
> > NVIDIA model of sparse pinning the VM is pushing that up to tens of
> > thousands.  If Intel intends to pin the entire guest, that's
> > potentially tens of millions of tracked entries and I don't know that
> > an rb-tree is the right tool for that job.  Thanks,
> >   
> 
> Based on above thought I'm thinking whether below would work:
> (let's use gpa to replace existing iova in type1 driver, while using iova
> for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> which matches existing vfio logic)
> 
> - No change to existing vfio_dma structure. VFIO still maintains gpa<->vaddr
> mapping, in coarse-grained regions;
> 
> - Leverage same page accounting/pinning logic in type1 driver, which 
> should be enough for 'pin-all' usage;
> 
> - Then main divergence point for vGPU would be in vfio_unmap_unpin
> and vfio_iommu_map. I'm not sure whether it's easy to fake an 
> iommu_domain for vGPU so same iommu_map/unmap can be reused.

This seems troublesome.  Kirti's version used numerous api-only tests
to avoid these, which made the code difficult to trace.  Clearly one
option is to split out the common code so that a new mediated-type1
backend skips this, but they thought they could clean it up without
this, so we'll see what happens in the next version.

> If not, we may introduce two new map/unmap callbacks provided
> specifically by vGPU core driver, as you suggested:
> 
> 	* vGPU core driver uses dma_map_page to map specified pfns:
> 
> 		o When IOMMU is enabled, we'll get an iova returned different
> from pfn;
> 		o When IOMMU is disabled, returned iova is same as pfn;

Either way each iova needs to be stored and we have a worst case of one
iova per page of guest memory.
 
> 	* Then vGPU core driver just maintains its own gpa<->iova lookup
> table (e.g. called vgpu_dma)
> 
> 	* Because each vfio_iommu_map invocation is about a contiguous 
> region, we can expect same number of vgpu_dma entries as maintained 
> for vfio_dma list;
>
> Then it's vGPU core driver's responsibility to provide gpa<->iova
> lookup for vendor specific GPU driver. And we don't need worry about
> tens of thousands of entries. Once we get this simple 'pin-all' model
> ready, then it can be further extended to support 'pin-sparse'
> scenario. We still maintain a top-level vgpu_dma list with each entry to
> further link its own sparse mapping structure. In reality I don't expect
> we really need to maintain per-page translation even with sparse pinning.

If you're trying to equate the scale of what we need to track vs what
type1 currently tracks, they're significantly different.  Possible
things we need to track include the pfn, the iova, and possibly a
reference count or some sort of pinned page map.  In the pin-all model
we can assume that every page is pinned on map and unpinned on unmap,
so a reference count or map is unnecessary.  We can also assume that we
can always regenerate the pfn with get_user_pages() from the vaddr, so
we don't need to track that.  I don't see any way around tracking the
iova.  The iommu can't tell us this like it can with the normal type1
model because the pfn is the result of the translation, not the key for
the translation. So we're always going to have between 1 and
(size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
data structure tracking every iova.
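
As a purely illustrative sketch of the worst case described above (names and
layout are assumptions, not code from the series), a pin-all vgpu_dma entry
would end up carrying one dma_map_page() result per guest page of the region:

#include <linux/rbtree.h>
#include <linux/types.h>

/*
 * Hypothetical pin-all bookkeeping: the pfn is not stored because it can be
 * regenerated from vaddr with get_user_pages(), but every per-page iova
 * returned by dma_map_page() has to be kept somewhere -- worst case
 * (size >> PAGE_SHIFT) entries per region.
 */
struct vgpu_dma {
        struct rb_node  node;           /* keyed by the guest iova (gpa) range */
        dma_addr_t      iova;           /* guest iova (gpa) start of the region */
        unsigned long   vaddr;          /* corresponding qemu virtual address */
        size_t          size;
        dma_addr_t      *bus_addrs;     /* one dma_map_page() result per page */
};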

Sparse mapping has the same issue but of course the tree of iovas is
potentially incomplete and we need a way to determine where it's
incomplete.  A page table rooted in the vgpu_dma and indexed by the
offset from the start vaddr seems like the way to go here.  It's also
possible that some mediated device models might store the iova in the
command sent to the device and therefore be able to parse those entries
back out to unmap them without storing them separately.  This might be
how the s390 channel-io model would prefer to work.  That seems like
further validation that such tracking is going to be dependent on the
mediated driver itself and probably not something to centralize in a
mediated iommu driver.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-12  4:11                   ` [Qemu-devel] " Jike Song
@ 2016-05-12 19:49                     ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-12 19:49 UTC (permalink / raw)
  To: Jike Song
  Cc: Alex Williamson, Jike Song, Tian, Kevin, Kirti Wankhede,
	pbonzini, kraxel, qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Thu, May 12, 2016 at 12:11:00PM +0800, Jike Song wrote:
> On Thu, May 12, 2016 at 6:06 AM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> > On Wed, 11 May 2016 17:15:15 +0800
> > Jike Song <jike.song@intel.com> wrote:
> >
> >> On 05/11/2016 12:02 AM, Neo Jia wrote:
> >> > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> >> >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> >> >>>> From: Song, Jike
> >> >>>>
> >> >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> >> >>>> hardware. It just, as you said in another mail, "rather than
> >> >>>> programming them into an IOMMU for a device, it simply stores the
> >> >>>> translations for use by later requests".
> >> >>>>
> >> >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> >> >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> >> >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> >> >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> >> >>>> translations without any knowledge about hardware IOMMU, how is the
> >> >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> >> >>>> by the IOMMU backend here)?
> >> >>>>
> >> >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> >> >>>> pin & translate vaddr to PFN, then it will be very difficult for the
> >> >>>> device model to figure out:
> >> >>>>
> >> >>>>  1, for a given GPA, how to avoid calling dma_map_page multiple times?
> >> >>>>  2, for which page to call dma_unmap_page?
> >> >>>>
> >> >>>> --
> >> >>>
> >> >>> We have to support both w/ iommu and w/o iommu case, since
> >> >>> that fact is out of GPU driver control. A simple way is to use
> >> >>> dma_map_page which internally will cope with w/ and w/o iommu
> >> >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> >> >>> Then in this file we only need to cache GPA to whatever dmadr_t
> >> >>> returned by dma_map_page.
> >> >>>
> >> >>
> >> >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
> >> >
> >> > Hi Jike,
> >> >
> >> > With mediated passthru, you still can use hardware iommu, but more important
> >> > that part is actually orthogonal to what we are discussing here as we will only
> >> > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
> >> > have pinned pages later with the help of above info, you can map it into the
> >> > proper iommu domain if the system has configured so.
> >> >
> >>
> >> Hi Neo,
> >>
> >> Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> >> but to find out whether a pfn was previously mapped or not, you have to
> >> track it with another rbtree-alike data structure (the IOMMU driver simply
> >> doesn't bother with tracking), that seems somehow duplicate with the vGPU
> >> IOMMU backend we are discussing here.
> >>
> >> And it is also semantically correct for an IOMMU backend to handle both w/
> >> and w/o an IOMMU hardware? :)
> >
> > A problem with the iommu doing the dma_map_page() though is for what
> > device does it do this?  In the mediated case the vfio infrastructure
> > is dealing with a software representation of a device.  For all we
> > know that software model could transparently migrate from one physical
> > GPU to another.  There may not even be a physical device backing
> > the mediated device.  Those are details left to the vgpu driver itself.
> >
> 
> Great point :) Yes, I agree it's a bit intrusive to do the mapping for
> a particular
> pdev in an vGPU IOMMU BE.
> 
> > Perhaps one possibility would be to allow the vgpu driver to register
> > map and unmap callbacks.  The unmap callback might provide the
> > invalidation interface that we're so far missing.  The combination of
> > map and unmap callbacks might simplify the Intel approach of pinning the
> > entire VM memory space, ie. for each map callback do a translation
> > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > the translation.
> 
> Yes adding map/unmap ops in pGPU drvier (I assume you are refering to
> gpu_device_ops as
> implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
> keeping vGPU purely
> virtual; 2) dealing with the Linux DMA API to achive hardware IOMMU
> compatibility.
> 
> PS, this has very little to do with pinning wholly or partially. Intel KVMGT has
> once been had the whole guest memory pinned, only because we used a spinlock,
> which can't sleep at runtime.  We have removed that spinlock in our another
> upstreaming effort, not here but for i915 driver, so probably no biggie.
> 

OK, then you guys don't need to pin everything. The next question is whether
you can send the pinning request from your mediated driver backend to request
memory pinning, as we have demonstrated in the v3 patch with the
vfio_pin_pages and vfio_unpin_pages functions.
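
As a rough illustration of that flow (the real prototypes are the ones the
v3 patch adds to the VFIO headers; the signatures, flags and names below are
assumptions made only for this example), a mediated driver backend would hand
a list of guest page frames to the type1 backend and get host pfns back:

#include <linux/errno.h>
#include <linux/iommu.h>
#include <linux/kernel.h>
#include <linux/types.h>

/* Assumed shapes -- see the v3 patch for the actual prototypes. */
extern long vfio_pin_pages(void *iommu_data, dma_addr_t *gfn_list,
                           long npage, int prot, dma_addr_t *pfn_list);
extern long vfio_unpin_pages(void *iommu_data, dma_addr_t *pfn_list,
                             long npage);

static int vgpu_backend_pin_example(void *iommu_data)
{
        dma_addr_t gfns[2] = { 0x1000, 0x1001 };        /* guest page frames */
        dma_addr_t pfns[2];
        long ret;

        ret = vfio_pin_pages(iommu_data, gfns, ARRAY_SIZE(gfns),
                             IOMMU_READ | IOMMU_WRITE, pfns);
        if (ret < 0)
                return ret;

        /* ... program the vGPU with the returned host pfns ... */

        vfio_unpin_pages(iommu_data, pfns, ret);
        return 0;
}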

Thanks,
Neo

> 
> > There's still the problem of where that dma_addr_t
> > from the dma_map_page is stored though.  Someone would need to keep
> > track of iova to dma_addr_t.  The vfio iommu might be a place to do
> > that since we're already tracking information based on iova, possibly
> > in an opaque data element provided by the vgpu driver.
> 
> Any reason to keep it opaque? Given that vfio iommu is already tracking
> PFN for iova (vaddr as vGPU is), seems adding dma_addr_t as another field is
> simple. But I don't have a strong opinion here, opaque definitely
> works for me :)
> 
> > However, we're
> > going to need to take a serious look at whether an rb-tree is the right
> > data structure for the job.  It works well for the current type1
> > functionality where we typically have tens of entries.  I think the
> > NVIDIA model of sparse pinning the VM is pushing that up to tens of
> > thousands.  If Intel intends to pin the entire guest, that's
> > potentially tens of millions of tracked entries and I don't know that
> > an rb-tree is the right tool for that job.  Thanks,
> >
> 
> Having the rbtree efficiency considered there is yet another reason for us
> to pin partially. Assuming that partially pinning guaranteed, do you
> think rbtree
> is good enough?
> 
> --
> Thanks,
> Jike

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-12 19:05                     ` [Qemu-devel] " Alex Williamson
@ 2016-05-12 20:12                       ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-12 20:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Song, Jike, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Thu, May 12, 2016 at 01:05:52PM -0600, Alex Williamson wrote:
> On Thu, 12 May 2016 08:00:36 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, May 12, 2016 6:06 AM
> > > 
> > > On Wed, 11 May 2016 17:15:15 +0800
> > > Jike Song <jike.song@intel.com> wrote:
> > >   
> > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > > >>>> From: Song, Jike
> > > > >>>>
> > > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> > > > >>>> hardware. It just, as you said in another mail, "rather than
> > > > >>>> programming them into an IOMMU for a device, it simply stores the
> > > > >>>> translations for use by later requests".
> > > > >>>>
> > > > >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> > > > >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> > > > >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> > > > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > > >>>> translations without any knowledge about hardware IOMMU, how is the
> > > > >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> > > > >>>> by the IOMMU backend here)?
> > > > >>>>
> > > > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > > > >>>> pin & translate vaddr to PFN, then it will be very difficult for the
> > > > >>>> device model to figure out:
> > > > >>>>
> > > > >>>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> > > > >>>> 	2, for which page to call dma_unmap_page?
> > > > >>>>
> > > > >>>> --  
> > > > >>>
> > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > >>> Then in this file we only need to cache GPA to whatever dma_addr_t
> > > > >>> returned by dma_map_page.
> > > > >>>  
> > > > >>
> > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?  
> > > > >
> > > > > Hi Jike,
> > > > >
> > > > > With mediated passthru, you still can use hardware iommu, but more important
> > > > > that part is actually orthogonal to what we are discussing here as we will only
> > > > > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
> > > > > have pinned pages later with the help of above info, you can map it into the
> > > > > proper iommu domain if the system has configured so.
> > > > >  
> > > >
> > > > Hi Neo,
> > > >
> > > > Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> > > > but to find out whether a pfn was previously mapped or not, you have to
> > > > track it with another rbtree-alike data structure (the IOMMU driver simply
> > > > doesn't bother with tracking), that seems somehow duplicate with the vGPU
> > > > IOMMU backend we are discussing here.
> > > >
> > > > And it is also semantically correct for an IOMMU backend to handle both w/
> > > > and w/o an IOMMU hardware? :)  
> > > 
> > > A problem with the iommu doing the dma_map_page() though is for what
> > > device does it do this?  In the mediated case the vfio infrastructure
> > > is dealing with a software representation of a device.  For all we
> > > know that software model could transparently migrate from one physical
> > > GPU to another.  There may not even be a physical device backing
> > > the mediated device.  Those are details left to the vgpu driver itself.  
> > 
> > This is a fair argument. VFIO iommu driver simply serves user space
> > requests, where only vaddr<->iova (essentially gpa in kvm case) is
> > mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> > 
> > > 
> > > Perhaps one possibility would be to allow the vgpu driver to register
> > > map and unmap callbacks.  The unmap callback might provide the
> > > invalidation interface that we're so far missing.  The combination of
> > > map and unmap callbacks might simplify the Intel approach of pinning the
> > > entire VM memory space, ie. for each map callback do a translation
> > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > the translation.  There's still the problem of where that dma_addr_t
> > > from the dma_map_page is stored though.  Someone would need to keep
> > > track of iova to dma_addr_t.  The vfio iommu might be a place to do
> > > that since we're already tracking information based on iova, possibly
> > > in an opaque data element provided by the vgpu driver.  However, we're
> > > going to need to take a serious look at whether an rb-tree is the right
> > > data structure for the job.  It works well for the current type1
> > > functionality where we typically have tens of entries.  I think the
> > > NVIDIA model of sparse pinning the VM is pushing that up to tens of
> > > thousands.  If Intel intends to pin the entire guest, that's
> > > potentially tens of millions of tracked entries and I don't know that
> > > an rb-tree is the right tool for that job.  Thanks,
> > >   
> > 
> > Based on above thought I'm thinking whether below would work:
> > (let's use gpa to replace existing iova in type1 driver, while using iova
> > for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> > which matches existing vfio logic)
> > 
> > - No change to existing vfio_dma structure. VFIO still maintains gpa<->vaddr
> > mapping, in coarse-grained regions;
> > 
> > - Leverage same page accounting/pinning logic in type1 driver, which 
> > should be enough for 'pin-all' usage;
> > 
> > - Then main divergence point for vGPU would be in vfio_unmap_unpin
> > and vfio_iommu_map. I'm not sure whether it's easy to fake an 
> > iommu_domain for vGPU so same iommu_map/unmap can be reused.
> 
> This seems troublesome.  Kirti's version used numerous api-only tests
> to avoid these which made the code difficult to trace.  Clearly one
> option is to split out the common code so that a new mediated-type1
> backend skips this, but they thought they could clean it up without
> this, so we'll see what happens in the next version.
> 
> > If not, we may introduce two new map/unmap callbacks provided
> > specifically by vGPU core driver, as you suggested:
> > 
> > 	* vGPU core driver uses dma_map_page to map specified pfns:
> > 
> > 		o When IOMMU is enabled, we'll get an iova returned different
> > from pfn;
> > 		o When IOMMU is disabled, returned iova is same as pfn;
> 
> Either way each iova needs to be stored and we have a worst case of one
> iova per page of guest memory.
>  
> > 	* Then vGPU core driver just maintains its own gpa<->iova lookup
> > table (e.g. called vgpu_dma)
> > 
> > 	* Because each vfio_iommu_map invocation is about a contiguous 
> > region, we can expect same number of vgpu_dma entries as maintained 
> > for vfio_dma list;
> >
> > Then it's vGPU core driver's responsibility to provide gpa<->iova
> > lookup for vendor specific GPU driver. And we don't need worry about
> > tens of thousands of entries. Once we get this simple 'pin-all' model
> > ready, then it can be further extended to support 'pin-sparse'
> > scenario. We still maintain a top-level vgpu_dma list with each entry to
> > further link its own sparse mapping structure. In reality I don't expect
> > we really need to maintain per-page translation even with sparse pinning.
> 
> If you're trying to equate the scale of what we need to track vs what
> type1 currently tracks, they're significantly different.  Possible
> things we need to track include the pfn, the iova, and possibly a
> reference count or some sort of pinned page map.  In the pin-all model
> we can assume that every page is pinned on map and unpinned on unmap,
> so a reference count or map is unnecessary.  We can also assume that we
> can always regenerate the pfn with get_user_pages() from the vaddr, so
> we don't need to track that.  

Hi Alex,

Thanks for pointing this out. We will not track those in our next rev;
get_user_pages will be used on the vaddr, as you suggested, to handle the
single-VM case with both passthru and mediated devices.
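
For instance, the per-mapping bookkeeping could shrink to something like the
sketch below (names are invented for illustration; only the iova is stored per
page, while the pfn is regenerated from the vaddr with get_user_pages when
needed):

#include <linux/rbtree.h>
#include <linux/types.h>

/* Sketch only: pin-all bookkeeping with no stored pfn or refcount. */
struct example_vgpu_dma {
	struct rb_node	node;	/* keyed by gpa, one node per mapped region */
	unsigned long	gpa;	/* guest physical base of the region        */
	unsigned long	vaddr;	/* QEMU virtual base of the region          */
	size_t		size;	/* region size in bytes                     */
	dma_addr_t	*iova;	/* one dma_addr_t per PAGE_SIZE chunk       */
};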

Thanks,
Neo

> I don't see any way around tracking the
> iova.  The iommu can't tell us this like it can with the normal type1
> model because the pfn is the result of the translation, not the key for
> the translation. So we're always going to have between 1 and
> (size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
> manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> data structure tracking every iova.
> 
> Sparse mapping has the same issue but of course the tree of iovas is
> potentially incomplete and we need a way to determine where it's
> incomplete.  A page table rooted in the vgpu_dma and indexed by the
> offset from the start vaddr seems like the way to go here.  It's also
> possible that some mediated device models might store the iova in the
> command sent to the device and therefore be able to parse those entries
> back out to unmap them without storing them separately.  This might be
> how the s390 channel-io model would prefer to work.  That seems like
> further validation that such tracking is going to be dependent on the
> mediated driver itself and probably not something to centralize in a
> mediated iommu driver.  Thanks,

> 
> Alex

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-12 19:49                     ` [Qemu-devel] " Neo Jia
@ 2016-05-13  2:41                       ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-13  2:41 UTC (permalink / raw)
  To: Neo Jia, Jike Song
  Cc: Alex Williamson, Song, Jike, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Friday, May 13, 2016 3:49 AM
> 
> >
> > > Perhaps one possibility would be to allow the vgpu driver to register
> > > map and unmap callbacks.  The unmap callback might provide the
> > > invalidation interface that we're so far missing.  The combination of
> > > map and unmap callbacks might simplify the Intel approach of pinning the
> > > entire VM memory space, ie. for each map callback do a translation
> > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > the translation.
> >
> > Yes, adding map/unmap ops in the pGPU driver (I assume you are referring to
> > gpu_device_ops as
> > implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
> > keeping vGPU purely
> > virtual; 2) dealing with the Linux DMA API to achieve hardware IOMMU
> > compatibility.
> >
> > PS, this has very little to do with pinning wholly or partially. Intel KVMGT
> > once had the whole guest memory pinned, only because we used a spinlock,
> > which can't sleep at runtime.  We have removed that spinlock in another of our
> > upstreaming efforts, not here but in the i915 driver, so probably no biggie.
> >
> 
> OK, then you guys don't need to pin everything. The next question will be if you
> can send the pinning request from your mediated driver backend to request memory
> pinning like we have demonstrated in the v3 patch, function vfio_pin_pages and
> vfio_unpin_pages?
> 

Jike, can you confirm this statement? My feeling is that we don't have such logic
in our device model to figure out which pages need to be pinned on demand. So
currently pin-everything is the same requirement on both the KVM and Xen sides...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-12 19:05                     ` [Qemu-devel] " Alex Williamson
@ 2016-05-13  3:55                       ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-13  3:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Song, Jike, Neo Jia, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, May 13, 2016 3:06 AM
> 
> > >
> >
> > Based on above thought I'm thinking whether below would work:
> > (let's use gpa to replace existing iova in type1 driver, while using iova
> > for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> > which matches existing vfio logic)
> >
> > - No change to existing vfio_dma structure. VFIO still maintains gpa<->vaddr
> > mapping, in coarse-grained regions;
> >
> > - Leverage same page accounting/pinning logic in type1 driver, which
> > should be enough for 'pin-all' usage;
> >
> > - Then main divergence point for vGPU would be in vfio_unmap_unpin
> > and vfio_iommu_map. I'm not sure whether it's easy to fake an
> > iommu_domain for vGPU so same iommu_map/unmap can be reused.
> 
> This seems troublesome.  Kirti's version used numerous api-only tests
> to avoid these which made the code difficult to trace.  Clearly one
> option is to split out the common code so that a new mediated-type1
> backend skips this, but they thought they could clean it up without
> this, so we'll see what happens in the next version.
> 
> > If not, we may introduce two new map/unmap callbacks provided
> > specifically by vGPU core driver, as you suggested:
> >
> > 	* vGPU core driver uses dma_map_page to map specified pfns:
> >
> > 		o When IOMMU is enabled, we'll get an iova returned different
> > from pfn;
> > 		o When IOMMU is disabled, returned iova is same as pfn;
> 
> Either way each iova needs to be stored and we have a worst case of one
> iova per page of guest memory.
> 
> > 	* Then vGPU core driver just maintains its own gpa<->iova lookup
> > table (e.g. called vgpu_dma)
> >
> > 	* Because each vfio_iommu_map invocation is about a contiguous
> > region, we can expect same number of vgpu_dma entries as maintained
> > for vfio_dma list;
> >
> > Then it's vGPU core driver's responsibility to provide gpa<->iova
> > lookup for vendor specific GPU driver. And we don't need worry about
> > tens of thousands of entries. Once we get this simple 'pin-all' model
> > ready, then it can be further extended to support 'pin-sparse'
> > scenario. We still maintain a top-level vgpu_dma list with each entry to
> > further link its own sparse mapping structure. In reality I don't expect
> > we really need to maintain per-page translation even with sparse pinning.
> 
> If you're trying to equate the scale of what we need to track vs what
> type1 currently tracks, they're significantly different.  Possible
> things we need to track include the pfn, the iova, and possibly a
> reference count or some sort of pinned page map.  In the pin-all model
> we can assume that every page is pinned on map and unpinned on unmap,
> so a reference count or map is unnecessary.  We can also assume that we
> can always regenerate the pfn with get_user_pages() from the vaddr, so
> we don't need to track that.  I don't see any way around tracking the
> iova.  The iommu can't tell us this like it can with the normal type1
> model because the pfn is the result of the translation, not the key for
> the translation. So we're always going to have between 1 and
> (size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
> manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> data structure tracking every iova.

There is one option. We may use alloc_iova to reserve a contiguous iova
range for each vgpu_dma range and then use iommu_map/unmap to
write iommu ptes later upon a map request (so we could have the same number of
entries as vfio_dma, compared to unbounded entries when using dma_map_page).
Of course this needs to be done in the vGPU core driver, since vfio type1 only
sees a faked iommu domain. A rough sketch of this approach follows below.
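
A minimal sketch, assuming the vGPU core driver owns an iova_domain plus the
(possibly faked) iommu_domain, and that the pfns for the range have already
been pinned; vgpu_dma_map_range is an invented name and a 32-bit-addressable
iova space is assumed just for the sketch:

#include <linux/iova.h>
#include <linux/iommu.h>
#include <linux/dma-mapping.h>

/*
 * Sketch only: reserve one contiguous iova range per vgpu_dma region and map
 * the already-pinned pfns into it, so only one gpa<->iova entry is kept per
 * region.  Returns 0 on failure (0 is used as an error marker here).
 */
static dma_addr_t vgpu_dma_map_range(struct iova_domain *iovad,
				     struct iommu_domain *domain,
				     unsigned long *pfns, size_t npages)
{
	struct iova *iova;
	dma_addr_t base;
	size_t i;

	iova = alloc_iova(iovad, npages, DMA_BIT_MASK(32) >> PAGE_SHIFT, true);
	if (!iova)
		return 0;

	base = (dma_addr_t)iova->pfn_lo << PAGE_SHIFT;

	for (i = 0; i < npages; i++) {
		if (iommu_map(domain, base + (i << PAGE_SHIFT),
			      (phys_addr_t)pfns[i] << PAGE_SHIFT,
			      PAGE_SIZE, IOMMU_READ | IOMMU_WRITE)) {
			if (i)
				iommu_unmap(domain, base, i << PAGE_SHIFT);
			__free_iova(iovad, iova);
			return 0;
		}
	}
	return base;	/* the core driver stores one gpa -> base per region */
}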

> 
> Sparse mapping has the same issue but of course the tree of iovas is
> potentially incomplete and we need a way to determine where it's
> incomplete.  A page table rooted in the vgpu_dma and indexed by the
> offset from the start vaddr seems like the way to go here.  It's also
> possible that some mediated device models might store the iova in the
> command sent to the device and therefore be able to parse those entries
> back out to unmap them without storing them separately.  This might be
> how the s390 channel-io model would prefer to work.  That seems like
> further validation that such tracking is going to be dependent on the
> mediated driver itself and probably not something to centralize in a
> mediated iommu driver.  Thanks,
> 

Another, simpler way might be to allocate an array for each memory
region registered from user space. A 512MB region is 128K pages, so with a
4-byte entry per page that is a 128K*4=512KB array to track the pfn or iova
mapping corresponding to each gfn. It may consume more resources than an
rb tree when not many pages need to be pinned, but could consume less when
the rb tree grows a lot. A sketch of such a per-region array is given below.

Is such an array-based approach considered ugly in the kernel? :-)
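
A hedged illustration (vgpu_region and its helpers are invented names; it
assumes one 4-byte pfn entry per 4K page, matching the estimate above):

#include <linux/vmalloc.h>
#include <linux/slab.h>

/* Sketch only: one flat array per user-registered region, indexed by
 * (gfn - base_gfn).  0 means "not pinned/mapped yet".
 */
struct vgpu_region {
	unsigned long	base_gfn;	/* first guest pfn of the region      */
	size_t		npages;		/* region size in 4K pages            */
	u32		*pfn_map;	/* host pfn (or iova >> 12) per gfn   */
};

static struct vgpu_region *vgpu_region_alloc(unsigned long base_gfn,
					     size_t npages)
{
	struct vgpu_region *r = kzalloc(sizeof(*r), GFP_KERNEL);

	if (!r)
		return NULL;
	r->base_gfn = base_gfn;
	r->npages = npages;
	r->pfn_map = vzalloc(npages * sizeof(*r->pfn_map)); /* 512KB for 512MB */
	if (!r->pfn_map) {
		kfree(r);
		return NULL;
	}
	return r;
}

static u32 vgpu_region_lookup(struct vgpu_region *r, unsigned long gfn)
{
	if (gfn < r->base_gfn || gfn - r->base_gfn >= r->npages)
		return 0;
	return r->pfn_map[gfn - r->base_gfn];
}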

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-12 19:49                     ` [Qemu-devel] " Neo Jia
@ 2016-05-13  6:08                       ` Jike Song
  -1 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-13  6:08 UTC (permalink / raw)
  To: Neo Jia
  Cc: Ruan, Shuai, Tian, Kevin, kvm, Jike Song, Kirti Wankhede,
	qemu-devel, Alex Williamson, kraxel, pbonzini, Lv, Zhiyuan

On 05/13/2016 03:49 AM, Neo Jia wrote:
> On Thu, May 12, 2016 at 12:11:00PM +0800, Jike Song wrote:
>> On Thu, May 12, 2016 at 6:06 AM, Alex Williamson
>> <alex.williamson@redhat.com> wrote:
>>> On Wed, 11 May 2016 17:15:15 +0800
>>> Jike Song <jike.song@intel.com> wrote:
>>>
>>>> On 05/11/2016 12:02 AM, Neo Jia wrote:
>>>>> On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
>>>>>> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
>>>>>>>> From: Song, Jike
>>>>>>>>
>>>>>>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
>>>>>>>> hardware. It just, as you said in another mail, "rather than
>>>>>>>> programming them into an IOMMU for a device, it simply stores the
>>>>>>>> translations for use by later requests".
>>>>>>>>
>>>>>>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
>>>>>>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
>>>>>>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
>>>>>>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
>>>>>>>> translations without any knowledge about hardware IOMMU, how is the
>>>>>>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
>>>>>>>> by the IOMMU backend here)?
>>>>>>>>
>>>>>>>> If things go as guessed above, as vfio_pin_pages() indicates, it
>>>>>>>> pin & translate vaddr to PFN, then it will be very difficult for the
>>>>>>>> device model to figure out:
>>>>>>>>
>>>>>>>>  1, for a given GPA, how to avoid calling dma_map_page multiple times?
>>>>>>>>  2, for which page to call dma_unmap_page?
>>>>>>>>
>>>>>>>> --
>>>>>>>
>>>>>>> We have to support both w/ iommu and w/o iommu case, since
>>>>>>> that fact is out of GPU driver control. A simple way is to use
>>>>>>> dma_map_page which internally will cope with w/ and w/o iommu
>>>>>>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
>>>>>>> Then in this file we only need to cache GPA to whatever dma_addr_t
>>>>>>> returned by dma_map_page.
>>>>>>>
>>>>>>
>>>>>> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
>>>>>
>>>>> Hi Jike,
>>>>>
>>>>> With mediated passthru, you still can use hardware iommu, but more important
>>>>> that part is actually orthogonal to what we are discussing here as we will only
>>>>> cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
>>>>> have pinned pages later with the help of above info, you can map it into the
>>>>> proper iommu domain if the system has configured so.
>>>>>
>>>>
>>>> Hi Neo,
>>>>
>>>> Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
>>>> but to find out whether a pfn was previously mapped or not, you have to
>>>> track it with another rbtree-alike data structure (the IOMMU driver simply
>>>> doesn't bother with tracking), that seems somehow duplicate with the vGPU
>>>> IOMMU backend we are discussing here.
>>>>
>>>> And it is also semantically correct for an IOMMU backend to handle both w/
>>>> and w/o an IOMMU hardware? :)
>>>
>>> A problem with the iommu doing the dma_map_page() though is for what
>>> device does it do this?  In the mediated case the vfio infrastructure
>>> is dealing with a software representation of a device.  For all we
>>> know that software model could transparently migrate from one physical
>>> GPU to another.  There may not even be a physical device backing
>>> the mediated device.  Those are details left to the vgpu driver itself.
>>>
>>
>> Great point :) Yes, I agree it's a bit intrusive to do the mapping for
>> a particular
>> pdev in an vGPU IOMMU BE.
>>
>>> Perhaps one possibility would be to allow the vgpu driver to register
>>> map and unmap callbacks.  The unmap callback might provide the
>>> invalidation interface that we're so far missing.  The combination of
>>> map and unmap callbacks might simplify the Intel approach of pinning the
>>> entire VM memory space, ie. for each map callback do a translation
>>> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
>>> the translation.
>>
>> Yes, adding map/unmap ops in the pGPU driver (I assume you are referring to
>> gpu_device_ops as
>> implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
>> keeping vGPU purely
>> virtual; 2) dealing with the Linux DMA API to achieve hardware IOMMU
>> compatibility.
>>
>> PS, this has very little to do with pinning wholly or partially. Intel KVMGT
>> once had the whole guest memory pinned, only because we used a spinlock,
>> which can't sleep at runtime.  We have removed that spinlock in another of our
>> upstreaming efforts, not here but in the i915 driver, so probably no biggie.
>>
> 
> OK, then you guys don't need to pin everything.

Yes :)

> The next question will be if you
> can send the pinning request from your mediated driver backend to request memory
> pinning like we have demonstrated in the v3 patch, function vfio_pin_pages and
> vfio_unpin_pages?

Kind of yes, not exactly.

IMO the mediated driver backend cares not only about pinning, but also about the
more important translation. The vfio_pin_pages of the v3 patch does the pinning
and the translation simultaneously, whereas I do think the API is better named
'translate' instead of 'pin', as v2 did.

We probably have the same requirement from the mediated driver backend:

	a) get a GFN, when the guest tries to tell the hardware;
	b) consult the vfio iommu with that GFN[1]: will you find me a proper dma_addr?

The vfio iommu backend searches the tracking table with this GFN[1]:

	c) if an entry is found, return the dma_addr;
	d) if nothing is found, call GUP to pin the page, then dma_map_page to get the dma_addr[2], and return it;

The dma_addr is then handed to the real GPU hardware.

I can't simply say 'Yes' here, since we may consult the dma_addr for a GFN
multiple times, but only the first time do we need to pin the page.

IOW, pinning is kind of an internal action in the iommu backend (a sketch of
this lookup-or-pin flow is given below).


//Sorry for the long, maybe boring explanation.. :)


[1] GFN or vaddr, no biggie
[2] As pointed out by Alex, dma_map_page can be called elsewhere like a callback.
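
A minimal sketch of that c)/d) flow, assuming the backend keeps a lookup table
keyed by gfn (a radix tree is used here instead of the rbtree discussed, just
to keep the sketch short) and that a struct device for the physical GPU is
available; locking, teardown and the helper names are illustrative only, this
is not the v3 code, and 0 is used as an error marker:

#include <linux/radix-tree.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/dma-mapping.h>

static RADIX_TREE(vgpu_xlate_tree, GFP_KERNEL);	/* gfn -> translation entry */

struct vgpu_xlate {
	struct page	*page;
	dma_addr_t	dma_addr;
};

/* Return a dma_addr for @gfn, pinning and mapping the page only on first use. */
static dma_addr_t example_gfn_to_dma_addr(struct device *dev,
					  unsigned long vaddr,
					  unsigned long gfn)
{
	struct vgpu_xlate *x = radix_tree_lookup(&vgpu_xlate_tree, gfn);

	if (x)					/* c) already pinned and mapped */
		return x->dma_addr;

	x = kzalloc(sizeof(*x), GFP_KERNEL);
	if (!x)
		return 0;

	/* d) pin the backing page via GUP ... */
	if (get_user_pages_fast(vaddr, 1, 1, &x->page) != 1)
		goto err_free;

	/* ... and map it for the physical GPU through the DMA API. */
	x->dma_addr = dma_map_page(dev, x->page, 0, PAGE_SIZE,
				   DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev, x->dma_addr))
		goto err_unpin;

	if (radix_tree_insert(&vgpu_xlate_tree, gfn, x))
		goto err_unmap;
	return x->dma_addr;

err_unmap:
	dma_unmap_page(dev, x->dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
err_unpin:
	put_page(x->page);
err_free:
	kfree(x);
	return 0;
}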


--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  2:41                       ` [Qemu-devel] " Tian, Kevin
@ 2016-05-13  6:22                         ` Jike Song
  -1 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-13  6:22 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Neo Jia, Jike Song, Alex Williamson, Kirti Wankhede, pbonzini,
	kraxel, qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On 05/13/2016 10:41 AM, Tian, Kevin wrote:
>> From: Neo Jia [mailto:cjia@nvidia.com]
>> Sent: Friday, May 13, 2016 3:49 AM
>>
>>>
>>>> Perhaps one possibility would be to allow the vgpu driver to register
>>>> map and unmap callbacks.  The unmap callback might provide the
>>>> invalidation interface that we're so far missing.  The combination of
>>>> map and unmap callbacks might simplify the Intel approach of pinning the
>>>> entire VM memory space, ie. for each map callback do a translation
>>>> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
>>>> the translation.
>>>
>>> Yes, adding map/unmap ops in the pGPU driver (I assume you are referring to
>>> gpu_device_ops as
>>> implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
>>> keeping vGPU purely
>>> virtual; 2) dealing with the Linux DMA API to achieve hardware IOMMU
>>> compatibility.
>>>
>>> PS, this has very little to do with pinning wholly or partially. Intel KVMGT
>>> once had the whole guest memory pinned, only because we used a spinlock,
>>> which can't sleep at runtime.  We have removed that spinlock in another of our
>>> upstreaming efforts, not here but in the i915 driver, so probably no biggie.
>>>
>>
>> OK, then you guys don't need to pin everything. The next question will be if you
>> can send the pinning request from your mediated driver backend to request memory
>> pinning like we have demonstrated in the v3 patch, function vfio_pin_pages and
>> vfio_unpin_pages?
>>
> 
> Jike can you confirm this statement? My feeling is that we don't have such logic
> in our device model to figure out which pages need to be pinned on demand. So
> currently pin-everything is same requirement in both KVM and Xen side...

[Correct me in case of any neglect:)]

IMO the ultimate reason to pin a page is DMA. Accessing RAM from a GPU is
certainly a DMA operation, and the DMA engines of most platforms, IGD and NVIDIA
GPUs included, are not capable of faulting, handling the fault, and retrying.

As for the vGPU solutions that NVIDIA and Intel provide, whenever the guest sets
up mappings for the memory region it uses for GPU access, the host intercepts
that update, so it's safe to pin a page only right before the guest actually
uses it. This probably doesn't require the device model to change :)
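
Purely as an illustration of that interception point (every name below is
hypothetical; example_gfn_to_dma_addr could be a lookup-or-pin helper like the
one sketched earlier in this thread), pinning on demand would hang off the
trapped guest page-table write:

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/errno.h>

struct example_vgpu {
	struct device *dev;		/* physical GPU doing the DMA */
};

/* Hypothetical helpers assumed to exist elsewhere in the vendor driver. */
unsigned long example_gfn_to_vaddr(struct example_vgpu *vgpu, unsigned long gfn);
dma_addr_t example_gfn_to_dma_addr(struct device *dev, unsigned long vaddr,
				   unsigned long gfn);
void example_write_shadow_pte(struct example_vgpu *vgpu, unsigned long gtt_index,
			      dma_addr_t dma_addr, u64 guest_pte);

/* Hypothetical sketch: invoked when the host traps a guest write to the GPU
 * page table it shadows.  Only the page referenced by the new entry gets
 * pinned and translated, right before the guest can actually use it.
 */
static int example_on_guest_gtt_write(struct example_vgpu *vgpu,
				      unsigned long gtt_index, u64 guest_pte)
{
	unsigned long gfn = guest_pte >> PAGE_SHIFT;	/* assumed PTE layout */
	dma_addr_t dma_addr;

	dma_addr = example_gfn_to_dma_addr(vgpu->dev,
					   example_gfn_to_vaddr(vgpu, gfn),
					   gfn);
	if (!dma_addr)
		return -EFAULT;

	/* Write the host-usable address into the shadow page table entry. */
	example_write_shadow_pte(vgpu, gtt_index, dma_addr, guest_pte);
	return 0;
}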


--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  6:08                       ` [Qemu-devel] " Jike Song
@ 2016-05-13  6:41                         ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-13  6:41 UTC (permalink / raw)
  To: Jike Song
  Cc: Jike Song, Alex Williamson, Tian, Kevin, Kirti Wankhede,
	pbonzini, kraxel, qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Fri, May 13, 2016 at 02:08:36PM +0800, Jike Song wrote:
> On 05/13/2016 03:49 AM, Neo Jia wrote:
> > On Thu, May 12, 2016 at 12:11:00PM +0800, Jike Song wrote:
> >> On Thu, May 12, 2016 at 6:06 AM, Alex Williamson
> >> <alex.williamson@redhat.com> wrote:
> >>> On Wed, 11 May 2016 17:15:15 +0800
> >>> Jike Song <jike.song@intel.com> wrote:
> >>>
> >>>> On 05/11/2016 12:02 AM, Neo Jia wrote:
> >>>>> On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> >>>>>> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> >>>>>>>> From: Song, Jike
> >>>>>>>>
> >>>>>>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> >>>>>>>> hardware. It just, as you said in another mail, "rather than
> >>>>>>>> programming them into an IOMMU for a device, it simply stores the
> >>>>>>>> translations for use by later requests".
> >>>>>>>>
> >>>>>>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> >>>>>>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> >>>>>>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> >>>>>>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> >>>>>>>> translations without any knowledge about hardware IOMMU, how is the
> >>>>>>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> >>>>>>>> by the IOMMU backend here)?
> >>>>>>>>
> >>>>>>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> >>>>>>>> pin & translate vaddr to PFN, then it will be very difficult for the
> >>>>>>>> device model to figure out:
> >>>>>>>>
> >>>>>>>>  1, for a given GPA, how to avoid calling dma_map_page multiple times?
> >>>>>>>>  2, for which page to call dma_unmap_page?
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>
> >>>>>>> We have to support both w/ iommu and w/o iommu case, since
> >>>>>>> that fact is out of GPU driver control. A simple way is to use
> >>>>>>> dma_map_page which internally will cope with w/ and w/o iommu
> >>>>>>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> >>>>>>> Then in this file we only need to cache GPA to whatever dma_addr_t
> >>>>>>> returned by dma_map_page.
> >>>>>>>
> >>>>>>
> >>>>>> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
> >>>>>
> >>>>> Hi Jike,
> >>>>>
> >>>>> With mediated passthru, you still can use hardware iommu, but more important
> >>>>> that part is actually orthogonal to what we are discussing here as we will only
> >>>>> cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
> >>>>> have pinned pages later with the help of above info, you can map it into the
> >>>>> proper iommu domain if the system has configured so.
> >>>>>
> >>>>
> >>>> Hi Neo,
> >>>>
> >>>> Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> >>>> but to find out whether a pfn was previously mapped or not, you have to
> >>>> track it with another rbtree-alike data structure (the IOMMU driver simply
> >>>> doesn't bother with tracking), that seems somehow duplicate with the vGPU
> >>>> IOMMU backend we are discussing here.
> >>>>
> >>>> And it is also semantically correct for an IOMMU backend to handle both w/
> >>>> and w/o an IOMMU hardware? :)
> >>>
> >>> A problem with the iommu doing the dma_map_page() though is for what
> >>> device does it do this?  In the mediated case the vfio infrastructure
> >>> is dealing with a software representation of a device.  For all we
> >>> know that software model could transparently migrate from one physical
> >>> GPU to another.  There may not even be a physical device backing
> >>> the mediated device.  Those are details left to the vgpu driver itself.
> >>>
> >>
> >> Great point :) Yes, I agree it's a bit intrusive to do the mapping for
> >> a particular
> >> pdev in an vGPU IOMMU BE.
> >>
> >>> Perhaps one possibility would be to allow the vgpu driver to register
> >>> map and unmap callbacks.  The unmap callback might provide the
> >>> invalidation interface that we're so far missing.  The combination of
> >>> map and unmap callbacks might simplify the Intel approach of pinning the
> >>> entire VM memory space, ie. for each map callback do a translation
> >>> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> >>> the translation.
> >>
> >> Yes adding map/unmap ops in pGPU drvier (I assume you are refering to
> >> gpu_device_ops as
> >> implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
> >> keeping vGPU purely
> >> virtual; 2) dealing with the Linux DMA API to achive hardware IOMMU
> >> compatibility.
> >>
> >> PS, this has very little to do with pinning wholly or partially. Intel KVMGT has
> >> once been had the whole guest memory pinned, only because we used a spinlock,
> >> which can't sleep at runtime.  We have removed that spinlock in our another
> >> upstreaming effort, not here but for i915 driver, so probably no biggie.
> >>
> > 
> > OK, then you guys don't need to pin everything.
> 
> Yes :)
> 
> > The next question will be if you
> > can send the pinning request from your mediated driver backend to request memory
> > pinning like we have demonstrated in the v3 patch, function vfio_pin_pages and
> > vfio_unpin_pages?
> 
> Kind of yes, not exactly.
> 
> IMO the mediated driver backend cares not only about pinning, but also the more
> important translation. The vfio_pin_pages of v3 patch does the pinning and
> translation simultaneously, whereas I do think the API is better named to
> 'translate' instead of 'pin', like v2 did.

Hi Jike,

Let me explain here.

The "pin and translation" has to be done all together and the pinning here
doesn't mean installing anything into the real IOMMU hardware.

Pinning locks down the underlying pages for a given QEMU VA, which corresponds
to a guest physical address. Why do we have to do that? If we don't, the
underlying physical pages can be moved and DMA will not work properly; this is
exactly why the default IOMMU type1 driver uses get_user_pages to *pin*
memory. The translation part is easy to understand, I think.

If you want to read more, you can check the latest email from Alex about a
recent regression introduced by THP, where the underlying page for a QEMU VA
was moved by THP, so DMA broke.

https://lkml.org/lkml/2016/4/28/604

Once you have the pfn, the vendor driver can decide what to do next.
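
Just to illustrate what "decide what to do next" can look like on the vendor
side (sketch only; the vfio_pin_pages()/vfio_unpin_pages() prototypes below are
assumptions based on this thread, not the exact v3 API):

#include <linux/dma-mapping.h>
#include <linux/iommu.h>
#include <linux/mm.h>

/* Assumed from the v3 patch; the real prototypes may differ. */
extern long vfio_pin_pages(unsigned long *gfns, long npage, int prot,
                           unsigned long *pfns);
extern long vfio_unpin_pages(unsigned long *pfns, long npage);

/* Returns a bus address usable by the device, or 0 as an error marker. */
static dma_addr_t vendor_map_guest_page(struct device *dev, unsigned long gfn)
{
    unsigned long pfn;
    dma_addr_t dma;

    /* Pin the backing page and translate gfn -> host pfn. */
    if (vfio_pin_pages(&gfn, 1, IOMMU_READ | IOMMU_WRITE, &pfn) <= 0)
        return 0;

    /*
     * Handing the pfn to the DMA API yields an IOVA when a hardware
     * IOMMU is active, and (effectively) the HPA when it is not.
     */
    dma = dma_map_page(dev, pfn_to_page(pfn), 0, PAGE_SIZE,
                       DMA_BIDIRECTIONAL);
    if (dma_mapping_error(dev, dma)) {
        vfio_unpin_pages(&pfn, 1);
        return 0;
    }
    return dma;
}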

> 
> We possibly have the same requirement from the mediate driver backend:
> 
> 	a) get a GFN, when guest try to tell hardware;
> 	b) consult the vfio iommu with that GFN[1]: will you find me a proper dma_addr?

We will provide you the pfn via vfio_pin_pages, so you can map it for DMA
purposes in your i915 driver, which is what we are doing today.

> 
> The vfio iommu backend search the tracking table with this GFN[1]:
> 
> 	c) if entry found, return the dma_addr;

> 	d) if nothing found, call GUP to pin the page, and dma_map_page to get the dma_addr[2], return it;
> 
> The dma_addr will be told to real GPU hardware.
> 
> I can't simply say a 'Yes' here, since we may consult dma_addr for a GFN
> multiple times, but only for the first time we need to pin the page.

It is very important to keep things consistent from the kernel's point of view
and also not to trust the device driver. For example, it is always good to
assume the device is going to reference a page whenever it asks for that
information, and therefore to keep the reference counter going each time it
asks.

So it is the caller's responsibility to know what they are doing when calling
vfio_pin_pages; the same actually applies to get_user_pages.
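
In other words, the backend just keeps the accounting per pinned page. A
minimal sketch (names made up, not from the v3 patch):

#include <linux/atomic.h>
#include <linux/rbtree.h>
#include <linux/types.h>

/* One node per pinned host page, looked up by pfn in an rb-tree. */
struct pinned_pfn {
    struct rb_node node;
    unsigned long  pfn;
    atomic_t       ref;   /* one count per vfio_pin_pages() request */
};

/* Every pin request takes a reference, even if the page is already pinned. */
static void pinned_pfn_get(struct pinned_pfn *p)
{
    atomic_inc(&p->ref);
}

/* Returns true once the last user is gone and the page can really be unpinned. */
static bool pinned_pfn_put(struct pinned_pfn *p)
{
    return atomic_dec_and_test(&p->ref);
}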

Thanks,
Neo

> 
> IOW, pinning is kind of an internal action in the iommu backend.
> 
> 
> //Sorry for the long, maybe boring explanation.. :)
> 
> 
> [1] GFN or vaddr, no biggie
> [2] As pointed out by Alex, dma_map_page can be called elsewhere like a callback.
> 
> 
> --
> Thanks,
> Jike

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  6:22                         ` [Qemu-devel] " Jike Song
@ 2016-05-13  6:43                           ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-13  6:43 UTC (permalink / raw)
  To: Jike Song
  Cc: Tian, Kevin, Jike Song, Alex Williamson, Kirti Wankhede,
	pbonzini, kraxel, qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Fri, May 13, 2016 at 02:22:37PM +0800, Jike Song wrote:
> On 05/13/2016 10:41 AM, Tian, Kevin wrote:
> >> From: Neo Jia [mailto:cjia@nvidia.com]
> >> Sent: Friday, May 13, 2016 3:49 AM
> >>
> >>>
> >>>> Perhaps one possibility would be to allow the vgpu driver to register
> >>>> map and unmap callbacks.  The unmap callback might provide the
> >>>> invalidation interface that we're so far missing.  The combination of
> >>>> map and unmap callbacks might simplify the Intel approach of pinning the
> >>>> entire VM memory space, ie. for each map callback do a translation
> >>>> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> >>>> the translation.
> >>>
> >>> Yes adding map/unmap ops in pGPU drvier (I assume you are refering to
> >>> gpu_device_ops as
> >>> implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
> >>> keeping vGPU purely
> >>> virtual; 2) dealing with the Linux DMA API to achive hardware IOMMU
> >>> compatibility.
> >>>
> >>> PS, this has very little to do with pinning wholly or partially. Intel KVMGT has
> >>> once been had the whole guest memory pinned, only because we used a spinlock,
> >>> which can't sleep at runtime.  We have removed that spinlock in our another
> >>> upstreaming effort, not here but for i915 driver, so probably no biggie.
> >>>
> >>
> >> OK, then you guys don't need to pin everything. The next question will be if you
> >> can send the pinning request from your mediated driver backend to request memory
> >> pinning like we have demonstrated in the v3 patch, function vfio_pin_pages and
> >> vfio_unpin_pages?
> >>
> > 
> > Jike can you confirm this statement? My feeling is that we don't have such logic
> > in our device model to figure out which pages need to be pinned on demand. So
> > currently pin-everything is same requirement in both KVM and Xen side...
> 
> [Correct me in case of any neglect:)]
> 
> IMO the ultimate reason to pin a page, is for DMA. Accessing RAM from a GPU is
> certainly a DMA operation. The DMA facility of most platforms, IGD and NVIDIA
> GPU included, is not capable of faulting-handling-retrying.
> 
> As for vGPU solutions like Nvidia and Intel provide, the memory address region
> used by Guest for GPU access, whenever Guest sets the mappings, it is
> intercepted by Host, so it's safe to only pin the page before it get used by
> Guest. This probably doesn't need device model to change :)

Hi Jike,

Just out of curiosity, how does the host intercept this before it goes on the
bus?

Thanks,
Neo

> 
> 
> --
> Thanks,
> Jike
> 

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-12 19:05                     ` [Qemu-devel] " Alex Williamson
@ 2016-05-13  7:10                       ` Dong Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Dong Jia @ 2016-05-13  7:10 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Ruan, Shuai, Song, Jike, Neo Jia, kvm, qemu-devel,
	Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan, Dong Jia

On Thu, 12 May 2016 13:05:52 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Thu, 12 May 2016 08:00:36 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, May 12, 2016 6:06 AM
> > > 
> > > On Wed, 11 May 2016 17:15:15 +0800
> > > Jike Song <jike.song@intel.com> wrote:
> > >   
> > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > > >>>> From: Song, Jike
> > > > >>>>
> > > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> > > > >>>> hardware. It just, as you said in another mail, "rather than
> > > > >>>> programming them into an IOMMU for a device, it simply stores the
> > > > >>>> translations for use by later requests".
> > > > >>>>
> > > > >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> > > > >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> > > > >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> > > > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > > >>>> translations without any knowledge about hardware IOMMU, how is the
> > > > >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> > > > >>>> by the IOMMU backend here)?
> > > > >>>>
> > > > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > > > >>>> pin & translate vaddr to PFN, then it will be very difficult for the
> > > > >>>> device model to figure out:
> > > > >>>>
> > > > >>>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> > > > >>>> 	2, for which page to call dma_unmap_page?
> > > > >>>>
> > > > >>>> --  
> > > > >>>
> > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > > >>> returned by dma_map_page.
> > > > >>>  
> > > > >>
> > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?  
> > > > >
> > > > > Hi Jike,
> > > > >
> > > > > With mediated passthru, you still can use hardware iommu, but more important
> > > > > that part is actually orthogonal to what we are discussing here as we will only
> > > > > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
> > > > > have pinned pages later with the help of above info, you can map it into the
> > > > > proper iommu domain if the system has configured so.
> > > > >  
> > > >
> > > > Hi Neo,
> > > >
> > > > Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> > > > but to find out whether a pfn was previously mapped or not, you have to
> > > > track it with another rbtree-alike data structure (the IOMMU driver simply
> > > > doesn't bother with tracking), that seems somehow duplicate with the vGPU
> > > > IOMMU backend we are discussing here.
> > > >
> > > > And it is also semantically correct for an IOMMU backend to handle both w/
> > > > and w/o an IOMMU hardware? :)  
> > > 
> > > A problem with the iommu doing the dma_map_page() though is for what
> > > device does it do this?  In the mediated case the vfio infrastructure
> > > is dealing with a software representation of a device.  For all we
> > > know that software model could transparently migrate from one physical
> > > GPU to another.  There may not even be a physical device backing
> > > the mediated device.  Those are details left to the vgpu driver itself.  
> > 
> > This is a fair argument. VFIO iommu driver simply serves user space
> > requests, where only vaddr<->iova (essentially gpa in kvm case) is
> > mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> > 
> > > 
> > > Perhaps one possibility would be to allow the vgpu driver to register
> > > map and unmap callbacks.  The unmap callback might provide the
> > > invalidation interface that we're so far missing.  The combination of
> > > map and unmap callbacks might simplify the Intel approach of pinning the
> > > entire VM memory space, ie. for each map callback do a translation
> > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > the translation.  There's still the problem of where that dma_addr_t
> > > from the dma_map_page is stored though.  Someone would need to keep
> > > track of iova to dma_addr_t.  The vfio iommu might be a place to do
> > > that since we're already tracking information based on iova, possibly
> > > in an opaque data element provided by the vgpu driver.  However, we're
> > > going to need to take a serious look at whether an rb-tree is the right
> > > data structure for the job.  It works well for the current type1
> > > functionality where we typically have tens of entries.  I think the
> > > NVIDIA model of sparse pinning the VM is pushing that up to tens of
> > > thousands.  If Intel intends to pin the entire guest, that's
> > > potentially tens of millions of tracked entries and I don't know that
> > > an rb-tree is the right tool for that job.  Thanks,
> > >   
> > 
> > Based on above thought I'm thinking whether below would work:
> > (let's use gpa to replace existing iova in type1 driver, while using iova
> > for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> > which matches existing vfio logic)
> > 
> > - No change to existing vfio_dma structure. VFIO still maintains gpa<->vaddr
> > mapping, in coarse-grained regions;
> > 
> > - Leverage same page accounting/pinning logic in type1 driver, which 
> > should be enough for 'pin-all' usage;
> > 
> > - Then main divergence point for vGPU would be in vfio_unmap_unpin
> > and vfio_iommu_map. I'm not sure whether it's easy to fake an 
> > iommu_domain for vGPU so same iommu_map/unmap can be reused.
> 
> This seems troublesome.  Kirti's version used numerous api-only tests
> to avoid these which made the code difficult to trace.  Clearly one
> option is to split out the common code so that a new mediated-type1
> backend skips this, but they thought they could clean it up without
> this, so we'll see what happens in the next version.
> 
> > If not, we may introduce two new map/unmap callbacks provided
> > specifically by vGPU core driver, as you suggested:
> > 
> > 	* vGPU core driver uses dma_map_page to map specified pfns:
> > 
> > 		o When IOMMU is enabled, we'll get an iova returned different
> > from pfn;
> > 		o When IOMMU is disabled, returned iova is same as pfn;
> 
> Either way each iova needs to be stored and we have a worst case of one
> iova per page of guest memory.
> 
> > 	* Then vGPU core driver just maintains its own gpa<->iova lookup
> > table (e.g. called vgpu_dma)
> > 
> > 	* Because each vfio_iommu_map invocation is about a contiguous 
> > region, we can expect same number of vgpu_dma entries as maintained 
> > for vfio_dma list;
> >
> > Then it's vGPU core driver's responsibility to provide gpa<->iova
> > lookup for vendor specific GPU driver. And we don't need worry about
> > tens of thousands of entries. Once we get this simple 'pin-all' model
> > ready, then it can be further extended to support 'pin-sparse'
> > scenario. We still maintain a top-level vgpu_dma list with each entry to
> > further link its own sparse mapping structure. In reality I don't expect
> > we really need to maintain per-page translation even with sparse pinning.
> 
> If you're trying to equate the scale of what we need to track vs what
> type1 currently tracks, they're significantly different.  Possible
> things we need to track include the pfn, the iova, and possibly a
> reference count or some sort of pinned page map.  In the pin-all model
> we can assume that every page is pinned on map and unpinned on unmap,
> so a reference count or map is unnecessary.  We can also assume that we
> can always regenerate the pfn with get_user_pages() from the vaddr, so
> we don't need to track that.  I don't see any way around tracking the
> iova.  The iommu can't tell us this like it can with the normal type1
> model because the pfn is the result of the translation, not the key for
> the translation. So we're always going to have between 1 and
> (size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
> manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> data structure tracking every iova.
> 
> Sparse mapping has the same issue but of course the tree of iovas is
> potentially incomplete and we need a way to determine where it's
> incomplete.  A page table rooted in the vgpu_dma and indexed by the
> offset from the start vaddr seems like the way to go here.  It's also
> possible that some mediated device models might store the iova in the
> command sent to the device and therefore be able to parse those entries
> back out to unmap them without storing them separately.  This might be
> how the s390 channel-io model would prefer to work.
Dear Alex:

For the s390 channel-io model, when an I/O instruction is intercepted
and issued to the device driver for further translation, the operand of
the instruction contains iovas only. Since the iova is the key to locate an
entry in the database (r-b tree or whatever), we indeed can parse the
entries back out one by one when doing the unmap operation.
                 ^^^^^^^^^^

BTW, if the mediated-iommu backend could offer transaction-level support
for the unmap operation, I believe it would benefit the performance of
this case.

e.g.:
handler = vfio_transaction_begin();
foreach(iova in the command) {
    pfn = vfio_transaction_map(handler, iova);
    do_io(pfn);
}

/*
 * Expect to unmap all of the pfns mapped in this transaction with the
 * next statement. The mediated-iommu backend could use the handler as the
 * key to track the list of entries.
 */
vfio_transaction_unmap(handler);
vfio_transaction_end(handler);

Not sure if this could benefit the vgpu sparse mapping use case though.
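
Just to make the idea concrete, a minimal sketch of what the handler might
track (all names here are invented for illustration):

#include <linux/list.h>
#include <linux/slab.h>

struct vfio_transaction {
    struct list_head pinned;   /* every page pinned under this handler */
};

struct vfio_transaction_entry {
    struct list_head link;
    unsigned long    iova;
    unsigned long    pfn;
};

static struct vfio_transaction *vfio_transaction_begin(void)
{
    struct vfio_transaction *txn = kzalloc(sizeof(*txn), GFP_KERNEL);

    if (txn)
        INIT_LIST_HEAD(&txn->pinned);
    return txn;
}

/*
 * vfio_transaction_map() would pin/translate the iova and add an entry to
 * txn->pinned; vfio_transaction_unmap() would walk the list and unpin each
 * pfn, so the caller never has to remember the individual pages.
 */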

>  That seems like
> further validation that such tracking is going to be dependent on the
> mediated driver itself and probably not something to centralize in a
> mediated iommu driver.  Thanks,
> 
> Alex
> 



--------
Dong Jia


^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  6:41                         ` [Qemu-devel] " Neo Jia
@ 2016-05-13  7:13                           ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-13  7:13 UTC (permalink / raw)
  To: Neo Jia, Song, Jike
  Cc: Jike Song, Alex Williamson, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Friday, May 13, 2016 2:42 PM
> 
> 
> >
> > We possibly have the same requirement from the mediate driver backend:
> >
> > 	a) get a GFN, when guest try to tell hardware;
> > 	b) consult the vfio iommu with that GFN[1]: will you find me a proper dma_addr?
> 
> We will provide you the pfn via vfio_pin_pages, so you can map it for dma
> purpose in your i915 driver, which is what we are doing today.
> 

Can such a 'map' operation be consolidated in the vGPU core driver? I don't
think the Intel vGPU driver has any feature that proactively relies on the
iommu. The reason we keep talking about the iommu is just that the kernel may
enable the iommu for the physical GPU, so we need to make sure our device
model can work in such a configuration. And this requirement should apply to
all vendors, not just Intel (like you said, you are already doing it today).
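
To make the question concrete, what I have in mind is roughly the sketch below
(the function name and parent_dev parameter are invented for illustration;
vfio_pin_pages/vfio_unpin_pages are assumed from the v3 patch):

#include <linux/dma-mapping.h>
#include <linux/iommu.h>
#include <linux/mm.h>

static int vgpu_core_map_gfn(struct device *parent_dev, unsigned long gfn,
                             dma_addr_t *dma)
{
    unsigned long pfn;

    if (vfio_pin_pages(&gfn, 1, IOMMU_READ | IOMMU_WRITE, &pfn) <= 0)
        return -EFAULT;

    /* dma_map_page() copes with both the w/ and w/o iommu cases. */
    *dma = dma_map_page(parent_dev, pfn_to_page(pfn), 0, PAGE_SIZE,
                        DMA_BIDIRECTIONAL);
    if (dma_mapping_error(parent_dev, *dma)) {
        vfio_unpin_pages(&pfn, 1);
        return -EFAULT;
    }

    /* The core could also cache gfn -> *dma here for later lookups. */
    return 0;
}

The vendor driver would then only ever see a dma_addr_t, whether or not a
hardware iommu is enabled.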

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  7:10                       ` Dong Jia
@ 2016-05-13  7:24                         ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-13  7:24 UTC (permalink / raw)
  To: Dong Jia
  Cc: Alex Williamson, Tian, Kevin, Ruan, Shuai, Song, Jike, kvm,
	qemu-devel, Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan

On Fri, May 13, 2016 at 03:10:22PM +0800, Dong Jia wrote:
> On Thu, 12 May 2016 13:05:52 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Thu, 12 May 2016 08:00:36 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > 
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Thursday, May 12, 2016 6:06 AM
> > > > 
> > > > On Wed, 11 May 2016 17:15:15 +0800
> > > > Jike Song <jike.song@intel.com> wrote:
> > > >   
> > > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > > > >>>> From: Song, Jike
> > > > > >>>>
> > > > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> > > > > >>>> hardware. It just, as you said in another mail, "rather than
> > > > > >>>> programming them into an IOMMU for a device, it simply stores the
> > > > > >>>> translations for use by later requests".
> > > > > >>>>
> > > > > >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> > > > > >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> > > > > >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> > > > > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > > > >>>> translations without any knowledge about hardware IOMMU, how is the
> > > > > >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> > > > > >>>> by the IOMMU backend here)?
> > > > > >>>>
> > > > > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > > > > >>>> pin & translate vaddr to PFN, then it will be very difficult for the
> > > > > >>>> device model to figure out:
> > > > > >>>>
> > > > > >>>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> > > > > >>>> 	2, for which page to call dma_unmap_page?
> > > > > >>>>
> > > > > >>>> --  
> > > > > >>>
> > > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > > > >>> returned by dma_map_page.
> > > > > >>>  
> > > > > >>
> > > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?  
> > > > > >
> > > > > > Hi Jike,
> > > > > >
> > > > > > With mediated passthru, you still can use hardware iommu, but more important
> > > > > > that part is actually orthogonal to what we are discussing here as we will only
> > > > > > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
> > > > > > have pinned pages later with the help of above info, you can map it into the
> > > > > > proper iommu domain if the system has configured so.
> > > > > >  
> > > > >
> > > > > Hi Neo,
> > > > >
> > > > > Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> > > > > but to find out whether a pfn was previously mapped or not, you have to
> > > > > track it with another rbtree-alike data structure (the IOMMU driver simply
> > > > > doesn't bother with tracking), that seems somehow duplicate with the vGPU
> > > > > IOMMU backend we are discussing here.
> > > > >
> > > > > And it is also semantically correct for an IOMMU backend to handle both w/
> > > > > and w/o an IOMMU hardware? :)  
> > > > 
> > > > A problem with the iommu doing the dma_map_page() though is for what
> > > > device does it do this?  In the mediated case the vfio infrastructure
> > > > is dealing with a software representation of a device.  For all we
> > > > know that software model could transparently migrate from one physical
> > > > GPU to another.  There may not even be a physical device backing
> > > > the mediated device.  Those are details left to the vgpu driver itself.  
> > > 
> > > This is a fair argument. VFIO iommu driver simply serves user space
> > > requests, where only vaddr<->iova (essentially gpa in kvm case) is
> > > mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> > > 
> > > > 
> > > > Perhaps one possibility would be to allow the vgpu driver to register
> > > > map and unmap callbacks.  The unmap callback might provide the
> > > > invalidation interface that we're so far missing.  The combination of
> > > > map and unmap callbacks might simplify the Intel approach of pinning the
> > > > entire VM memory space, ie. for each map callback do a translation
> > > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > > the translation.  There's still the problem of where that dma_addr_t
> > > > from the dma_map_page is stored though.  Someone would need to keep
> > > > track of iova to dma_addr_t.  The vfio iommu might be a place to do
> > > > that since we're already tracking information based on iova, possibly
> > > > in an opaque data element provided by the vgpu driver.  However, we're
> > > > going to need to take a serious look at whether an rb-tree is the right
> > > > data structure for the job.  It works well for the current type1
> > > > functionality where we typically have tens of entries.  I think the
> > > > NVIDIA model of sparse pinning the VM is pushing that up to tens of
> > > > thousands.  If Intel intends to pin the entire guest, that's
> > > > potentially tens of millions of tracked entries and I don't know that
> > > > an rb-tree is the right tool for that job.  Thanks,
> > > >   
> > > 
> > > Based on above thought I'm thinking whether below would work:
> > > (let's use gpa to replace existing iova in type1 driver, while using iova
> > > for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> > > which matches existing vfio logic)
> > > 
> > > - No change to existing vfio_dma structure. VFIO still maintains gpa<->vaddr
> > > mapping, in coarse-grained regions;
> > > 
> > > - Leverage same page accounting/pinning logic in type1 driver, which 
> > > should be enough for 'pin-all' usage;
> > > 
> > > - Then main divergence point for vGPU would be in vfio_unmap_unpin
> > > and vfio_iommu_map. I'm not sure whether it's easy to fake an 
> > > iommu_domain for vGPU so same iommu_map/unmap can be reused.
> > 
> > This seems troublesome.  Kirti's version used numerous api-only tests
> > to avoid these which made the code difficult to trace.  Clearly one
> > option is to split out the common code so that a new mediated-type1
> > backend skips this, but they thought they could clean it up without
> > this, so we'll see what happens in the next version.
> > 
> > > If not, we may introduce two new map/unmap callbacks provided
> > > specifically by vGPU core driver, as you suggested:
> > > 
> > > 	* vGPU core driver uses dma_map_page to map specified pfns:
> > > 
> > > 		o When IOMMU is enabled, we'll get an iova returned different
> > > from pfn;
> > > 		o When IOMMU is disabled, returned iova is same as pfn;
> > 
> > Either way each iova needs to be stored and we have a worst case of one
> > iova per page of guest memory.
> > 
> > > 	* Then vGPU core driver just maintains its own gpa<->iova lookup
> > > table (e.g. called vgpu_dma)
> > > 
> > > 	* Because each vfio_iommu_map invocation is about a contiguous 
> > > region, we can expect same number of vgpu_dma entries as maintained 
> > > for vfio_dma list;
> > >
> > > Then it's vGPU core driver's responsibility to provide gpa<->iova
> > > lookup for vendor specific GPU driver. And we don't need worry about
> > > tens of thousands of entries. Once we get this simple 'pin-all' model
> > > ready, then it can be further extended to support 'pin-sparse'
> > > scenario. We still maintain a top-level vgpu_dma list with each entry to
> > > further link its own sparse mapping structure. In reality I don't expect
> > > we really need to maintain per-page translation even with sparse pinning.
> > 
> > If you're trying to equate the scale of what we need to track vs what
> > type1 currently tracks, they're significantly different.  Possible
> > things we need to track include the pfn, the iova, and possibly a
> > reference count or some sort of pinned page map.  In the pin-all model
> > we can assume that every page is pinned on map and unpinned on unmap,
> > so a reference count or map is unnecessary.  We can also assume that we
> > can always regenerate the pfn with get_user_pages() from the vaddr, so
> > we don't need to track that.  I don't see any way around tracking the
> > iova.  The iommu can't tell us this like it can with the normal type1
> > model because the pfn is the result of the translation, not the key for
> > the translation. So we're always going to have between 1 and
> > (size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
> > manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> > data structure tracking every iova.
> > 
> > Sparse mapping has the same issue but of course the tree of iovas is
> > potentially incomplete and we need a way to determine where it's
> > incomplete.  A page table rooted in the vgpu_dma and indexed by the
> > offset from the start vaddr seems like the way to go here.  It's also
> > possible that some mediated device models might store the iova in the
> > command sent to the device and therefore be able to parse those entries
> > back out to unmap them without storing them separately.  This might be
> > how the s390 channel-io model would prefer to work.
> Dear Alex:
> 
> For the s390 channel-io model, when an I/O instruction was intercepted
> and issued to the device driver for further translation, the operand of
> the instruction contents iovas only. Since iova is the key to locate an
> entry in the database (r-b tree or whatever), we do can parse the
> entries back out one by one when doing the unmap operation.
>                  ^^^^^^^^^^
> 
> BTW, if the mediated-iommu backend can potentially offer a transaction
> level support for the unmap operation, I believe it will benefit the
> performance for this case.
> 
> e.g.:
> handler = vfio_transaction_begin();
> foreach(iova in the command) {
>     pfn = vfio_transaction_map(handler, iova);
>     do_io(pfn);
> }

Hi Dong,

Could you please help me understand the performance benefit here?

Is the perf argument coming from searching the rbtree of the tracking data
structure, or from something else?

For example, you can do a similar thing with the following sequence from your
backend driver:

    vfio_pin_pages(gfn_list/iova_list /* in */, npages, prot, pfn_bases /* out */)
    foreach (pfn)
        do_io(pfn)
    vfio_unpin_pages(pfn_bases)
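
Fleshed out a little (sketch only; the prototypes are assumed from the v3
patch and do_io() is a placeholder for the real work), that sequence is just:

#include <linux/slab.h>

static int backend_do_batched_io(unsigned long *iovas, long npages, int prot)
{
    unsigned long *pfns;
    long pinned, i;

    pfns = kcalloc(npages, sizeof(*pfns), GFP_KERNEL);
    if (!pfns)
        return -ENOMEM;

    /* One call pins and translates the whole batch. */
    pinned = vfio_pin_pages(iovas, npages, prot, pfns);
    if (pinned <= 0) {
        kfree(pfns);
        return -EFAULT;
    }

    for (i = 0; i < pinned; i++)
        do_io(pfns[i]);

    /* One call releases the whole batch afterwards. */
    vfio_unpin_pages(pfns, pinned);
    kfree(pfns);
    return 0;
}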

Thanks,
Neo

> 
> /*
>  * Expect to unmap all of the pfns mapped in this transaction with the
>  * next statement. The mediated-iommu backend could use the handler as the
>  * key to track the list of entries.
>  */
> vfio_transaction_unmap(handler);
> vfio_transaction_end(handler);
> 
> Not sure if this could benefit the vgpu sparse mapping use case though.





> 
> >  That seems like
> > further validation that such tracking is going to be dependent on the
> > mediated driver itself and probably not something to centralize in a
> > mediated iommu driver.  Thanks,
> > 
> > Alex
> > 
> 
> 
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  6:43                           ` [Qemu-devel] " Neo Jia
@ 2016-05-13  7:30                             ` Jike Song
  -1 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-13  7:30 UTC (permalink / raw)
  To: Neo Jia
  Cc: Tian, Kevin, Jike Song, Alex Williamson, Kirti Wankhede,
	pbonzini, kraxel, qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On 05/13/2016 02:43 PM, Neo Jia wrote:
> On Fri, May 13, 2016 at 02:22:37PM +0800, Jike Song wrote:
>> On 05/13/2016 10:41 AM, Tian, Kevin wrote:
>>>> From: Neo Jia [mailto:cjia@nvidia.com] Sent: Friday, May 13,
>>>> 2016 3:49 AM
>>>> 
>>>>> 
>>>>>> Perhaps one possibility would be to allow the vgpu driver
>>>>>> to register map and unmap callbacks.  The unmap callback
>>>>>> might provide the invalidation interface that we're so far
>>>>>> missing.  The combination of map and unmap callbacks might
>>>>>> simplify the Intel approach of pinning the entire VM memory
>>>>>> space, ie. for each map callback do a translation (pin) and
>>>>>> dma_map_page, for each unmap do a dma_unmap_page and
>>>>>> release the translation.
>>>>> 
>>>>> Yes adding map/unmap ops in pGPU drvier (I assume you are
>>>>> refering to gpu_device_ops as implemented in Kirti's patch)
>>>>> sounds a good idea, satisfying both: 1) keeping vGPU purely 
>>>>> virtual; 2) dealing with the Linux DMA API to achive hardware
>>>>> IOMMU compatibility.
>>>>> 
>>>>> PS, this has very little to do with pinning wholly or
>>>>> partially. Intel KVMGT has once been had the whole guest
>>>>> memory pinned, only because we used a spinlock, which can't
>>>>> sleep at runtime.  We have removed that spinlock in our
>>>>> another upstreaming effort, not here but for i915 driver, so
>>>>> probably no biggie.
>>>>> 
>>>> 
>>>> OK, then you guys don't need to pin everything. The next
>>>> question will be if you can send the pinning request from your
>>>> mediated driver backend to request memory pinning like we have
>>>> demonstrated in the v3 patch, function vfio_pin_pages and 
>>>> vfio_unpin_pages?
>>>> 
>>> 
>>> Jike can you confirm this statement? My feeling is that we don't
>>> have such logic in our device model to figure out which pages
>>> need to be pinned on demand. So currently pin-everything is same
>>> requirement in both KVM and Xen side...
>> 
>> [Correct me in case of any neglect:)]
>> 
>> IMO the ultimate reason to pin a page, is for DMA. Accessing RAM
>> from a GPU is certainly a DMA operation. The DMA facility of most
>> platforms, IGD and NVIDIA GPU included, is not capable of
>> faulting-handling-retrying.
>> 
>> As for vGPU solutions like Nvidia and Intel provide, the memory
>> address region used by Guest for GPU access, whenever Guest sets
>> the mappings, it is intercepted by Host, so it's safe to only pin
>> the page before it get used by Guest. This probably doesn't need
>> device model to change :)
> 
> Hi Jike
> 
> Just out of curiosity, how does the host intercept this before it
> goes on the bus?
> 

Hi Neo,

[apologize if I mis-expressed myself, bad English ..]

I was talking about intercepting the setting-up of GPU page tables,
not the DMA itself.  For current Intel GPUs, the page tables are
MMIO registers or simply RAM pages, called the GTT (Graphics Translation
Table); a write to a GTT entry from the Guest is always
intercepted by the Host.

--
Thanks,
Jike
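
Put as a hedged sketch, the flow described above (intercept the guest's GTT
write, resolve and pin the guest frame, map it for DMA, then write the result
into the shadow GTT used by the hardware) might look roughly like this. struct
vgpu and the *_hypothetical() helpers are made-up stand-ins for vendor-driver
internals; only dma_map_page() and dma_mapping_error() are the real DMA API.

#include <linux/dma-mapping.h>
#include <linux/err.h>
#include <linux/mm.h>

struct vgpu {
        struct device   *gpu_dev;       /* struct device of the physical GPU */
        /* ... vendor-specific state ... */
};

/* Hypothetical vendor-driver helpers. */
extern struct page *vgpu_gfn_to_page_hypothetical(struct vgpu *vgpu,
                                                  unsigned long gfn);
extern void shadow_gtt_set_entry_hypothetical(struct vgpu *vgpu,
                                              unsigned long index,
                                              dma_addr_t addr);

/*
 * Called when a guest write to a GTT entry has been intercepted:
 * resolve the guest frame, map it for the device, and write the
 * translated address into the shadow GTT the hardware actually uses.
 */
static int shadow_gtt_write(struct vgpu *vgpu, unsigned long index,
                            unsigned long gfn)
{
        struct page *page;
        dma_addr_t daddr;

        page = vgpu_gfn_to_page_hypothetical(vgpu, gfn);   /* pin/translate */
        if (IS_ERR(page))
                return PTR_ERR(page);

        /* Behaves the same with or without a hardware IOMMU behind the GPU. */
        daddr = dma_map_page(vgpu->gpu_dev, page, 0, PAGE_SIZE,
                             DMA_BIDIRECTIONAL);
        if (dma_mapping_error(vgpu->gpu_dev, daddr))
                return -EFAULT;

        shadow_gtt_set_entry_hypothetical(vgpu, index, daddr);
        return 0;
}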


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  7:13                           ` [Qemu-devel] " Tian, Kevin
@ 2016-05-13  7:38                             ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-13  7:38 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Ruan, Shuai, Song, Jike, kvm, Jike Song, Kirti Wankhede,
	qemu-devel, Alex Williamson, kraxel, pbonzini, Lv, Zhiyuan

On Fri, May 13, 2016 at 07:13:44AM +0000, Tian, Kevin wrote:
> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Friday, May 13, 2016 2:42 PM
> > 
> > 
> > >
> > > We possibly have the same requirement from the mediate driver backend:
> > >
> > > 	a) get a GFN, when guest try to tell hardware;
> > > 	b) consult the vfio iommu with that GFN[1]: will you find me a proper dma_addr?
> > 
> > We will provide you the pfn via vfio_pin_pages, so you can map it for dma
> > purpose in your i915 driver, which is what we are doing today.
> > 
> 
> Can such 'map' operation be consolidated in vGPU core driver? I don't think 
> Intel vGPU driver has any feature proactively relying on iommu. The reason 
> why we keep talking iommu is just because the kernel may enable iommu 
> for physical GPU so we need make sure our device model can work in such
> configuration. And this requirement should apply to all vendors, not Intel
> specific (like you said you are doing it already today).

Hi Kevin,

Actually, such a requirement is already satisfied today, as all vendor drivers
should transparently work with and without a system iommu on bare metal, right?

So I don't see any new requirement here. Also, such consolidation doesn't help;
it only adds complexity to the system, as vendor drivers will not remove
their own dma_map_xxx functions since they are still required to support
non-mediated cases.

Thanks,
Neo

> 
> Thanks
> Kevin
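
For reference, the transparency being pointed at here is just the normal DMA
API contract sketched below: the same calls are used whether or not a hardware
IOMMU sits in front of the GPU, and the returned handle is an IOVA in the
former case and (typically) the bus/physical address in the latter. This is a
generic illustration, not code from the series.

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* 'dev' is the physical GPU's struct device. */
static int map_one_page(struct device *dev, struct page *page,
                        dma_addr_t *handle)
{
        *handle = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
        if (dma_mapping_error(dev, *handle))
                return -ENOMEM;

        /* IOVA with an IOMMU enabled, bus/physical address without one. */
        return 0;
}

static void unmap_one_page(struct device *dev, dma_addr_t handle)
{
        dma_unmap_page(dev, handle, PAGE_SIZE, DMA_BIDIRECTIONAL);
}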

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  7:30                             ` [Qemu-devel] " Jike Song
@ 2016-05-13  7:42                               ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-13  7:42 UTC (permalink / raw)
  To: Jike Song
  Cc: Tian, Kevin, Jike Song, Alex Williamson, Kirti Wankhede,
	pbonzini, kraxel, qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Fri, May 13, 2016 at 03:30:27PM +0800, Jike Song wrote:
> On 05/13/2016 02:43 PM, Neo Jia wrote:
> > On Fri, May 13, 2016 at 02:22:37PM +0800, Jike Song wrote:
> >> On 05/13/2016 10:41 AM, Tian, Kevin wrote:
> >>>> From: Neo Jia [mailto:cjia@nvidia.com] Sent: Friday, May 13,
> >>>> 2016 3:49 AM
> >>>> 
> >>>>> 
> >>>>>> Perhaps one possibility would be to allow the vgpu driver
> >>>>>> to register map and unmap callbacks.  The unmap callback
> >>>>>> might provide the invalidation interface that we're so far
> >>>>>> missing.  The combination of map and unmap callbacks might
> >>>>>> simplify the Intel approach of pinning the entire VM memory
> >>>>>> space, ie. for each map callback do a translation (pin) and
> >>>>>> dma_map_page, for each unmap do a dma_unmap_page and
> >>>>>> release the translation.
> >>>>> 
> >>>>> Yes adding map/unmap ops in pGPU drvier (I assume you are
> >>>>> refering to gpu_device_ops as implemented in Kirti's patch)
> >>>>> sounds a good idea, satisfying both: 1) keeping vGPU purely 
> >>>>> virtual; 2) dealing with the Linux DMA API to achive hardware
> >>>>> IOMMU compatibility.
> >>>>> 
> >>>>> PS, this has very little to do with pinning wholly or
> >>>>> partially. Intel KVMGT has once been had the whole guest
> >>>>> memory pinned, only because we used a spinlock, which can't
> >>>>> sleep at runtime.  We have removed that spinlock in our
> >>>>> another upstreaming effort, not here but for i915 driver, so
> >>>>> probably no biggie.
> >>>>> 
> >>>> 
> >>>> OK, then you guys don't need to pin everything. The next
> >>>> question will be if you can send the pinning request from your
> >>>> mediated driver backend to request memory pinning like we have
> >>>> demonstrated in the v3 patch, function vfio_pin_pages and 
> >>>> vfio_unpin_pages?
> >>>> 
> >>> 
> >>> Jike can you confirm this statement? My feeling is that we don't
> >>> have such logic in our device model to figure out which pages
> >>> need to be pinned on demand. So currently pin-everything is same
> >>> requirement in both KVM and Xen side...
> >> 
> >> [Correct me in case of any neglect:)]
> >> 
> >> IMO the ultimate reason to pin a page, is for DMA. Accessing RAM
> >> from a GPU is certainly a DMA operation. The DMA facility of most
> >> platforms, IGD and NVIDIA GPU included, is not capable of
> >> faulting-handling-retrying.
> >> 
> >> As for vGPU solutions like Nvidia and Intel provide, the memory
> >> address region used by Guest for GPU access, whenever Guest sets
> >> the mappings, it is intercepted by Host, so it's safe to only pin
> >> the page before it get used by Guest. This probably doesn't need
> >> device model to change :)
> > 
> > Hi Jike
> > 
> > Just out of curiosity, how does the host intercept this before it
> > goes on the bus?
> > 
> 
> Hi Neo,
> 
> [apologize if I mis-expressed myself, bad English ..]
> 
> I was talking about intercepting the setting-up of GPU page tables,
> not the DMA itself.  For current Intel GPUs, the page tables are
> MMIO registers or simply RAM pages, called the GTT (Graphics Translation
> Table); a write to a GTT entry from the Guest is always
> intercepted by the Host.

Hi Jike,

Thanks for the details. One more question: if the page tables are in guest RAM, how do you
intercept them from the host? I can see how they get intercepted when they are in an MMIO range.

Thanks,
Neo

> 
> --
> Thanks,
> Jike
> 

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  7:42                               ` [Qemu-devel] " Neo Jia
@ 2016-05-13  7:45                                 ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-13  7:45 UTC (permalink / raw)
  To: Neo Jia, Song, Jike
  Cc: Jike Song, Alex Williamson, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Friday, May 13, 2016 3:42 PM
> 
> On Fri, May 13, 2016 at 03:30:27PM +0800, Jike Song wrote:
> > On 05/13/2016 02:43 PM, Neo Jia wrote:
> > > On Fri, May 13, 2016 at 02:22:37PM +0800, Jike Song wrote:
> > >> On 05/13/2016 10:41 AM, Tian, Kevin wrote:
> > >>>> From: Neo Jia [mailto:cjia@nvidia.com] Sent: Friday, May 13,
> > >>>> 2016 3:49 AM
> > >>>>
> > >>>>>
> > >>>>>> Perhaps one possibility would be to allow the vgpu driver
> > >>>>>> to register map and unmap callbacks.  The unmap callback
> > >>>>>> might provide the invalidation interface that we're so far
> > >>>>>> missing.  The combination of map and unmap callbacks might
> > >>>>>> simplify the Intel approach of pinning the entire VM memory
> > >>>>>> space, ie. for each map callback do a translation (pin) and
> > >>>>>> dma_map_page, for each unmap do a dma_unmap_page and
> > >>>>>> release the translation.
> > >>>>>
> > >>>>> Yes adding map/unmap ops in pGPU drvier (I assume you are
> > >>>>> refering to gpu_device_ops as implemented in Kirti's patch)
> > >>>>> sounds a good idea, satisfying both: 1) keeping vGPU purely
> > >>>>> virtual; 2) dealing with the Linux DMA API to achive hardware
> > >>>>> IOMMU compatibility.
> > >>>>>
> > >>>>> PS, this has very little to do with pinning wholly or
> > >>>>> partially. Intel KVMGT has once been had the whole guest
> > >>>>> memory pinned, only because we used a spinlock, which can't
> > >>>>> sleep at runtime.  We have removed that spinlock in our
> > >>>>> another upstreaming effort, not here but for i915 driver, so
> > >>>>> probably no biggie.
> > >>>>>
> > >>>>
> > >>>> OK, then you guys don't need to pin everything. The next
> > >>>> question will be if you can send the pinning request from your
> > >>>> mediated driver backend to request memory pinning like we have
> > >>>> demonstrated in the v3 patch, function vfio_pin_pages and
> > >>>> vfio_unpin_pages?
> > >>>>
> > >>>
> > >>> Jike can you confirm this statement? My feeling is that we don't
> > >>> have such logic in our device model to figure out which pages
> > >>> need to be pinned on demand. So currently pin-everything is same
> > >>> requirement in both KVM and Xen side...
> > >>
> > >> [Correct me in case of any neglect:)]
> > >>
> > >> IMO the ultimate reason to pin a page, is for DMA. Accessing RAM
> > >> from a GPU is certainly a DMA operation. The DMA facility of most
> > >> platforms, IGD and NVIDIA GPU included, is not capable of
> > >> faulting-handling-retrying.
> > >>
> > >> As for vGPU solutions like Nvidia and Intel provide, the memory
> > >> address region used by Guest for GPU access, whenever Guest sets
> > >> the mappings, it is intercepted by Host, so it's safe to only pin
> > >> the page before it get used by Guest. This probably doesn't need
> > >> device model to change :)
> > >
> > > Hi Jike
> > >
> > > Just out of curiosity, how does the host intercept this before it
> > > goes on the bus?
> > >
> >
> > Hi Neo,
> >
> > [apologize if I mis-expressed myself, bad English ..]
> >
> > I was talking about intercepting the setting-up of GPU page tables,
> > not the DMA itself.  For current Intel GPUs, the page tables are
> > MMIO registers or simply RAM pages, called the GTT (Graphics Translation
> > Table); a write to a GTT entry from the Guest is always
> > intercepted by the Host.
> 
> Hi Jike,
> 
> Thanks for the details. One more question: if the page tables are in guest RAM, how do you
> intercept them from the host? I can see how they get intercepted when they are in an MMIO range.
> 

We use the page tracking framework, which was recently added to KVM,
to mark RAM pages as read-only so that write accesses are intercepted
and forwarded to the device model.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  7:38                             ` [Qemu-devel] " Neo Jia
@ 2016-05-13  8:02                               ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-13  8:02 UTC (permalink / raw)
  To: Neo Jia
  Cc: Song, Jike, Jike Song, Alex Williamson, Kirti Wankhede, pbonzini,
	kraxel, qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

> From: Neo Jia [mailto:cjia@nvidia.com]
> Sent: Friday, May 13, 2016 3:38 PM
> 
> On Fri, May 13, 2016 at 07:13:44AM +0000, Tian, Kevin wrote:
> > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > Sent: Friday, May 13, 2016 2:42 PM
> > >
> > >
> > > >
> > > > We possibly have the same requirement from the mediate driver backend:
> > > >
> > > > 	a) get a GFN, when guest try to tell hardware;
> > > > 	b) consult the vfio iommu with that GFN[1]: will you find me a proper dma_addr?
> > >
> > > We will provide you the pfn via vfio_pin_pages, so you can map it for dma
> > > purpose in your i915 driver, which is what we are doing today.
> > >
> >
> > Can such 'map' operation be consolidated in vGPU core driver? I don't think
> > Intel vGPU driver has any feature proactively relying on iommu. The reason
> > why we keep talking iommu is just because the kernel may enable iommu
> > for physical GPU so we need make sure our device model can work in such
> > configuration. And this requirement should apply to all vendors, not Intel
> > specific (like you said you are doing it already today).
> 
> Hi Kevin,
> 
> Actually, such a requirement is already satisfied today, as all vendor drivers
> should transparently work with and without a system iommu on bare metal, right?
> 
> So I don't see any new requirement here. Also, such consolidation doesn't help;
> it only adds complexity to the system, as vendor drivers will not remove
> their own dma_map_xxx functions since they are still required to support
> non-mediated cases.
> 

Thanks for the information, which makes it clearer where the difference is. :-)

Based on your description, it looks like you treat guest pages the same as normal
process pages, which all share the same code path when mapped as a DMA target, so
it is pointless to separate the guest page map out into the vGPU core driver. Is
this understanding correct?

On our side, so far guest pages are treated differently from normal process
pages, which is the main reason why I asked whether we could consolidate that
part. It now looks unnecessary, since it's not a common requirement after all.

One additional question though. Jike already mentioned the need to shadow the
GPU MMU (called the GTT table on the Intel side) in our device model. 'shadow'
here basically means we need to translate from the 'gfn' in a guest pte to the
'dma_addr_t' returned by dma_map_xxx. Based on the gfn->pfn translation provided
by VFIO (in your v3 driver), the gfn->dma_addr_t mapping can be constructed
accordingly in the vendor driver. So do you have a similar requirement? If yes,
do you see any value in unifying that translation structure, or would you prefer
to maintain it in the vendor driver?

Thanks
Kevin
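
For what it's worth, one way a vendor driver could keep the gfn -> dma_addr_t
shadow translation described above is a per-vGPU rb-tree keyed by gfn, sketched
below. Every name is made up for illustration, and the earlier concern in this
thread about whether an rb-tree scales to millions of entries applies here too.

#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/errno.h>
#include <linux/dma-mapping.h>

struct gfn_dma_node {
        struct rb_node  node;
        unsigned long   gfn;
        dma_addr_t      daddr;
};

/* Look up the entry previously recorded for a gfn, or NULL if none. */
static struct gfn_dma_node *gfn_dma_find(struct rb_root *root, unsigned long gfn)
{
        struct rb_node *n = root->rb_node;

        while (n) {
                struct gfn_dma_node *e = rb_entry(n, struct gfn_dma_node, node);

                if (gfn < e->gfn)
                        n = n->rb_left;
                else if (gfn > e->gfn)
                        n = n->rb_right;
                else
                        return e;
        }
        return NULL;
}

/* Record gfn -> daddr after a successful dma_map_xxx(). */
static int gfn_dma_insert(struct rb_root *root, unsigned long gfn,
                          dma_addr_t daddr)
{
        struct rb_node **p = &root->rb_node, *parent = NULL;
        struct gfn_dma_node *e;

        while (*p) {
                parent = *p;
                e = rb_entry(parent, struct gfn_dma_node, node);
                if (gfn < e->gfn)
                        p = &(*p)->rb_left;
                else if (gfn > e->gfn)
                        p = &(*p)->rb_right;
                else
                        return -EEXIST;         /* this gfn is already shadowed */
        }

        e = kzalloc(sizeof(*e), GFP_KERNEL);
        if (!e)
                return -ENOMEM;
        e->gfn = gfn;
        e->daddr = daddr;
        rb_link_node(&e->node, parent, p);
        rb_insert_color(&e->node, root);
        return 0;
}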

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  7:45                                 ` [Qemu-devel] " Tian, Kevin
@ 2016-05-13  8:31                                   ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-13  8:31 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Ruan, Shuai, Song, Jike, kvm, Jike Song, Kirti Wankhede,
	qemu-devel, Alex Williamson, kraxel, pbonzini, Lv, Zhiyuan

On Fri, May 13, 2016 at 07:45:14AM +0000, Tian, Kevin wrote:
> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Friday, May 13, 2016 3:42 PM
> > 
> > On Fri, May 13, 2016 at 03:30:27PM +0800, Jike Song wrote:
> > > On 05/13/2016 02:43 PM, Neo Jia wrote:
> > > > On Fri, May 13, 2016 at 02:22:37PM +0800, Jike Song wrote:
> > > >> On 05/13/2016 10:41 AM, Tian, Kevin wrote:
> > > >>>> From: Neo Jia [mailto:cjia@nvidia.com] Sent: Friday, May 13,
> > > >>>> 2016 3:49 AM
> > > >>>>
> > > >>>>>
> > > >>>>>> Perhaps one possibility would be to allow the vgpu driver
> > > >>>>>> to register map and unmap callbacks.  The unmap callback
> > > >>>>>> might provide the invalidation interface that we're so far
> > > >>>>>> missing.  The combination of map and unmap callbacks might
> > > >>>>>> simplify the Intel approach of pinning the entire VM memory
> > > >>>>>> space, ie. for each map callback do a translation (pin) and
> > > >>>>>> dma_map_page, for each unmap do a dma_unmap_page and
> > > >>>>>> release the translation.
> > > >>>>>
> > > >>>>> Yes adding map/unmap ops in pGPU drvier (I assume you are
> > > >>>>> refering to gpu_device_ops as implemented in Kirti's patch)
> > > >>>>> sounds a good idea, satisfying both: 1) keeping vGPU purely
> > > >>>>> virtual; 2) dealing with the Linux DMA API to achive hardware
> > > >>>>> IOMMU compatibility.
> > > >>>>>
> > > >>>>> PS, this has very little to do with pinning wholly or
> > > >>>>> partially. Intel KVMGT has once been had the whole guest
> > > >>>>> memory pinned, only because we used a spinlock, which can't
> > > >>>>> sleep at runtime.  We have removed that spinlock in our
> > > >>>>> another upstreaming effort, not here but for i915 driver, so
> > > >>>>> probably no biggie.
> > > >>>>>
> > > >>>>
> > > >>>> OK, then you guys don't need to pin everything. The next
> > > >>>> question will be if you can send the pinning request from your
> > > >>>> mediated driver backend to request memory pinning like we have
> > > >>>> demonstrated in the v3 patch, function vfio_pin_pages and
> > > >>>> vfio_unpin_pages?
> > > >>>>
> > > >>>
> > > >>> Jike can you confirm this statement? My feeling is that we don't
> > > >>> have such logic in our device model to figure out which pages
> > > >>> need to be pinned on demand. So currently pin-everything is same
> > > >>> requirement in both KVM and Xen side...
> > > >>
> > > >> [Correct me in case of any neglect:)]
> > > >>
> > > >> IMO the ultimate reason to pin a page, is for DMA. Accessing RAM
> > > >> from a GPU is certainly a DMA operation. The DMA facility of most
> > > >> platforms, IGD and NVIDIA GPU included, is not capable of
> > > >> faulting-handling-retrying.
> > > >>
> > > >> As for vGPU solutions like Nvidia and Intel provide, the memory
> > > >> address region used by Guest for GPU access, whenever Guest sets
> > > >> the mappings, it is intercepted by Host, so it's safe to only pin
> > > >> the page before it get used by Guest. This probably doesn't need
> > > >> device model to change :)
> > > >
> > > > Hi Jike
> > > >
> > > > Just out of curiosity, how does the host intercept this before it
> > > > goes on the bus?
> > > >
> > >
> > > Hi Neo,
> > >
> > > [apologize if I mis-expressed myself, bad English ..]
> > >
> > > I was talking about intercepting the setting-up of GPU page tables,
> > > not the DMA itself.  For current Intel GPUs, the page tables are
> > > MMIO registers or simply RAM pages, called the GTT (Graphics Translation
> > > Table); a write to a GTT entry from the Guest is always
> > > intercepted by the Host.
> > 
> > Hi Jike,
> > 
> > Thanks for the details. One more question: if the page tables are in guest RAM, how do you
> > intercept them from the host? I can see how they get intercepted when they are in an MMIO range.
> > 
> 
> We use the page tracking framework, which was recently added to KVM,
> to mark RAM pages as read-only so that write accesses are intercepted
> and forwarded to the device model.

Yes, I am aware of that patchset from Guangrong. So far the interfaces all
require a struct kvm *, copied from https://lkml.org/lkml/2015/11/30/644

- kvm_page_track_add_page(): add the page to the tracking pool after
  that later specified access on that page will be tracked

- kvm_page_track_remove_page(): remove the page from the tracking pool,
  the specified access on the page is not tracked after the last user is
  gone

void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
                enum kvm_page_track_mode mode);
void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
               enum kvm_page_track_mode mode);

I'm really curious how you are going to get access to the struct kvm *kvm, or are
you relying on userfaultfd to track the write faults only, as part of the
QEMU userfault thread?

Thanks,
Neo

> 
> Thanks
> Kevin
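
Using only the two declarations quoted above, wiring the GTT shadowing to the
page-track framework might look roughly like the sketch below. It assumes the
enum kvm_page_track_mode and a write-tracking mode named KVM_PAGE_TRACK_WRITE
come from the same patchset, kvmgt_kvm_hypothetical() is a placeholder precisely
because how the module gets at the struct kvm is the open question raised in
this mail, and struct vgpu stands in for the vendor-driver object.

#include <linux/kvm_host.h>

/* Declarations copied from the page-track patchset quoted above. */
void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
                             enum kvm_page_track_mode mode);
void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
                                enum kvm_page_track_mode mode);

struct vgpu;            /* vendor-driver object, details omitted */

/* Placeholder: how to obtain the struct kvm is exactly the open question. */
extern struct kvm *kvmgt_kvm_hypothetical(struct vgpu *vgpu);

/* Start intercepting guest writes to a page that holds GTT entries. */
static void track_gtt_page(struct vgpu *vgpu, gfn_t gfn)
{
        kvm_page_track_add_page(kvmgt_kvm_hypothetical(vgpu), gfn,
                                KVM_PAGE_TRACK_WRITE);
}

/* Stop tracking once the guest no longer uses the page for its GTT. */
static void untrack_gtt_page(struct vgpu *vgpu, gfn_t gfn)
{
        kvm_page_track_remove_page(kvmgt_kvm_hypothetical(vgpu), gfn,
                                   KVM_PAGE_TRACK_WRITE);
}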

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  7:24                         ` Neo Jia
@ 2016-05-13  8:39                           ` Dong Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Dong Jia @ 2016-05-13  8:39 UTC (permalink / raw)
  To: Neo Jia
  Cc: Ruan, Shuai, Song, Jike, kvm, Tian, Kevin, Kirti Wankhede,
	qemu-devel, Alex Williamson, kraxel, pbonzini, Dong Jia, Lv,
	Zhiyuan

On Fri, 13 May 2016 00:24:34 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Fri, May 13, 2016 at 03:10:22PM +0800, Dong Jia wrote:
> > On Thu, 12 May 2016 13:05:52 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > > On Thu, 12 May 2016 08:00:36 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > 
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Thursday, May 12, 2016 6:06 AM
> > > > > 
> > > > > On Wed, 11 May 2016 17:15:15 +0800
> > > > > Jike Song <jike.song@intel.com> wrote:
> > > > >   
> > > > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > > > > >>>> From: Song, Jike
> > > > > > >>>>
> > > > > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> > > > > > >>>> hardware. It just, as you said in another mail, "rather than
> > > > > > >>>> programming them into an IOMMU for a device, it simply stores the
> > > > > > >>>> translations for use by later requests".
> > > > > > >>>>
> > > > > > >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> > > > > > >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> > > > > > >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> > > > > > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > > > > >>>> translations without any knowledge about hardware IOMMU, how is the
> > > > > > >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> > > > > > >>>> by the IOMMU backend here)?
> > > > > > >>>>
> > > > > > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > > > > > >>>> pin & translate vaddr to PFN, then it will be very difficult for the
> > > > > > >>>> device model to figure out:
> > > > > > >>>>
> > > > > > >>>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> > > > > > >>>> 	2, for which page to call dma_unmap_page?
> > > > > > >>>>
> > > > > > >>>> --  
> > > > > > >>>
> > > > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > > > > >>> returned by dma_map_page.
> > > > > > >>>  
> > > > > > >>
> > > > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?  
> > > > > > >
> > > > > > > Hi Jike,
> > > > > > >
> > > > > > > With mediated passthru, you still can use hardware iommu, but more important
> > > > > > > that part is actually orthogonal to what we are discussing here as we will only
> > > > > > > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
> > > > > > > have pinned pages later with the help of above info, you can map it into the
> > > > > > > proper iommu domain if the system has configured so.
> > > > > > >  
> > > > > >
> > > > > > Hi Neo,
> > > > > >
> > > > > > Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> > > > > > but to find out whether a pfn was previously mapped or not, you have to
> > > > > > track it with another rbtree-alike data structure (the IOMMU driver simply
> > > > > > doesn't bother with tracking), that seems somehow duplicate with the vGPU
> > > > > > IOMMU backend we are discussing here.
> > > > > >
> > > > > > And it is also semantically correct for an IOMMU backend to handle both w/
> > > > > > and w/o an IOMMU hardware? :)  
> > > > > 
> > > > > A problem with the iommu doing the dma_map_page() though is for what
> > > > > device does it do this?  In the mediated case the vfio infrastructure
> > > > > is dealing with a software representation of a device.  For all we
> > > > > know that software model could transparently migrate from one physical
> > > > > GPU to another.  There may not even be a physical device backing
> > > > > the mediated device.  Those are details left to the vgpu driver itself.  
> > > > 
> > > > This is a fair argument. VFIO iommu driver simply serves user space
> > > > requests, where only vaddr<->iova (essentially gpa in kvm case) is
> > > > mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> > > > 
> > > > > 
> > > > > Perhaps one possibility would be to allow the vgpu driver to register
> > > > > map and unmap callbacks.  The unmap callback might provide the
> > > > > invalidation interface that we're so far missing.  The combination of
> > > > > map and unmap callbacks might simplify the Intel approach of pinning the
> > > > > entire VM memory space, ie. for each map callback do a translation
> > > > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > > > the translation.  There's still the problem of where that dma_addr_t
> > > > > from the dma_map_page is stored though.  Someone would need to keep
> > > > > track of iova to dma_addr_t.  The vfio iommu might be a place to do
> > > > > that since we're already tracking information based on iova, possibly
> > > > > in an opaque data element provided by the vgpu driver.  However, we're
> > > > > going to need to take a serious look at whether an rb-tree is the right
> > > > > data structure for the job.  It works well for the current type1
> > > > > functionality where we typically have tens of entries.  I think the
> > > > > NVIDIA model of sparse pinning the VM is pushing that up to tens of
> > > > > thousands.  If Intel intends to pin the entire guest, that's
> > > > > potentially tens of millions of tracked entries and I don't know that
> > > > > an rb-tree is the right tool for that job.  Thanks,
> > > > >   
> > > > 
> > > > Based on above thought I'm thinking whether below would work:
> > > > (let's use gpa to replace existing iova in type1 driver, while using iova
> > > > for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> > > > which matches existing vfio logic)
> > > > 
> > > > - No change to existing vfio_dma structure. VFIO still maintains gpa<->vaddr
> > > > mapping, in coarse-grained regions;
> > > > 
> > > > - Leverage same page accounting/pinning logic in type1 driver, which 
> > > > should be enough for 'pin-all' usage;
> > > > 
> > > > - Then main divergence point for vGPU would be in vfio_unmap_unpin
> > > > and vfio_iommu_map. I'm not sure whether it's easy to fake an 
> > > > iommu_domain for vGPU so same iommu_map/unmap can be reused.
> > > 
> > > This seems troublesome.  Kirti's version used numerous api-only tests
> > > to avoid these which made the code difficult to trace.  Clearly one
> > > option is to split out the common code so that a new mediated-type1
> > > backend skips this, but they thought they could clean it up without
> > > this, so we'll see what happens in the next version.
> > > 
> > > > If not, we may introduce two new map/unmap callbacks provided
> > > > specifically by vGPU core driver, as you suggested:
> > > > 
> > > > 	* vGPU core driver uses dma_map_page to map specified pfns:
> > > > 
> > > > 		o When IOMMU is enabled, we'll get an iova returned different
> > > > from pfn;
> > > > 		o When IOMMU is disabled, returned iova is same as pfn;
> > > 
> > > Either way each iova needs to be stored and we have a worst case of one
> > > iova per page of guest memory.
> > > 
> > > > 	* Then vGPU core driver just maintains its own gpa<->iova lookup
> > > > table (e.g. called vgpu_dma)
> > > > 
> > > > 	* Because each vfio_iommu_map invocation is about a contiguous 
> > > > region, we can expect same number of vgpu_dma entries as maintained 
> > > > for vfio_dma list;
> > > >
> > > > Then it's vGPU core driver's responsibility to provide gpa<->iova
> > > > lookup for vendor specific GPU driver. And we don't need worry about
> > > > tens of thousands of entries. Once we get this simple 'pin-all' model
> > > > ready, then it can be further extended to support 'pin-sparse'
> > > > scenario. We still maintain a top-level vgpu_dma list with each entry to
> > > > further link its own sparse mapping structure. In reality I don't expect
> > > > we really need to maintain per-page translation even with sparse pinning.
> > > 
> > > If you're trying to equate the scale of what we need to track vs what
> > > type1 currently tracks, they're significantly different.  Possible
> > > things we need to track include the pfn, the iova, and possibly a
> > > reference count or some sort of pinned page map.  In the pin-all model
> > > we can assume that every page is pinned on map and unpinned on unmap,
> > > so a reference count or map is unnecessary.  We can also assume that we
> > > can always regenerate the pfn with get_user_pages() from the vaddr, so
> > > we don't need to track that.  I don't see any way around tracking the
> > > iova.  The iommu can't tell us this like it can with the normal type1
> > > model because the pfn is the result of the translation, not the key for
> > > the translation. So we're always going to have between 1 and
> > > (size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
> > > manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> > > data structure tracking every iova.
> > > 
> > > Sparse mapping has the same issue but of course the tree of iovas is
> > > potentially incomplete and we need a way to determine where it's
> > > incomplete.  A page table rooted in the vgpu_dma and indexed by the
> > > offset from the start vaddr seems like the way to go here.  It's also
> > > possible that some mediated device models might store the iova in the
> > > command sent to the device and therefore be able to parse those entries
> > > back out to unmap them without storing them separately.  This might be
> > > how the s390 channel-io model would prefer to work.
> > Dear Alex:
> > 
> > For the s390 channel-io model, when an I/O instruction was intercepted
> > and issued to the device driver for further translation, the operand of
> > the instruction contents iovas only. Since iova is the key to locate an
> > entry in the database (r-b tree or whatever), we do can parse the
> > entries back out one by one when doing the unmap operation.
> >                  ^^^^^^^^^^
> > 
> > BTW, if the mediated-iommu backend can potentially offer a transaction
> > level support for the unmap operation, I believe it will benefit the
> > performance for this case.
> > 
> > e.g.:
> > handler = vfio_trasaction_begin();
> > foreach(iova in the command) {
> >     pfn = vfio_trasaction_map(handler, iova);
> >     do_io(pfn);
> > }
> 
> Hi Dong,
> 
> Could you please help me understand the performance benefit here? 
> 
> Is the perf argument coming from the searching of rbtree of the tracking data
> structure or something else?
> 
> For example you can do similar thing by the following sequence from your backend
> driver:
> 
>     vfio_pin_pages(gfn_list/iova_list /* in */, npages, prot, pfn_bases /* out */)
>     foreach (pfn)
>         do_io(pfn)
>     vfio_unpin_pages(pfn_bases)
Dear Neo:

FWIU, the channel-io driver could leverage these interfaces without
obvious feasibility issues. Since the current implementation of
vfio_unpin_pages iterates over @pfn_bases and finds the corresponding
entry in the rb tree for each pfn_base, I'm wondering if a dedicated
list of the entries for the whole @pfn_bases could offer us some
benefit. I have to say that I know it's too early to worry about
performance, and the current interfaces are fine for the channel-io case.
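
To make the idea a bit more concrete, here is a purely illustrative sketch
(all structure and function names are hypothetical) of keeping a per-call list
of pinned entries, so the unpin path can walk that list instead of doing one
rb-tree lookup per pfn_base:

#include <linux/list.h>
#include <linux/slab.h>

/* one pinned page; also linked into the backend's rb tree elsewhere */
struct pin_entry {
	unsigned long		pfn_base;
	struct list_head	next;	/* link in the per-call list */
};

/* returned by the pin call, consumed by the matching unpin call */
struct pin_batch {
	struct list_head	entries;
};

static void pin_batch_add(struct pin_batch *batch, struct pin_entry *e)
{
	list_add_tail(&e->next, &batch->entries);
}

/* unpin everything pinned in this batch without searching the rb tree */
static void pin_batch_release(struct pin_batch *batch)
{
	struct pin_entry *e, *tmp;

	list_for_each_entry_safe(e, tmp, &batch->entries, next) {
		list_del(&e->next);
		/* unpin e->pfn_base, drop its rb-tree entry, then free */
		kfree(e);
	}
}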

I'm also wondering if such an idea could contribute a little to your
discussion of how to manage the key-value mapping. If this is just
noise (sorry for that :<), please ignore it.

My main intention is to show up and elaborate a bit on the channel-io
use case, so you can see that there really is another user of the
mediated-iommu backend, and, as Alex mentioned before, getting rid
of the 'vgpu_dev' and other vgpu-specific stuff is indeed necessary. :>

> 
> Thanks,
> Neo
> 
> > 
> > /*
> >  * Expect to unmap all of the pfns mapped in this trasaction with the
> >  * next statement. The mediated-iommu backend could use handler as the
> >  * key to track the list of the entries.
> >  */
> > vfio_trasaction_unmap(handler);
> > vfio_trasaction_end(handler);
> > 
> > Not sure if this could benefit the vgpu sparse mapping use case though.
> 
> 
> 
> 
> 
> > 
> > >  That seems like
> > > further validation that such tracking is going to be dependent on the
> > > mediated driver itself and probably not something to centralize in a
> > > mediated iommu driver.  Thanks,
> > > 
> > > Alex
> > > 
> > 
> > 
> > 
> > --------
> > Dong Jia
> > 
> 



--------
Dong Jia

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  8:02                               ` [Qemu-devel] " Tian, Kevin
@ 2016-05-13  8:41                                 ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-13  8:41 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Song, Jike, Jike Song, Alex Williamson, Kirti Wankhede, pbonzini,
	kraxel, qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Fri, May 13, 2016 at 08:02:41AM +0000, Tian, Kevin wrote:
> > From: Neo Jia [mailto:cjia@nvidia.com]
> > Sent: Friday, May 13, 2016 3:38 PM
> > 
> > On Fri, May 13, 2016 at 07:13:44AM +0000, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:cjia@nvidia.com]
> > > > Sent: Friday, May 13, 2016 2:42 PM
> > > >
> > > >
> > > > >
> > > > > We possibly have the same requirement from the mediate driver backend:
> > > > >
> > > > > 	a) get a GFN, when guest try to tell hardware;
> > > > > 	b) consult the vfio iommu with that GFN[1]: will you find me a proper dma_addr?
> > > >
> > > > We will provide you the pfn via vfio_pin_pages, so you can map it for dma
> > > > purpose in your i915 driver, which is what we are doing today.
> > > >
> > >
> > > Can such 'map' operation be consolidated in vGPU core driver? I don't think
> > > Intel vGPU driver has any feature proactively relying on iommu. The reason
> > > why we keep talking iommu is just because the kernel may enable iommu
> > > for physical GPU so we need make sure our device model can work in such
> > > configuration. And this requirement should apply to all vendors, not Intel
> > > specific (like you said you are doing it already today).
> > 
> > Hi Kevin,
> > 
> > Actually, such requirement is already satisfied today as all vendor drivers
> > should transparently work with and without system iommu on bare-metal, right?
> > 
> > So I don't see any new requirement here, also such consolidation doesn't help
> > any but adding complexity to the system as vendor driver will not remove
> > their own dma_map_xxx functions as they are still required to support
> > non-mediated cases.
> > 
> 
> Thanks for your information, which makes it clearer where the difference is. :-)
> 
> Based on your description, looks you treat guest pages same as normal process
> pages, which all share the same code path when mapping as DMA target, so it
> is pointless to separate guest page map out to vGPU core driver. Is this
> understanding correct?

Yes.

It is Linux's responsibility to allocate the physical pages for the QEMU
process, which happen to be the guest physical memory that we might use as a
DMA target. From the device's point of view, it is just some physical
location it needs to hit.

> 
> In our side, so far guest pages are treated differently from normal process
> pages, which is the main reason why I asked whether we can consolidate that
> part. Looks now it's not necessary since it's already not a common requirement.

> 
> One additional question though. Jike already mentioned the need to shadow
> GPU MMU (called GTT table in Intel side) in our device model. 'shadow' here
> basically means we need translate from 'gfn' in guest pte to 'dmadr_t'
> as returned by dma_map_xxx. Based on gfn->pfn translation provided by
> VFIO (in your v3 driver), gfn->dmadr_t mapping can be constructed accordingly
> in the vendor driver. So do you have similar requirement like this? If yes, do
> you think any value to unify that translation structure or prefer to maintain
> it by vendor driver?

Yes, I think it would make sense to do this in the vendor driver, as it keeps
the type1 iommu clean - it will only track the gfn-to-pfn translation/pinning
(on the CPU side). Then you can reuse your existing driver code to map the pfn
as a DMA target.

Also, you can do some kind of optimization, such as keeping a small cache
within your device driver: if the gfn is already translated, there is no need
to query again.
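
A minimal sketch of that kind of per-device cache (assuming the
vfio_pin_pages() interface proposed in this series; struct my_vgpu and the
gtt_cache_* helpers are hypothetical, and error handling is omitted):

#include <linux/dma-mapping.h>
#include <linux/iommu.h>

/*
 * Translate a guest gfn to the dma_addr_t used in the shadow GTT, caching
 * the result so repeated lookups skip the pin/map path.
 */
static dma_addr_t shadow_gtt_translate(struct my_vgpu *vgpu, unsigned long gfn)
{
	unsigned long pfn;
	dma_addr_t daddr;
	int prot = IOMMU_READ | IOMMU_WRITE;	/* whatever prot flags apply */

	/* fast path: gfn already translated and mapped */
	if (gtt_cache_lookup(vgpu, gfn, &daddr))
		return daddr;

	/* pin the guest page via the proposed VFIO interface */
	vfio_pin_pages(&gfn, 1, prot, &pfn);

	/* map it for the physical GPU: HPA w/o iommu, IOVA w/ iommu */
	daddr = dma_map_page(vgpu->dev, pfn_to_page(pfn), 0, PAGE_SIZE,
			     DMA_BIDIRECTIONAL);

	gtt_cache_insert(vgpu, gfn, daddr);
	return daddr;
}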

Thanks,
Neo

> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  8:39                           ` [Qemu-devel] " Dong Jia
@ 2016-05-13  9:05                             ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-13  9:05 UTC (permalink / raw)
  To: Dong Jia
  Cc: Alex Williamson, Tian, Kevin, Ruan, Shuai, Song, Jike, kvm,
	qemu-devel, Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan

On Fri, May 13, 2016 at 04:39:37PM +0800, Dong Jia wrote:
> On Fri, 13 May 2016 00:24:34 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > On Fri, May 13, 2016 at 03:10:22PM +0800, Dong Jia wrote:
> > > On Thu, 12 May 2016 13:05:52 -0600
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > 
> > > > On Thu, 12 May 2016 08:00:36 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > 
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Thursday, May 12, 2016 6:06 AM
> > > > > > 
> > > > > > On Wed, 11 May 2016 17:15:15 +0800
> > > > > > Jike Song <jike.song@intel.com> wrote:
> > > > > >   
> > > > > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > > > > > >>>> From: Song, Jike
> > > > > > > >>>>
> > > > > > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> > > > > > > >>>> hardware. It just, as you said in another mail, "rather than
> > > > > > > >>>> programming them into an IOMMU for a device, it simply stores the
> > > > > > > >>>> translations for use by later requests".
> > > > > > > >>>>
> > > > > > > >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> > > > > > > >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> > > > > > > >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> > > > > > > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > > > > > >>>> translations without any knowledge about hardware IOMMU, how is the
> > > > > > > >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> > > > > > > >>>> by the IOMMU backend here)?
> > > > > > > >>>>
> > > > > > > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > > > > > > >>>> pin & translate vaddr to PFN, then it will be very difficult for the
> > > > > > > >>>> device model to figure out:
> > > > > > > >>>>
> > > > > > > >>>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> > > > > > > >>>> 	2, for which page to call dma_unmap_page?
> > > > > > > >>>>
> > > > > > > >>>> --  
> > > > > > > >>>
> > > > > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > > > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > > > > > >>> returned by dma_map_page.
> > > > > > > >>>  
> > > > > > > >>
> > > > > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?  
> > > > > > > >
> > > > > > > > Hi Jike,
> > > > > > > >
> > > > > > > > With mediated passthru, you still can use hardware iommu, but more important
> > > > > > > > that part is actually orthogonal to what we are discussing here as we will only
> > > > > > > > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
> > > > > > > > have pinned pages later with the help of above info, you can map it into the
> > > > > > > > proper iommu domain if the system has configured so.
> > > > > > > >  
> > > > > > >
> > > > > > > Hi Neo,
> > > > > > >
> > > > > > > Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> > > > > > > but to find out whether a pfn was previously mapped or not, you have to
> > > > > > > track it with another rbtree-alike data structure (the IOMMU driver simply
> > > > > > > doesn't bother with tracking), that seems somehow duplicate with the vGPU
> > > > > > > IOMMU backend we are discussing here.
> > > > > > >
> > > > > > > And it is also semantically correct for an IOMMU backend to handle both w/
> > > > > > > and w/o an IOMMU hardware? :)  
> > > > > > 
> > > > > > A problem with the iommu doing the dma_map_page() though is for what
> > > > > > device does it do this?  In the mediated case the vfio infrastructure
> > > > > > is dealing with a software representation of a device.  For all we
> > > > > > know that software model could transparently migrate from one physical
> > > > > > GPU to another.  There may not even be a physical device backing
> > > > > > the mediated device.  Those are details left to the vgpu driver itself.  
> > > > > 
> > > > > This is a fair argument. VFIO iommu driver simply serves user space
> > > > > requests, where only vaddr<->iova (essentially gpa in kvm case) is
> > > > > mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> > > > > 
> > > > > > 
> > > > > > Perhaps one possibility would be to allow the vgpu driver to register
> > > > > > map and unmap callbacks.  The unmap callback might provide the
> > > > > > invalidation interface that we're so far missing.  The combination of
> > > > > > map and unmap callbacks might simplify the Intel approach of pinning the
> > > > > > entire VM memory space, ie. for each map callback do a translation
> > > > > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > > > > the translation.  There's still the problem of where that dma_addr_t
> > > > > > from the dma_map_page is stored though.  Someone would need to keep
> > > > > > track of iova to dma_addr_t.  The vfio iommu might be a place to do
> > > > > > that since we're already tracking information based on iova, possibly
> > > > > > in an opaque data element provided by the vgpu driver.  However, we're
> > > > > > going to need to take a serious look at whether an rb-tree is the right
> > > > > > data structure for the job.  It works well for the current type1
> > > > > > functionality where we typically have tens of entries.  I think the
> > > > > > NVIDIA model of sparse pinning the VM is pushing that up to tens of
> > > > > > thousands.  If Intel intends to pin the entire guest, that's
> > > > > > potentially tens of millions of tracked entries and I don't know that
> > > > > > an rb-tree is the right tool for that job.  Thanks,
> > > > > >   
> > > > > 
> > > > > Based on above thought I'm thinking whether below would work:
> > > > > (let's use gpa to replace existing iova in type1 driver, while using iova
> > > > > for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> > > > > which matches existing vfio logic)
> > > > > 
> > > > > - No change to existing vfio_dma structure. VFIO still maintains gpa<->vaddr
> > > > > mapping, in coarse-grained regions;
> > > > > 
> > > > > - Leverage same page accounting/pinning logic in type1 driver, which 
> > > > > should be enough for 'pin-all' usage;
> > > > > 
> > > > > - Then main divergence point for vGPU would be in vfio_unmap_unpin
> > > > > and vfio_iommu_map. I'm not sure whether it's easy to fake an 
> > > > > iommu_domain for vGPU so same iommu_map/unmap can be reused.
> > > > 
> > > > This seems troublesome.  Kirti's version used numerous api-only tests
> > > > to avoid these which made the code difficult to trace.  Clearly one
> > > > option is to split out the common code so that a new mediated-type1
> > > > backend skips this, but they thought they could clean it up without
> > > > this, so we'll see what happens in the next version.
> > > > 
> > > > > If not, we may introduce two new map/unmap callbacks provided
> > > > > specifically by vGPU core driver, as you suggested:
> > > > > 
> > > > > 	* vGPU core driver uses dma_map_page to map specified pfns:
> > > > > 
> > > > > 		o When IOMMU is enabled, we'll get an iova returned different
> > > > > from pfn;
> > > > > 		o When IOMMU is disabled, returned iova is same as pfn;
> > > > 
> > > > Either way each iova needs to be stored and we have a worst case of one
> > > > iova per page of guest memory.
> > > > 
> > > > > 	* Then vGPU core driver just maintains its own gpa<->iova lookup
> > > > > table (e.g. called vgpu_dma)
> > > > > 
> > > > > 	* Because each vfio_iommu_map invocation is about a contiguous 
> > > > > region, we can expect same number of vgpu_dma entries as maintained 
> > > > > for vfio_dma list;
> > > > >
> > > > > Then it's vGPU core driver's responsibility to provide gpa<->iova
> > > > > lookup for vendor specific GPU driver. And we don't need worry about
> > > > > tens of thousands of entries. Once we get this simple 'pin-all' model
> > > > > ready, then it can be further extended to support 'pin-sparse'
> > > > > scenario. We still maintain a top-level vgpu_dma list with each entry to
> > > > > further link its own sparse mapping structure. In reality I don't expect
> > > > > we really need to maintain per-page translation even with sparse pinning.
> > > > 
> > > > If you're trying to equate the scale of what we need to track vs what
> > > > type1 currently tracks, they're significantly different.  Possible
> > > > things we need to track include the pfn, the iova, and possibly a
> > > > reference count or some sort of pinned page map.  In the pin-all model
> > > > we can assume that every page is pinned on map and unpinned on unmap,
> > > > so a reference count or map is unnecessary.  We can also assume that we
> > > > can always regenerate the pfn with get_user_pages() from the vaddr, so
> > > > we don't need to track that.  I don't see any way around tracking the
> > > > iova.  The iommu can't tell us this like it can with the normal type1
> > > > model because the pfn is the result of the translation, not the key for
> > > > the translation. So we're always going to have between 1 and
> > > > (size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
> > > > manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> > > > data structure tracking every iova.
> > > > 
> > > > Sparse mapping has the same issue but of course the tree of iovas is
> > > > potentially incomplete and we need a way to determine where it's
> > > > incomplete.  A page table rooted in the vgpu_dma and indexed by the
> > > > offset from the start vaddr seems like the way to go here.  It's also
> > > > possible that some mediated device models might store the iova in the
> > > > command sent to the device and therefore be able to parse those entries
> > > > back out to unmap them without storing them separately.  This might be
> > > > how the s390 channel-io model would prefer to work.
> > > Dear Alex:
> > > 
> > > For the s390 channel-io model, when an I/O instruction was intercepted
> > > and issued to the device driver for further translation, the operand of
> > > the instruction contents iovas only. Since iova is the key to locate an
> > > entry in the database (r-b tree or whatever), we do can parse the
> > > entries back out one by one when doing the unmap operation.
> > >                  ^^^^^^^^^^
> > > 
> > > BTW, if the mediated-iommu backend can potentially offer a transaction
> > > level support for the unmap operation, I believe it will benefit the
> > > performance for this case.
> > > 
> > > e.g.:
> > > handler = vfio_trasaction_begin();
> > > foreach(iova in the command) {
> > >     pfn = vfio_trasaction_map(handler, iova);
> > >     do_io(pfn);
> > > }
> > 
> > Hi Dong,
> > 
> > Could you please help me understand the performance benefit here? 
> > 
> > Is the perf argument coming from the searching of rbtree of the tracking data
> > structure or something else?
> > 
> > For example you can do similar thing by the following sequence from your backend
> > driver:
> > 
> >     vfio_pin_pages(gfn_list/iova_list /* in */, npages, prot, pfn_bases /* out */)
> >     foreach (pfn)
> >         do_io(pfn)
> >     vfio_unpin_pages(pfn_bases)
> Dear Neo:
> 
> FWIU, the channel-io driver could leverage these interfaces without
> obvious feasibility issues. Since the implementation of the current
> vfio_unpin_pages iterates @pfn_bases and find the corresponding entry
> from the rb tree for each of the pfn_base, I'm wondering if a dedicated
> list of the entries for the whole @pfn_bases could offer us some
> benefits. I have to say that I know it's too early to consider the perf
> , and the current interfaces are fine for the channel-io case.

Hi Dong,

We should definitely be mindful of data structure performance, especially when
dealing with the kernel. For now we haven't done any performance analysis of
the current rbtree implementation; later we will definitely run it through
large guest RAM configurations, multiple virtual device cases, etc. to
collect data.

Regarding your use case, may I ask if there will be concurrent command streams
running for the same VM? If yes, those two transaction requests (if we
implement them) will compete not only for the rbtree lock but also for the
GUP locks.

Also, what is the typical guest RAM we are talking about here for your use
case, and do you have a rough estimate of the active working set of those
DMA pages?

> 
> I'm also wondering if such an idea could contribute a little to your
> discussion of the management of the key-value mapping issue. If this is
> just a noise (sorry for that :<), please ignore it.
> 
> My major intention is to show up, and to elaborate a bit about the
> channel-io use case. So you will see that, there is really another user
> of the mediate-iommu backend, and as Alex mentioned before, getting rid
> of the 'vgpu_dev' and other vgpu specific stuff is indeed necessary. :>

Definitely - we are already changing the module/variable names to reflect
this more general purpose.

Thanks,
Neo

> 
> > 
> > Thanks,
> > Neo
> > 
> > > 
> > > /*
> > >  * Expect to unmap all of the pfns mapped in this trasaction with the
> > >  * next statement. The mediated-iommu backend could use handler as the
> > >  * key to track the list of the entries.
> > >  */
> > > vfio_trasaction_unmap(handler);
> > > vfio_trasaction_end(handler);
> > > 
> > > Not sure if this could benefit the vgpu sparse mapping use case though.
> > 
> > 
> > 
> > 
> > 
> > > 
> > > >  That seems like
> > > > further validation that such tracking is going to be dependent on the
> > > > mediated driver itself and probably not something to centralize in a
> > > > mediated iommu driver.  Thanks,
> > > > 
> > > > Alex
> > > > 
> > > 
> > > 
> > > 
> > > --------
> > > Dong Jia
> > > 
> > 
> 
> 
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
@ 2016-05-13  9:05                             ` Neo Jia
  0 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-13  9:05 UTC (permalink / raw)
  To: Dong Jia
  Cc: Alex Williamson, Tian, Kevin, Ruan, Shuai, Song, Jike, kvm,
	qemu-devel, Kirti Wankhede, kraxel, pbonzini, Lv,	Zhiyuan

On Fri, May 13, 2016 at 04:39:37PM +0800, Dong Jia wrote:
> On Fri, 13 May 2016 00:24:34 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > On Fri, May 13, 2016 at 03:10:22PM +0800, Dong Jia wrote:
> > > On Thu, 12 May 2016 13:05:52 -0600
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > 
> > > > On Thu, 12 May 2016 08:00:36 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > 
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Thursday, May 12, 2016 6:06 AM
> > > > > > 
> > > > > > On Wed, 11 May 2016 17:15:15 +0800
> > > > > > Jike Song <jike.song@intel.com> wrote:
> > > > > >   
> > > > > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > > > > > >>>> From: Song, Jike
> > > > > > > >>>>
> > > > > > > >>>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> > > > > > > >>>> hardware. It just, as you said in another mail, "rather than
> > > > > > > >>>> programming them into an IOMMU for a device, it simply stores the
> > > > > > > >>>> translations for use by later requests".
> > > > > > > >>>>
> > > > > > > >>>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> > > > > > > >>>> Otherwise, if IOMMU is present, the gfx driver eventually programs
> > > > > > > >>>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> > > > > > > >>>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > > > > > >>>> translations without any knowledge about hardware IOMMU, how is the
> > > > > > > >>>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> > > > > > > >>>> by the IOMMU backend here)?
> > > > > > > >>>>
> > > > > > > >>>> If things go as guessed above, as vfio_pin_pages() indicates, it
> > > > > > > >>>> pin & translate vaddr to PFN, then it will be very difficult for the
> > > > > > > >>>> device model to figure out:
> > > > > > > >>>>
> > > > > > > >>>> 	1, for a given GPA, how to avoid calling dma_map_page multiple times?
> > > > > > > >>>> 	2, for which page to call dma_unmap_page?
> > > > > > > >>>>
> > > > > > > >>>> --  
> > > > > > > >>>
> > > > > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > > > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > > > > > >>> returned by dma_map_page.
> > > > > > > >>>  
> > > > > > > >>
> > > > > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?  
> > > > > > > >
> > > > > > > > Hi Jike,
> > > > > > > >
> > > > > > > > With mediated passthru, you still can use hardware iommu, but more important
> > > > > > > > that part is actually orthogonal to what we are discussing here as we will only
> > > > > > > > cache the mapping between <gfn (iova if guest has iommu), (qemu) va>, once we
> > > > > > > > have pinned pages later with the help of above info, you can map it into the
> > > > > > > > proper iommu domain if the system has configured so.
> > > > > > > >  
> > > > > > >
> > > > > > > Hi Neo,
> > > > > > >
> > > > > > > Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> > > > > > > but to find out whether a pfn was previously mapped or not, you have to
> > > > > > > track it with another rbtree-alike data structure (the IOMMU driver simply
> > > > > > > doesn't bother with tracking), that seems somehow duplicate with the vGPU
> > > > > > > IOMMU backend we are discussing here.
> > > > > > >
> > > > > > > And it is also semantically correct for an IOMMU backend to handle both w/
> > > > > > > and w/o an IOMMU hardware? :)  
> > > > > > 
> > > > > > A problem with the iommu doing the dma_map_page() though is for what
> > > > > > device does it do this?  In the mediated case the vfio infrastructure
> > > > > > is dealing with a software representation of a device.  For all we
> > > > > > know that software model could transparently migrate from one physical
> > > > > > GPU to another.  There may not even be a physical device backing
> > > > > > the mediated device.  Those are details left to the vgpu driver itself.  
> > > > > 
> > > > > This is a fair argument. VFIO iommu driver simply serves user space
> > > > > requests, where only vaddr<->iova (essentially gpa in kvm case) is
> > > > > mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> > > > > 
> > > > > > 
> > > > > > Perhaps one possibility would be to allow the vgpu driver to register
> > > > > > map and unmap callbacks.  The unmap callback might provide the
> > > > > > invalidation interface that we're so far missing.  The combination of
> > > > > > map and unmap callbacks might simplify the Intel approach of pinning the
> > > > > > entire VM memory space, ie. for each map callback do a translation
> > > > > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > > > > the translation.  There's still the problem of where that dma_addr_t
> > > > > > from the dma_map_page is stored though.  Someone would need to keep
> > > > > > track of iova to dma_addr_t.  The vfio iommu might be a place to do
> > > > > > that since we're already tracking information based on iova, possibly
> > > > > > in an opaque data element provided by the vgpu driver.  However, we're
> > > > > > going to need to take a serious look at whether an rb-tree is the right
> > > > > > data structure for the job.  It works well for the current type1
> > > > > > functionality where we typically have tens of entries.  I think the
> > > > > > NVIDIA model of sparse pinning the VM is pushing that up to tens of
> > > > > > thousands.  If Intel intends to pin the entire guest, that's
> > > > > > potentially tens of millions of tracked entries and I don't know that
> > > > > > an rb-tree is the right tool for that job.  Thanks,
> > > > > >   
> > > > > 
> > > > > Based on above thought I'm thinking whether below would work:
> > > > > (let's use gpa to replace existing iova in type1 driver, while using iova
> > > > > for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> > > > > which matches existing vfio logic)
> > > > > 
> > > > > - No change to existing vfio_dma structure. VFIO still maintains gpa<->vaddr
> > > > > mapping, in coarse-grained regions;
> > > > > 
> > > > > - Leverage same page accounting/pinning logic in type1 driver, which 
> > > > > should be enough for 'pin-all' usage;
> > > > > 
> > > > > - Then main divergence point for vGPU would be in vfio_unmap_unpin
> > > > > and vfio_iommu_map. I'm not sure whether it's easy to fake an 
> > > > > iommu_domain for vGPU so same iommu_map/unmap can be reused.
> > > > 
> > > > This seems troublesome.  Kirti's version used numerous api-only tests
> > > > to avoid these which made the code difficult to trace.  Clearly one
> > > > option is to split out the common code so that a new mediated-type1
> > > > backend skips this, but they thought they could clean it up without
> > > > this, so we'll see what happens in the next version.
> > > > 
> > > > > If not, we may introduce two new map/unmap callbacks provided
> > > > > specifically by vGPU core driver, as you suggested:
> > > > > 
> > > > > 	* vGPU core driver uses dma_map_page to map specified pfns:
> > > > > 
> > > > > 		o When IOMMU is enabled, we'll get an iova returned different
> > > > > from pfn;
> > > > > 		o When IOMMU is disabled, returned iova is same as pfn;
> > > > 
> > > > Either way each iova needs to be stored and we have a worst case of one
> > > > iova per page of guest memory.
> > > > 
> > > > > 	* Then vGPU core driver just maintains its own gpa<->iova lookup
> > > > > table (e.g. called vgpu_dma)
> > > > > 
> > > > > 	* Because each vfio_iommu_map invocation is about a contiguous 
> > > > > region, we can expect same number of vgpu_dma entries as maintained 
> > > > > for vfio_dma list;
> > > > >
> > > > > Then it's vGPU core driver's responsibility to provide gpa<->iova
> > > > > lookup for vendor specific GPU driver. And we don't need worry about
> > > > > tens of thousands of entries. Once we get this simple 'pin-all' model
> > > > > ready, then it can be further extended to support 'pin-sparse'
> > > > > scenario. We still maintain a top-level vgpu_dma list with each entry to
> > > > > further link its own sparse mapping structure. In reality I don't expect
> > > > > we really need to maintain per-page translation even with sparse pinning.
> > > > 
> > > > If you're trying to equate the scale of what we need to track vs what
> > > > type1 currently tracks, they're significantly different.  Possible
> > > > things we need to track include the pfn, the iova, and possibly a
> > > > reference count or some sort of pinned page map.  In the pin-all model
> > > > we can assume that every page is pinned on map and unpinned on unmap,
> > > > so a reference count or map is unnecessary.  We can also assume that we
> > > > can always regenerate the pfn with get_user_pages() from the vaddr, so
> > > > we don't need to track that.  I don't see any way around tracking the
> > > > iova.  The iommu can't tell us this like it can with the normal type1
> > > > model because the pfn is the result of the translation, not the key for
> > > > the translation. So we're always going to have between 1 and
> > > > (size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
> > > > manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> > > > data structure tracking every iova.
> > > > 
> > > > Sparse mapping has the same issue but of course the tree of iovas is
> > > > potentially incomplete and we need a way to determine where it's
> > > > incomplete.  A page table rooted in the vgpu_dma and indexed by the
> > > > offset from the start vaddr seems like the way to go here.  It's also
> > > > possible that some mediated device models might store the iova in the
> > > > command sent to the device and therefore be able to parse those entries
> > > > back out to unmap them without storing them separately.  This might be
> > > > how the s390 channel-io model would prefer to work.
> > > Dear Alex:
> > > 
> > > For the s390 channel-io model, when an I/O instruction was intercepted
> > > and issued to the device driver for further translation, the operand of
> > > the instruction contains iovas only. Since the iova is the key to locate an
> > > entry in the database (rb-tree or whatever), we can indeed parse the
> > > entries back out one by one when doing the unmap operation.
> > >                  ^^^^^^^^^^
> > > 
> > > BTW, if the mediated-iommu backend can potentially offer transaction-level
> > > support for the unmap operation, I believe it will benefit the performance
> > > for this case.
> > > 
> > > e.g.:
> > > handler = vfio_transaction_begin();
> > > foreach(iova in the command) {
> > >     pfn = vfio_transaction_map(handler, iova);
> > >     do_io(pfn);
> > > }
> > 
> > Hi Dong,
> > 
> > Could you please help me understand the performance benefit here? 
> > 
> > Is the perf argument coming from the searching of rbtree of the tracking data
> > structure or something else?
> > 
> > For example you can do similar thing by the following sequence from your backend
> > driver:
> > 
> >     vfio_pin_pages(gfn_list/iova_list /* in */, npages, prot, pfn_bases /* out */)
> >     foreach (pfn)
> >         do_io(pfn)
> >     vfio_unpin_pages(pfn_bases)
> Dear Neo:
> 
> FWIU, the channel-io driver could leverage these interfaces without
> obvious feasibility issues. Since the implementation of the current
> vfio_unpin_pages iterates @pfn_bases and finds the corresponding entry
> in the rb tree for each pfn_base, I'm wondering if a dedicated
> list of the entries for the whole @pfn_bases could offer us some
> benefits. I have to say that I know it's too early to consider the
> performance, and the current interfaces are fine for the channel-io case.

Hi Dong,

We should definitely be mindful about the data structure performance, especially
when dealing with the kernel. But for now, we haven't done any performance
analysis yet for the current rbtree implementation; later we will definitely run
it through large guest RAM configurations, multiple virtual device cases, etc. to
collect data.

Regarding your use case, may I ask if there will be concurrent command streams
running for the same VM? If yes, those two transaction requests (if we implement
them) will compete not only for the rbtree lock but also for the GUP locks.

Also, what is the typical guest RAM we are talking about here for your use case,
and do you have a rough estimate of the active working set of those DMA pages?

> 
> I'm also wondering if such an idea could contribute a little to your
> discussion of the management of the key-value mapping issue. If this is
> just a noise (sorry for that :<), please ignore it.
> 
> My main intention is to show up and to elaborate a bit on the
> channel-io use case, so you will see that there is really another user
> of the mediated-iommu backend, and, as Alex mentioned before, getting rid
> of the 'vgpu_dev' and other vgpu-specific stuff is indeed necessary. :>

Definitely, we are changing the module/variable names to reflect this general
purpose already. 

Thanks,
Neo

> 
> > 
> > Thanks,
> > Neo
> > 
> > > 
> > > /*
> > >  * Expect to unmap all of the pfns mapped in this transaction with the
> > >  * next statement. The mediated-iommu backend could use handler as the
> > >  * key to track the list of the entries.
> > >  */
> > > vfio_transaction_unmap(handler);
> > > vfio_transaction_end(handler);
> > > 
> > > Not sure if this could benefit the vgpu sparse mapping use case though.
> > 
> > 
> > 
> > 
> > 
> > > 
> > > >  That seems like
> > > > further validation that such tracking is going to be dependent on the
> > > > mediated driver itself and probably not something to centralize in a
> > > > mediated iommu driver.  Thanks,
> > > > 
> > > > Alex
> > > > 
> > > 
> > > 
> > > 
> > > --------
> > > Dong Jia
> > > 
> > 
> 
> 
> 
> --------
> Dong Jia
> 

^ permalink raw reply	[flat|nested] 154+ messages in thread
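To make the sequence Neo sketches above concrete: a minimal backend-driver flow using the proposed interface. The vfio_pin_pages()/vfio_unpin_pages() prototypes below are assumptions inferred from that sequence (this API did not exist at the time), and do_io() is the thread's placeholder for the device-specific work.

#include <linux/slab.h>
#include <linux/iommu.h>        /* IOMMU_READ/IOMMU_WRITE assumed as the prot bits */

/* Assumed prototypes, following the sequence above; not an existing API. */
extern long vfio_pin_pages(unsigned long *gfns, long npages, int prot,
                           unsigned long *pfn_bases);
extern long vfio_unpin_pages(unsigned long *pfn_bases, long npages);
extern void do_io(unsigned long pfn);   /* placeholder from the thread */

static int backend_dma(unsigned long *gfns, long npages, bool write)
{
        unsigned long *pfns;
        long pinned, i;
        int ret = 0;

        pfns = kcalloc(npages, sizeof(*pfns), GFP_KERNEL);
        if (!pfns)
                return -ENOMEM;

        /* Pin the whole working set for this request in one call. */
        pinned = vfio_pin_pages(gfns, npages,
                                IOMMU_READ | (write ? IOMMU_WRITE : 0), pfns);
        if (pinned != npages) {
                ret = pinned < 0 ? pinned : -EFAULT;
                goto out;
        }

        for (i = 0; i < npages; i++)
                do_io(pfns[i]);                 /* device-specific work */

        vfio_unpin_pages(pfns, npages);         /* releases the translations */
out:
        kfree(pfns);
        return ret;
}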

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  8:31                                   ` [Qemu-devel] " Neo Jia
@ 2016-05-13  9:23                                     ` Jike Song
  -1 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-13  9:23 UTC (permalink / raw)
  To: Neo Jia
  Cc: Tian, Kevin, Jike Song, Alex Williamson, Kirti Wankhede,
	pbonzini, kraxel, qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On 05/13/2016 04:31 PM, Neo Jia wrote:
> On Fri, May 13, 2016 at 07:45:14AM +0000, Tian, Kevin wrote:
>>
>> We use page tracking framework, which is newly added to KVM recently,
>> to mark RAM pages as read-only so write accesses are intercepted to 
>> device model.
> 
> Yes, I am aware of that patchset from Guangrong. So far the interface are all
> requiring struct *kvm, copied from https://lkml.org/lkml/2015/11/30/644
> 
> - kvm_page_track_add_page(): add the page to the tracking pool after
>   that later specified access on that page will be tracked
> 
> - kvm_page_track_remove_page(): remove the page from the tracking pool,
>   the specified access on the page is not tracked after the last user is
>   gone
> 
> void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
>                 enum kvm_page_track_mode mode);
> void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
>                enum kvm_page_track_mode mode);
> 
> Really curious how you are going to have access to the struct kvm *kvm, or you
> are relying on the userfaultfd to track the write faults only as part of the
> QEMU userfault thread?
>

Hi Neo,

For a vGPU used as a device for a KVM guest, there will be interfaces
wrapped or implemented in the KVM layer, as a counterpart to
the interfaces for Xen. That is where the KVM-related code is supposed to be.

--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 154+ messages in thread
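Taking the prototypes quoted above at face value, write-protecting a guest frame from whatever KVM-side wrapper ends up holding the struct kvm pointer would look roughly like this. KVM_PAGE_TRACK_WRITE is the assumed name of the write-tracking mode in that patchset, and delivery of the intercepted writes (a page-track notifier) is not shown.

#include <linux/kvm_host.h>
/* enum kvm_page_track_mode and the add/remove prototypes come from the
 * page-track patchset referenced above. */

static void vgpu_track_gtt_page(struct kvm *kvm, gfn_t gfn)
{
        /* Guest writes to this gfn now fault and are forwarded to the
         * device model via a page-track notifier (not shown). */
        kvm_page_track_add_page(kvm, gfn, KVM_PAGE_TRACK_WRITE);
}

static void vgpu_untrack_gtt_page(struct kvm *kvm, gfn_t gfn)
{
        kvm_page_track_remove_page(kvm, gfn, KVM_PAGE_TRACK_WRITE);
}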

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-12 20:12                       ` [Qemu-devel] " Neo Jia
@ 2016-05-13  9:46                         ` Jike Song
  -1 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-13  9:46 UTC (permalink / raw)
  To: Neo Jia
  Cc: Alex Williamson, Tian, Kevin, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On 05/13/2016 04:12 AM, Neo Jia wrote:
> On Thu, May 12, 2016 at 01:05:52PM -0600, Alex Williamson wrote:
>>
>> If you're trying to equate the scale of what we need to track vs what
>> type1 currently tracks, they're significantly different.  Possible
>> things we need to track include the pfn, the iova, and possibly a
>> reference count or some sort of pinned page map.  In the pin-all model
>> we can assume that every page is pinned on map and unpinned on unmap,
>> so a reference count or map is unnecessary.  We can also assume that we
>> can always regenerate the pfn with get_user_pages() from the vaddr, so
>> we don't need to track that.  
> 
> Hi Alex,
> 
> Thanks for pointing this out, we will not track those in our next rev and
> get_user_pages will be used from the vaddr as you suggested to handle the
> single VM with both passthru + mediated device case.
>

Just a gut feeling:

Calling GUP every time for a particular vaddr means locking mm->mmap_sem
every time for a particular process. If the VM has dozens of VCPUs, which
is not rare, the semaphore is likely to become the bottleneck.


--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 154+ messages in thread
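Spelled out, the per-access path Jike is worried about looks roughly like the sketch below (get_user_pages_remote() with its v4.6-era prototype; treat the exact signature as an assumption). The down_read() of mm->mmap_sem is the part every vcpu would contend on if each guest DMA address were translated on demand this way.

#include <linux/mm.h>
#include <linux/sched.h>

/* Translate one guest vaddr into a pinned host page on demand. */
static long translate_one(struct task_struct *tsk, struct mm_struct *mm,
                          unsigned long vaddr, int write, struct page **page)
{
        long ret;

        down_read(&mm->mmap_sem);       /* contended with every other GUP user */
        ret = get_user_pages_remote(tsk, mm, vaddr, 1, write, 0, page, NULL);
        up_read(&mm->mmap_sem);

        return ret;     /* 1 on success; caller does put_page() when finished */
}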

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  9:46                         ` [Qemu-devel] " Jike Song
@ 2016-05-13 15:48                           ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-13 15:48 UTC (permalink / raw)
  To: Jike Song
  Cc: Alex Williamson, Tian, Kevin, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Fri, May 13, 2016 at 05:46:17PM +0800, Jike Song wrote:
> On 05/13/2016 04:12 AM, Neo Jia wrote:
> > On Thu, May 12, 2016 at 01:05:52PM -0600, Alex Williamson wrote:
> >>
> >> If you're trying to equate the scale of what we need to track vs what
> >> type1 currently tracks, they're significantly different.  Possible
> >> things we need to track include the pfn, the iova, and possibly a
> >> reference count or some sort of pinned page map.  In the pin-all model
> >> we can assume that every page is pinned on map and unpinned on unmap,
> >> so a reference count or map is unnecessary.  We can also assume that we
> >> can always regenerate the pfn with get_user_pages() from the vaddr, so
> >> we don't need to track that.  
> > 
> > Hi Alex,
> > 
> > Thanks for pointing this out, we will not track those in our next rev and
> > get_user_pages will be used from the vaddr as you suggested to handle the
> > single VM with both passthru + mediated device case.
> >
> 
> Just a gut feeling:
> 
> Calling GUP every time for a particular vaddr, means locking mm->mmap_sem
> every time for a particular process. If the VM has dozens of VCPU, which
> is not rare, the semaphore is likely to be the bottleneck.

Hi Jike,

We do need to hold the lock of mm->mmap_sem for the VMM/QEMU process, but I
don't quite follow the reasoning with "dozens of vcpus". One situation that I
can think of is that we have other threads competing for the mmap_sem of the
VMM/QEMU process within the KVM kernel, such as hva_to_pfn; after a quick search
it seems that is mostly used by the ioctl "KVM_ASSIGN_PCI_DEVICE".

We will definitely conduct performance analysis with large configuration on
servers with E5-2697 v4. :-)

Thanks,
Neo

> 
> 
> --
> Thanks,
> Jike
> 

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  9:23                                     ` [Qemu-devel] " Jike Song
@ 2016-05-13 15:50                                       ` Neo Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Neo Jia @ 2016-05-13 15:50 UTC (permalink / raw)
  To: Jike Song
  Cc: Tian, Kevin, Jike Song, Alex Williamson, Kirti Wankhede,
	pbonzini, kraxel, qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Fri, May 13, 2016 at 05:23:44PM +0800, Jike Song wrote:
> On 05/13/2016 04:31 PM, Neo Jia wrote:
> > On Fri, May 13, 2016 at 07:45:14AM +0000, Tian, Kevin wrote:
> >>
> >> We use page tracking framework, which is newly added to KVM recently,
> >> to mark RAM pages as read-only so write accesses are intercepted to 
> >> device model.
> > 
> > Yes, I am aware of that patchset from Guangrong. So far the interface are all
> > requiring struct *kvm, copied from https://lkml.org/lkml/2015/11/30/644
> > 
> > - kvm_page_track_add_page(): add the page to the tracking pool after
> >   that later specified access on that page will be tracked
> > 
> > - kvm_page_track_remove_page(): remove the page from the tracking pool,
> >   the specified access on the page is not tracked after the last user is
> >   gone
> > 
> > void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
> >                 enum kvm_page_track_mode mode);
> > void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
> >                enum kvm_page_track_mode mode);
> > 
> > Really curious how you are going to have access to the struct kvm *kvm, or you
> > are relying on the userfaultfd to track the write faults only as part of the
> > QEMU userfault thread?
> >
> 
> Hi Neo,
> 
> For the vGPU used as a device for KVM guest, there will be interfaces
> wrapped or implemented in KVM layer, as a rival thing diverted from
> the interfaces for Xen. That is where the KVM related code supposed to be.

Hi Jike,

Is this discussed anywhere on the mailing list already? Sorry if I have missed
that conversation.

Thanks,
Neo

> 
> --
> Thanks,
> Jike

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  3:55                       ` [Qemu-devel] " Tian, Kevin
@ 2016-05-13 16:16                         ` Alex Williamson
  -1 siblings, 0 replies; 154+ messages in thread
From: Alex Williamson @ 2016-05-13 16:16 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Song, Jike, Neo Jia, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On Fri, 13 May 2016 03:55:09 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, May 13, 2016 3:06 AM
> >   
> > > >  
> > >
> > > Based on above thought I'm thinking whether below would work:
> > > (let's use gpa to replace existing iova in type1 driver, while using iova
> > > for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> > > which matches existing vfio logic)
> > >
> > > - No change to existing vfio_dma structure. VFIO still maintains gpa<->vaddr
> > > mapping, in coarse-grained regions;
> > >
> > > - Leverage same page accounting/pinning logic in type1 driver, which
> > > should be enough for 'pin-all' usage;
> > >
> > > - Then main divergence point for vGPU would be in vfio_unmap_unpin
> > > and vfio_iommu_map. I'm not sure whether it's easy to fake an
> > > iommu_domain for vGPU so same iommu_map/unmap can be reused.  
> > 
> > This seems troublesome.  Kirti's version used numerous api-only tests
> > to avoid these which made the code difficult to trace.  Clearly one
> > option is to split out the common code so that a new mediated-type1
> > backend skips this, but they thought they could clean it up without
> > this, so we'll see what happens in the next version.
> >   
> > > If not, we may introduce two new map/unmap callbacks provided
> > > specifically by vGPU core driver, as you suggested:
> > >
> > > 	* vGPU core driver uses dma_map_page to map specified pfns:
> > >
> > > 		o When IOMMU is enabled, we'll get an iova returned different
> > > from pfn;
> > > 		o When IOMMU is disabled, returned iova is same as pfn;  
> > 
> > Either way each iova needs to be stored and we have a worst case of one
> > iova per page of guest memory.
> >   
> > > 	* Then vGPU core driver just maintains its own gpa<->iova lookup
> > > table (e.g. called vgpu_dma)
> > >
> > > 	* Because each vfio_iommu_map invocation is about a contiguous
> > > region, we can expect same number of vgpu_dma entries as maintained
> > > for vfio_dma list;
> > >
> > > Then it's vGPU core driver's responsibility to provide gpa<->iova
> > > lookup for vendor specific GPU driver. And we don't need worry about
> > > tens of thousands of entries. Once we get this simple 'pin-all' model
> > > ready, then it can be further extended to support 'pin-sparse'
> > > scenario. We still maintain a top-level vgpu_dma list with each entry to
> > > further link its own sparse mapping structure. In reality I don't expect
> > > we really need to maintain per-page translation even with sparse pinning.  
> > 
> > If you're trying to equate the scale of what we need to track vs what
> > type1 currently tracks, they're significantly different.  Possible
> > things we need to track include the pfn, the iova, and possibly a
> > reference count or some sort of pinned page map.  In the pin-all model
> > we can assume that every page is pinned on map and unpinned on unmap,
> > so a reference count or map is unnecessary.  We can also assume that we
> > can always regenerate the pfn with get_user_pages() from the vaddr, so
> > we don't need to track that.  I don't see any way around tracking the
> > iova.  The iommu can't tell us this like it can with the normal type1
> > model because the pfn is the result of the translation, not the key for
> > the translation. So we're always going to have between 1 and
> > (size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
> > manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> > data structure tracking every iova.  
> 
> There is one option. We may use alloc_iova to reserve continuous iova
> range for each vgpu_dma range and then use iommu_map/unmap to
> write iommu ptes later upon map request (then could be same #entries
> as vfio_dma compared to unbounded entries when using dma_map_page). 
> Of course this needs to be done in vGPU core driver, since vfio type1 only 
> sees a faked iommu domain.

I'm not sure this is really how iova domains work.  There's only one
iova domain per iommu domain using the dma-iommu API, and
iommu_map/unmap are part of a different API.  iova domain may be an
interesting solution though.
 
> > Sparse mapping has the same issue but of course the tree of iovas is
> > potentially incomplete and we need a way to determine where it's
> > incomplete.  A page table rooted in the vgpu_dma and indexed by the
> > offset from the start vaddr seems like the way to go here.  It's also
> > possible that some mediated device models might store the iova in the
> > command sent to the device and therefore be able to parse those entries
> > back out to unmap them without storing them separately.  This might be
> > how the s390 channel-io model would prefer to work.  That seems like
> > further validation that such tracking is going to be dependent on the
> > mediated driver itself and probably not something to centralize in a
> > mediated iommu driver.  Thanks,
> >   
> 
> Another simpler way might be to allocate an array for each memory
> region registered from user space. For a 512MB region, it means a
> 512K*4=2MB array to track the pfn or iova mapping corresponding to
> a gfn. It may consume more resources than an rb tree when not many
> pages need to be pinned, but could be less when the rb tree grows
> a lot. 

An array is only the most space-efficient structure for a fully pinned
area where we have no contiguous iova.  If we're either mapping a
larger hugepage or we have a larger contiguous iova space due to
scatter-gather mapping, or we're sparsely pinning the region, an array
can waste a lot of space.  512MB is also a pretty anemic example; 2MB
is a reasonable overhead, but 2MB per 512MB looks pretty bad when we
have a 512GB VM.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 154+ messages in thread
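For orientation, the bookkeeping being debated comes down to something like the sketch below: one entry per registered range (the thread calls it vgpu_dma) kept in an rb-tree like vfio_dma, plus per-range storage with up to one iova per page, which is where the size/PAGE_SIZE worst case and the array-overhead numbers come from. The names and layout are purely illustrative.

#include <linux/rbtree.h>
#include <linux/types.h>

struct vgpu_dma {
        struct rb_node  node;           /* rb-tree keyed by gpa, like vfio_dma */
        u64             gpa;            /* guest-physical start of the range */
        u64             size;
        unsigned long   vaddr;          /* corresponding user virtual start */
        /*
         * Per-page translations: a flat array for pin-all, or a sparse,
         * page-table-like structure for pin-sparse, indexed by
         * (gpa - this->gpa) >> PAGE_SHIFT.  At 8 bytes per 4K page this
         * is roughly 2MB of tracking per GB of pinned guest RAM.
         */
        dma_addr_t      *iova;
};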

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13 15:48                           ` [Qemu-devel] " Neo Jia
@ 2016-05-16  2:27                             ` Jike Song
  -1 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-16  2:27 UTC (permalink / raw)
  To: Neo Jia
  Cc: Alex Williamson, Tian, Kevin, Kirti Wankhede, pbonzini, kraxel,
	qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On 05/13/2016 11:48 PM, Neo Jia wrote:
> On Fri, May 13, 2016 at 05:46:17PM +0800, Jike Song wrote:
>> On 05/13/2016 04:12 AM, Neo Jia wrote:
>>> On Thu, May 12, 2016 at 01:05:52PM -0600, Alex Williamson wrote:
>>>>
>>>> If you're trying to equate the scale of what we need to track vs what
>>>> type1 currently tracks, they're significantly different.  Possible
>>>> things we need to track include the pfn, the iova, and possibly a
>>>> reference count or some sort of pinned page map.  In the pin-all model
>>>> we can assume that every page is pinned on map and unpinned on unmap,
>>>> so a reference count or map is unnecessary.  We can also assume that we
>>>> can always regenerate the pfn with get_user_pages() from the vaddr, so
>>>> we don't need to track that.  
>>>
>>> Hi Alex,
>>>
>>> Thanks for pointing this out, we will not track those in our next rev and
>>> get_user_pages will be used from the vaddr as you suggested to handle the
>>> single VM with both passthru + mediated device case.
>>>
>>
>> Just a gut feeling:
>>
>> Calling GUP every time for a particular vaddr, means locking mm->mmap_sem
>> every time for a particular process. If the VM has dozens of VCPU, which
>> is not rare, the semaphore is likely to be the bottleneck.
> 
> Hi Jike,
> 
> We do need to hold the lock of mm->mmap_sem for the VMM/QEMU process, but I
> don't quite follow the reasoning with "dozens of vcpus", one situation that I
> can think of is that we have other thread competing with the mmap_sem for the
> VMM/QEMU process within KVM kernel such as hva_to_pfn, after a quick search it
> seems only mostly gets used by iotcl "KVM_ASSIGN_PCI_DEVICE".
>

I meant that a guest writing a gfn to the GPU MMU could happen on any vcpu,
so a vmexit happens and mmap_sem is required.  But I now realize that it's
also the situation even if we store the pfn in the rbtree ..

> We will definitely conduct performance analysis with large configuration on
> servers with E5-2697 v4. :-)

My homage :)

--
Thanks,
Jike


^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13 15:50                                       ` [Qemu-devel] " Neo Jia
@ 2016-05-16  6:57                                         ` Jike Song
  -1 siblings, 0 replies; 154+ messages in thread
From: Jike Song @ 2016-05-16  6:57 UTC (permalink / raw)
  To: Neo Jia
  Cc: Tian, Kevin, Jike Song, Alex Williamson, Kirti Wankhede,
	pbonzini, kraxel, qemu-devel, kvm, Ruan, Shuai, Lv, Zhiyuan

On 05/13/2016 11:50 PM, Neo Jia wrote:
> On Fri, May 13, 2016 at 05:23:44PM +0800, Jike Song wrote:
>> On 05/13/2016 04:31 PM, Neo Jia wrote:
>>> On Fri, May 13, 2016 at 07:45:14AM +0000, Tian, Kevin wrote:
>>>>
>>>> We use page tracking framework, which is newly added to KVM recently,
>>>> to mark RAM pages as read-only so write accesses are intercepted to 
>>>> device model.
>>>
>>> Yes, I am aware of that patchset from Guangrong. So far the interface are all
>>> requiring struct *kvm, copied from https://lkml.org/lkml/2015/11/30/644
>>>
>>> - kvm_page_track_add_page(): add the page to the tracking pool after
>>>   that later specified access on that page will be tracked
>>>
>>> - kvm_page_track_remove_page(): remove the page from the tracking pool,
>>>   the specified access on the page is not tracked after the last user is
>>>   gone
>>>
>>> void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
>>>                 enum kvm_page_track_mode mode);
>>> void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
>>>                enum kvm_page_track_mode mode);
>>>
>>> Really curious how you are going to have access to the struct kvm *kvm, or you
>>> are relying on the userfaultfd to track the write faults only as part of the
>>> QEMU userfault thread?
>>>
>>
>> Hi Neo,
>>
>> For the vGPU used as a device for KVM guest, there will be interfaces
>> wrapped or implemented in KVM layer, as a rival thing diverted from
>> the interfaces for Xen. That is where the KVM related code supposed to be.
> 
> Hi Jike,
> 
> Is this discussed anywhere on the mailing list already? Sorry if I have missed
> such conversation.
>

Hi Neo,

Not exactly, but we can discuss it if necessary :)

The Intel vGPU device-model, which is a part of the i915 driver, has to be able to
emulate a vGPU for *both* XenGT and KVMGT guests. That means there must be
a divergence point somewhere, directing to Xen-specific or KVM-specific logic accordingly.


--
Thanks,
Jike

^ permalink raw reply	[flat|nested] 154+ messages in thread
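The divergence point Jike describes is typically expressed as a table of hypervisor callbacks that the device-model calls, with XenGT and KVMGT each supplying an implementation. The structure below only illustrates that shape; it is not the actual i915/GVT interface.

/* Hypothetical hypervisor-abstraction ops for a vGPU device-model. */
struct vgpu_hypervisor_ops {
        int  (*attach_vm)(void *vgpu);
        void (*detach_vm)(void *vgpu);
        int  (*read_gpa)(void *vgpu, unsigned long gpa, void *buf, int len);
        int  (*write_gpa)(void *vgpu, unsigned long gpa, void *buf, int len);
        int  (*set_wp_page)(void *vgpu, unsigned long gfn);    /* write-protect */
        int  (*unset_wp_page)(void *vgpu, unsigned long gfn);
        unsigned long (*gfn_to_pfn)(void *vgpu, unsigned long gfn);
};

/* A KVM implementation would wrap the page-track interfaces discussed
 * earlier; a Xen implementation would wrap the corresponding hypercalls. */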

* Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-13  9:05                             ` Neo Jia
@ 2016-05-19  7:28                               ` Dong Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Dong Jia @ 2016-05-19  7:28 UTC (permalink / raw)
  To: Neo Jia
  Cc: Alex Williamson, Tian, Kevin, Ruan, Shuai, Song, Jike, kvm,
	qemu-devel, Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan

On Fri, 13 May 2016 02:05:01 -0700
Neo Jia <cjia@nvidia.com> wrote:

...snip...

> 
> Hi Dong,
> 
> We should definitely be mindful about the data structure performance especially
> dealing with kernel. But for now, we haven't done any performance analysis yet
> for the current rbtree implementation, later we will definitely run it through
> large guest RAM configuration and multiple virtual devices cases, etc. to
> collect data.
> 
> Regarding your use case, may I ask if there will be concurrent command streams
> running for the same VM?
Hi Neo:

Sorry for the late response. It took some time to confirm this.

For our case, one iommu group will contain one (and only one) ccw-device.
For one ccw-device, there will be no concurrent command streams.

> If yes, those two transaction requests (if we
> implement) will compete not only the rbtree lock but also the GUP locks.
Since the answer is 'no', I guess we needn't do this. :>

> 
> Also, what is the typical guest RAM we are talking about here for your usecase
> and any rough estimation of the active working set of those DMA pages? 
> 
I'm afraid there is no typical guest RAM size for the I/O instructions
issued by the passed-through ccw-device drivers. They can use any
memory chunk allocated by a kmalloc.

The working set depends on how much memory is used by the device drivers,
and of course on the amount of available memory. Since there are no
restrictions on memory usage for this case, it varies...

[...]

--------
Dong Jia


^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-19  7:28                               ` Dong Jia
@ 2016-05-20  3:21                                 ` Tian, Kevin
  -1 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-05-20  3:21 UTC (permalink / raw)
  To: Dong Jia, Neo Jia
  Cc: Alex Williamson, Ruan, Shuai, Song, Jike, kvm, qemu-devel,
	Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan

> From: Dong Jia [mailto:bjsdjshi@linux.vnet.ibm.com]
> Sent: Thursday, May 19, 2016 3:28 PM
> 
> On Fri, 13 May 2016 02:05:01 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> ...snip...
> 
> >
> > Hi Dong,
> >
> > We should definitely be mindful about the data structure performance especially
> > dealing with kernel. But for now, we haven't done any performance analysis yet
> > for the current rbtree implementation, later we will definitely run it through
> > large guest RAM configuration and multiple virtual devices cases, etc. to
> > collect data.
> >
> > Regarding your use case, may I ask if there will be concurrent command streams
> > running for the same VM?
> Hi Neo:
> 
> Sorry for the late response. Spent some time to make the confirmation.
> 
> For our case, one iommu group will add one (and only one) ccw-device.
> For one ccw-device, there will be no concurrent command streams from it.
> 

Hi, Dong,

It looks like there can be multiple devices behind one channel, according to:
https://en.wikipedia.org/wiki/Channel_I/O

Do they need to be assigned together as one iommu group? If not, how is
the isolation done in your implementation? Based on cmd scanning on
the Qemu side?

Another curious question about channel I/O itself. I'm unclear whether the
channel here only fulfills the role of a DMA controller (i.e. controlling how
the device accesses memory), or also offloads CPU accesses to the registers
on the ccw-device. Are ccw-device registers directly addressable by the CPU
on s390, similar to the MMIO concept on x86? If yes, I guess you also need to
provide region info in vfio-ccw to control which I/O resources can be accessed
by user space (that looks to be missing from your vfio-ccw patch series). If not, how
do you control the isolation in that aspect? :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-05-20  3:21                                 ` Tian, Kevin
  (?)
@ 2016-06-06  6:59                                 ` Dong Jia
  2016-06-07  2:47                                     ` Tian, Kevin
  -1 siblings, 1 reply; 154+ messages in thread
From: Dong Jia @ 2016-06-06  6:59 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Neo Jia, Alex Williamson, Ruan, Shuai, Song, Jike, kvm,
	qemu-devel, Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan,
	Dong Jia

On Fri, 20 May 2016 03:21:31 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Dong Jia [mailto:bjsdjshi@linux.vnet.ibm.com]
> > Sent: Thursday, May 19, 2016 3:28 PM
> > 
> > On Fri, 13 May 2016 02:05:01 -0700
> > Neo Jia <cjia@nvidia.com> wrote:
> > 
> > ...snip...
> > 
> > >
> > > Hi Dong,
> > >
> > > We should definitely be mindful about the data structure performance especially
> > > dealing with kernel. But for now, we haven't done any performance analysis yet
> > > for the current rbtree implementation, later we will definitely run it through
> > > large guest RAM configuration and multiple virtual devices cases, etc. to
> > > collect data.
> > >
> > > Regarding your use case, may I ask if there will be concurrent command streams
> > > running for the same VM?
> > Hi Neo:
> > 
> > Sorry for the late response. Spent some time to make the confirmation.
> > 
> > For our case, one iommu group will add one (and only one) ccw-device.
> > For one ccw-device, there will be no concurrent command streams from it.
> > 
> 
> Hi, Dong,
> 
> Looks there can be multiple devices behind one channel, according to:
> https://en.wikipedia.org/wiki/Channel_I/O
Dear Kevin:

One subchannel (the co-processor that offloads the I/O operations) can
be assigned to one device at a time. See below.

> 
> Do they need to be assigned together as one iommu group?
So, 'N/A' to this question.

> If not, how is
> the isolation being done in your implementation? Based on cmd scanning in 
> Qemu-side?
It's a 'one device'-'one subchannel'-'one iommu group' relation then.
The isolation looks quite natural.

> 
> Another curious question about channel io itself. I'm unclear whether the 
> channel here only fulfills the role of DMA controller (i.e. controlling how
> device access memory), or also offloads CPU accesses to the registers
> on the ccw-device. Are ccw-device registers directly addressable by CPU
> on s390, similar to MMIO concept on x86? If yes, I guess you also need
> provide region info in vfio-ccw to control which I/O resource can be accessed
> by user space (looks not there in your vfio-ccw patch series). If not, how 
> do you control the isolation in that aspect? :-)
Channel I/O is quite different from PCI, so I copied some more details
here. Hope these help.

Channel subsystem:
The channel subsystem directs the flow of information between I/O devices
and main storage. It relieves CPUs of the task of communicating directly
with I/O devices and permits data processing to proceed concurrently with
I/O processing.

Channel path:
The channel subsystem uses one or more channel paths as the communication
link in managing the flow of information to or from I/O devices.

Subchannel:
Within the channel subsystem are subchannels. One subchannel of type I/O
is provided for and dedicated to each I/O device accessible to the channel
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
subsystem.

Control unit:
A control unit implements a standardized interface which translates between
the Channel Subsystem and the actual device. It does this by adapting the
characteristics of each device so that it can respond to the standard form
of control provided by the channel subsystem.

Channel Path:
The channel subsystem communicates with the I/O devices by means of
channel paths between the channel subsystem and control units.

+-------------------+
| channel subsystem |
+-------------------+
|                   |
|   +----------+    |              +--------------+    +------------+
|   |subchannel|    | channel path | control unit |    | I/O device |
|   +---------------------------------------------------------------+
|   | subchno  |    |              |              |    |    devno   |
|   +----------+    |              +--------------+    +------------+
|                   |
+-------------------+

There is no concept of ccw-device registers visible through the subchannel.
The control unit interacts with the device, collects the I/O result, and
reports the result back to the subchannel.
So it seems to me that there is no need to provide region info for
isolation. As mentioned above, the isolation is quite natural.
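
To make the 'one device'-'one subchannel'-'one iommu group' relation a bit
more concrete, below is a minimal sketch of how such a device could be given
a dedicated group and handed to VFIO. This is only an illustration against
the generic IOMMU/VFIO kernel APIs, not our actual vfio-ccw code; the
vfio_device_ops callbacks and most error handling are assumed/elided here.

#include <linux/device.h>
#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/vfio.h>

/* open/release/read/write/ioctl callbacks would go here (assumed) */
static const struct vfio_device_ops vfio_ccw_sketch_ops = {
	.name = "vfio-ccw-sketch",
};

/*
 * One ccw device sits behind exactly one subchannel, so each device can
 * simply get an iommu group of its own.
 */
static int vfio_ccw_sketch_add(struct device *dev, void *private)
{
	struct iommu_group *group;
	int ret;

	group = iommu_group_alloc();
	if (IS_ERR(group))
		return PTR_ERR(group);

	ret = iommu_group_add_device(group, dev);
	iommu_group_put(group);	/* the device keeps its own reference */
	if (ret)
		return ret;

	/* expose the device so user space (Qemu) can claim the group */
	return vfio_add_group_dev(dev, &vfio_ccw_sketch_ops, private);
}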

Please correct me in case I misunderstood some of the concepts in your
questions and gave irrelevant answers. :>

> 
> Thanks
> Kevin
> 



--------
Dong Jia


^ permalink raw reply	[flat|nested] 154+ messages in thread

* RE: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-06-06  6:59                                 ` Dong Jia
@ 2016-06-07  2:47                                     ` Tian, Kevin
  0 siblings, 0 replies; 154+ messages in thread
From: Tian, Kevin @ 2016-06-07  2:47 UTC (permalink / raw)
  To: Dong Jia
  Cc: Neo Jia, Alex Williamson, Ruan, Shuai, Song, Jike, kvm,
	qemu-devel, Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan

> From: Dong Jia
> Sent: Monday, June 06, 2016 2:59 PM
> 

[...]

> Channel I/O is quite different from PCI, so I have copied some more details
> here. Hope these help.
> 
> Channel subsystem:
> The channel subsystem directs the flow of information between I/O devices
> and main storage. It relieves CPUs of the task of communicating directly
> with I/O devices and permits data processing to proceed concurrently with
> I/O processing.
> 
> Channel path:
> The channel subsystem uses one or more channel paths as the communication
> link in managing the flow of information to or from I/O devices.
> 
> Subchannel:
> Within the channel subsystem are subchannels. One subchannel of type I/O
> is provided for and dedicated to each I/O device accessible to the channel
>                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> subsystem.
> 
> Control unit:
> A control unit implements a standardized interface which translates between
> the Channel Subsystem and the actual device. It does this by adapting the
> characteristics of each device so that it can respond to the standard form
> of control provided by the channel subsystem.
> 
> Channel Path:
> The channel subsystem communicates with the I/O devices by means of
> channel paths between the channel subsystem and control units.
> 
> +-------------------+
> | channel subsystem |
> +-------------------+
> |                   |
> |   +----------+    |              +--------------+    +------------+
> |   |subchannel|    | channel path | control unit |    | I/O device |
> |   +---------------------------------------------------------------+
> |   | subchno  |    |              |              |    |    devno   |
> |   +----------+    |              +--------------+    +------------+
> |                   |
> +-------------------+
> 
> There is no concept of ccw-device registers visible through the subchannel.
> The control unit interacts with the device, collects the I/O result, and
> reports the result back to the subchannel.
> So it seems to me that there is no need to provide region info for
> isolation. As mentioned above, the isolation is quite natural.
> 
> Please correct me in case I misunderstood some of the concepts in your
> questions and gave irrelevant answers. :>
> 

Thanks for the above background, which is very useful. A few follow-up questions:

1) Does it mean that VFIO is managing resources at the subchannel level, so Qemu
can only operate on subchannels assigned to itself (and then emulate the complete
channel I/O sub-system for the guest)?

2) How are ccw commands associated with a subchannel? Are they submitted
through a dedicated subchannel interface (so VFIO can easily map that interface)
or is the subchannel specified by a special ccw cmd (meaning VFIO-ccw needs
to scan cmds to avoid malicious attempts on non-assigned subchannels)?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 154+ messages in thread

* Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu
  2016-06-07  2:47                                     ` Tian, Kevin
  (?)
@ 2016-06-07  7:04                                     ` Dong Jia
  -1 siblings, 0 replies; 154+ messages in thread
From: Dong Jia @ 2016-06-07  7:04 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Neo Jia, Alex Williamson, Ruan, Shuai, Song, Jike, kvm,
	qemu-devel, Kirti Wankhede, kraxel, pbonzini, Lv, Zhiyuan,
	Dong Jia

On Tue, 7 Jun 2016 02:47:10 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Dong Jia
> > Sent: Monday, June 06, 2016 2:59 PM
> > 
> 
> [...]
> 
> > Channel I/O is quite different from PCI, so I have copied some more details
> > here. Hope these help.
> > 
> > Channel subsystem:
> > The channel subsystem directs the flow of information between I/O devices
> > and main storage. It relieves CPUs of the task of communicating directly
> > with I/O devices and permits data processing to proceed concurrently with
> > I/O processing.
> > 
> > Channel path:
> > The channel subsystem uses one or more channel paths as the communication
> > link in managing the flow of information to or from I/O devices.
> > 
> > Subchannel:
> > Within the channel subsystem are subchannels. One subchannel of type I/O
> > is provided for and dedicated to each I/O device accessible to the channel
> >                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > subsystem.
> > 
> > Control unit:
> > A control unit implements a standardized interface which translates between
> > the Channel Subsystem and the actual device. It does this by adapting the
> > characteristics of each device so that it can respond to the standard form
> > of control provided by the channel subsystem.
> > 
> > Channel Path:
> > The channel subsystem communicates with the I/O devices by means of
> > channel paths between the channel subsystem and control units.
> > 
> > +-------------------+
> > | channel subsystem |
> > +-------------------+
> > |                   |
> > |   +----------+    |              +--------------+    +------------+
> > |   |subchannel|    | channel path | control unit |    | I/O device |
> > |   +---------------------------------------------------------------+
> > |   | subchno  |    |              |              |    |    devno   |
> > |   +----------+    |              +--------------+    +------------+
> > |                   |
> > +-------------------+
> > 
> > There is no concept of ccw-device registers visible through the subchannel.
> > The control unit interacts with the device, collects the I/O result, and
> > reports the result back to the subchannel.
> > So it seems to me that there is no need to provide region info for
> > isolation. As mentioned above, the isolation is quite natural.
> > 
> > Please correct me in case I misunderstood some of the concepts in your
> > questions and gave irrelevant answers. :>
> > 
> 
> Thanks for the above background, which is very useful. A few follow-up questions:
> 
> 1) Does it mean that VFIO is managing resources at the subchannel level, so Qemu
> can only operate on subchannels assigned to itself
Dear Kevin,

This understanding is basically right, but not exactly.

Linux creates a 'struct ccw_device' for each device it has detected, and
a 'struct subchannel' for the corresponding subchannel. When we issue
a command to a device instance, the device can find the subchannel and
pass the command to it.
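
(Just to illustrate that relation, here is a very rough sketch. The real
'struct ccw_device' and 'struct subchannel' definitions live in the s390
common I/O layer and are much larger; the fields and the helper below are
purely assumptions for illustration.)

#include <linux/types.h>

/* Illustration only, not the real s390 cio definitions. */
struct sketch_subchannel {
	u16 subchno;			/* subchannel number, target of ssch() */
	/* schib, path masks, locks, ... */
};

struct sketch_ccw_device {
	u16 devno;			/* device number that the admin sees */
	struct sketch_subchannel *sch;	/* the one subchannel dedicated to it */
};

/* stand-in for issuing the start-subchannel (SSCH) instruction */
static int sketch_ssch(u16 subchno, void *orb)
{
	return 0;
}

/*
 * Starting I/O on a device always resolves to a start on its dedicated
 * subchannel, which is why managing at the device level and at the
 * subchannel level end up looking so similar.
 */
static int sketch_ccw_device_start(struct sketch_ccw_device *cdev, void *orb)
{
	return sketch_ssch(cdev->sch->subchno, orb);
}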

The current vfio-ccw implementation targets device passthrough. So I'd say
that VFIO is managing resources at both the device level and the
subchannel level.

However, there is a discussion inside our team about whether subchannel
passthrough would be better. So, in the future, it's possible that the
management will be done at the subchannel level only.

> (and then emulate the complete channel I/O sub-system for the guest)?
This is right.

> 
> 2) How are ccw commands associated with a subchannel?
The I/O instruction requires a subchannel id and an ORB
(Operation-Request Block), which contains the execution parameters,
including the address of the ccws. So when an I/O instruction is
intercepted, we know which subchannel is the target. Using this
target information, we can find the real physical device and
subchannel on which to perform the instruction.

> Are they submitted
> through a dedicated subchannel interface (so VFIO can easily map that interface)
We can understand it this way.

> or is the subchannel specified by a special ccw cmd (meaning VFIO-ccw needs
> to scan cmds to avoid malicious attempts on non-assigned subchannels)?
No. CCWs themselves don't contain subchannel information.
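
In case it helps, here is a rough sketch of what such an intercepted
request carries. The layouts are simplified from the description above;
the structure and field names are illustrative only, not the architected
or kernel definitions.

#include <linux/types.h>

struct sketch_ccw {			/* one channel command word */
	u8	cmd_code;		/* e.g. read, write, control */
	u8	flags;			/* chaining, indirect addressing, ... */
	u16	count;			/* byte count of the data area */
	u32	data_addr;		/* guest address of the data area */
} __attribute__((packed));

struct sketch_orb {			/* operation-request block, trimmed */
	u32	intparm;		/* interruption parameter */
	u32	ctrl;			/* key, format, prefetch, ... */
	u32	cpa;			/* guest address of the first CCW */
};

/*
 * The I/O instruction names the target subchannel and points at an ORB,
 * so an intercepted request already identifies which passed-through
 * subchannel it is for; no scanning of the CCWs themselves is needed.
 */
static int sketch_handle_io_instruction(u32 guest_subchno,
					struct sketch_orb *orb)
{
	/* 1. look up the host subchannel assigned for guest_subchno     */
	/* 2. translate orb->cpa and each ccw data_addr (guest to host)  */
	/* 3. issue the channel program on the real subchannel           */
	return 0;	/* placeholder */
}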

> 
> Thanks
> Kevin
> 

--------
Dong Jia


^ permalink raw reply	[flat|nested] 154+ messages in thread

end of thread, other threads:[~2016-06-07  7:04 UTC | newest]

Thread overview: 154+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-02 18:40 [RFC PATCH v3 0/3] Add vGPU support Kirti Wankhede
2016-05-02 18:40 ` [Qemu-devel] " Kirti Wankhede
2016-05-02 18:40 ` [RFC PATCH v3 1/3] vGPU Core driver Kirti Wankhede
2016-05-02 18:40   ` [Qemu-devel] " Kirti Wankhede
2016-05-03 22:43   ` Alex Williamson
2016-05-03 22:43     ` [Qemu-devel] " Alex Williamson
2016-05-04  2:45     ` Tian, Kevin
2016-05-04  2:45       ` [Qemu-devel] " Tian, Kevin
2016-05-04 16:57       ` Alex Williamson
2016-05-04 16:57         ` [Qemu-devel] " Alex Williamson
2016-05-05  8:58         ` Tian, Kevin
2016-05-05  8:58           ` [Qemu-devel] " Tian, Kevin
2016-05-04  2:58     ` Tian, Kevin
2016-05-04  2:58       ` [Qemu-devel] " Tian, Kevin
2016-05-12  8:22       ` Tian, Kevin
2016-05-12  8:22         ` [Qemu-devel] " Tian, Kevin
2016-05-04 13:31     ` Kirti Wankhede
2016-05-04 13:31       ` [Qemu-devel] " Kirti Wankhede
2016-05-05  9:06       ` Tian, Kevin
2016-05-05  9:06         ` [Qemu-devel] " Tian, Kevin
2016-05-05 10:44         ` Kirti Wankhede
2016-05-05 10:44           ` [Qemu-devel] " Kirti Wankhede
2016-05-05 12:07           ` Tian, Kevin
2016-05-05 12:07             ` [Qemu-devel] " Tian, Kevin
2016-05-05 12:57             ` Kirti Wankhede
2016-05-05 12:57               ` [Qemu-devel] " Kirti Wankhede
2016-05-11  6:37               ` Tian, Kevin
2016-05-11  6:37                 ` [Qemu-devel] " Tian, Kevin
2016-05-06 12:14         ` Jike Song
2016-05-06 12:14           ` [Qemu-devel] " Jike Song
2016-05-06 16:16           ` Kirti Wankhede
2016-05-06 16:16             ` [Qemu-devel] " Kirti Wankhede
2016-05-09 12:12             ` Jike Song
2016-05-09 12:12               ` [Qemu-devel] " Jike Song
2016-05-02 18:40 ` [RFC PATCH v3 2/3] VFIO driver for vGPU device Kirti Wankhede
2016-05-02 18:40   ` [Qemu-devel] " Kirti Wankhede
2016-05-03 22:43   ` Alex Williamson
2016-05-03 22:43     ` [Qemu-devel] " Alex Williamson
2016-05-04  3:23     ` Tian, Kevin
2016-05-04  3:23       ` [Qemu-devel] " Tian, Kevin
2016-05-04 17:06       ` Alex Williamson
2016-05-04 17:06         ` [Qemu-devel] " Alex Williamson
2016-05-04 21:14         ` Neo Jia
2016-05-04 21:14           ` [Qemu-devel] " Neo Jia
2016-05-05  4:42           ` Kirti Wankhede
2016-05-05  4:42             ` [Qemu-devel] " Kirti Wankhede
2016-05-05  9:24         ` Tian, Kevin
2016-05-05  9:24           ` [Qemu-devel] " Tian, Kevin
2016-05-05 20:27           ` Neo Jia
2016-05-05 20:27             ` [Qemu-devel] " Neo Jia
2016-05-11  6:45         ` Tian, Kevin
2016-05-11  6:45           ` [Qemu-devel] " Tian, Kevin
2016-05-11 20:10           ` Alex Williamson
2016-05-11 20:10             ` [Qemu-devel] " Alex Williamson
2016-05-12  0:59             ` Tian, Kevin
2016-05-12  0:59               ` [Qemu-devel] " Tian, Kevin
2016-05-04 16:25     ` Kirti Wankhede
2016-05-04 16:25       ` Kirti Wankhede
2016-05-02 18:40 ` [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu Kirti Wankhede
2016-05-02 18:40   ` [Qemu-devel] " Kirti Wankhede
2016-05-03 10:40   ` Jike Song
2016-05-03 10:40     ` [Qemu-devel] " Jike Song
2016-05-03 22:43   ` Alex Williamson
2016-05-03 22:43     ` [Qemu-devel] " Alex Williamson
2016-05-04  3:39     ` Tian, Kevin
2016-05-04  3:39       ` [Qemu-devel] " Tian, Kevin
2016-05-05  6:55     ` Jike Song
2016-05-05  6:55       ` [Qemu-devel] " Jike Song
2016-05-05  9:27       ` Tian, Kevin
2016-05-05  9:27         ` [Qemu-devel] " Tian, Kevin
2016-05-10  7:52         ` Jike Song
2016-05-10  7:52           ` [Qemu-devel] " Jike Song
2016-05-10 16:02           ` Neo Jia
2016-05-10 16:02             ` [Qemu-devel] " Neo Jia
2016-05-11  9:15             ` Jike Song
2016-05-11  9:15               ` [Qemu-devel] " Jike Song
2016-05-11 22:06               ` Alex Williamson
2016-05-11 22:06                 ` [Qemu-devel] " Alex Williamson
2016-05-12  4:11                 ` Jike Song
2016-05-12  4:11                   ` [Qemu-devel] " Jike Song
2016-05-12 19:49                   ` Neo Jia
2016-05-12 19:49                     ` [Qemu-devel] " Neo Jia
2016-05-13  2:41                     ` Tian, Kevin
2016-05-13  2:41                       ` [Qemu-devel] " Tian, Kevin
2016-05-13  6:22                       ` Jike Song
2016-05-13  6:22                         ` [Qemu-devel] " Jike Song
2016-05-13  6:43                         ` Neo Jia
2016-05-13  6:43                           ` [Qemu-devel] " Neo Jia
2016-05-13  7:30                           ` Jike Song
2016-05-13  7:30                             ` [Qemu-devel] " Jike Song
2016-05-13  7:42                             ` Neo Jia
2016-05-13  7:42                               ` [Qemu-devel] " Neo Jia
2016-05-13  7:45                               ` Tian, Kevin
2016-05-13  7:45                                 ` [Qemu-devel] " Tian, Kevin
2016-05-13  8:31                                 ` Neo Jia
2016-05-13  8:31                                   ` [Qemu-devel] " Neo Jia
2016-05-13  9:23                                   ` Jike Song
2016-05-13  9:23                                     ` [Qemu-devel] " Jike Song
2016-05-13 15:50                                     ` Neo Jia
2016-05-13 15:50                                       ` [Qemu-devel] " Neo Jia
2016-05-16  6:57                                       ` Jike Song
2016-05-16  6:57                                         ` [Qemu-devel] " Jike Song
2016-05-13  6:08                     ` Jike Song
2016-05-13  6:08                       ` [Qemu-devel] " Jike Song
2016-05-13  6:41                       ` Neo Jia
2016-05-13  6:41                         ` [Qemu-devel] " Neo Jia
2016-05-13  7:13                         ` Tian, Kevin
2016-05-13  7:13                           ` [Qemu-devel] " Tian, Kevin
2016-05-13  7:38                           ` Neo Jia
2016-05-13  7:38                             ` [Qemu-devel] " Neo Jia
2016-05-13  8:02                             ` Tian, Kevin
2016-05-13  8:02                               ` [Qemu-devel] " Tian, Kevin
2016-05-13  8:41                               ` Neo Jia
2016-05-13  8:41                                 ` [Qemu-devel] " Neo Jia
2016-05-12  8:00                 ` Tian, Kevin
2016-05-12  8:00                   ` [Qemu-devel] " Tian, Kevin
2016-05-12 19:05                   ` Alex Williamson
2016-05-12 19:05                     ` [Qemu-devel] " Alex Williamson
2016-05-12 20:12                     ` Neo Jia
2016-05-12 20:12                       ` [Qemu-devel] " Neo Jia
2016-05-13  9:46                       ` Jike Song
2016-05-13  9:46                         ` [Qemu-devel] " Jike Song
2016-05-13 15:48                         ` Neo Jia
2016-05-13 15:48                           ` [Qemu-devel] " Neo Jia
2016-05-16  2:27                           ` Jike Song
2016-05-16  2:27                             ` [Qemu-devel] " Jike Song
2016-05-13  3:55                     ` Tian, Kevin
2016-05-13  3:55                       ` [Qemu-devel] " Tian, Kevin
2016-05-13 16:16                       ` Alex Williamson
2016-05-13 16:16                         ` [Qemu-devel] " Alex Williamson
2016-05-13  7:10                     ` Dong Jia
2016-05-13  7:10                       ` Dong Jia
2016-05-13  7:24                       ` Neo Jia
2016-05-13  7:24                         ` Neo Jia
2016-05-13  8:39                         ` Dong Jia
2016-05-13  8:39                           ` [Qemu-devel] " Dong Jia
2016-05-13  9:05                           ` Neo Jia
2016-05-13  9:05                             ` Neo Jia
2016-05-19  7:28                             ` Dong Jia
2016-05-19  7:28                               ` Dong Jia
2016-05-20  3:21                               ` Tian, Kevin
2016-05-20  3:21                                 ` Tian, Kevin
2016-06-06  6:59                                 ` Dong Jia
2016-06-07  2:47                                   ` Tian, Kevin
2016-06-07  2:47                                     ` Tian, Kevin
2016-06-07  7:04                                     ` Dong Jia
2016-05-05  7:51     ` Kirti Wankhede
2016-05-05  7:51       ` [Qemu-devel] " Kirti Wankhede
2016-05-04  1:05 ` [RFC PATCH v3 0/3] Add vGPU support Tian, Kevin
2016-05-04  1:05   ` [Qemu-devel] " Tian, Kevin
2016-05-04  6:17   ` Neo Jia
2016-05-04  6:17     ` [Qemu-devel] " Neo Jia
2016-05-04 17:07     ` Alex Williamson
2016-05-04 17:07       ` [Qemu-devel] " Alex Williamson
