linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/2] VFIO no-iommu
@ 2015-10-09 18:40 Alex Williamson
  2015-10-09 18:41 ` [RFC PATCH 1/2] vfio: Move vfio.c vfio_core.c Alex Williamson
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Alex Williamson @ 2015-10-09 18:40 UTC (permalink / raw)
  To: alex.williamson
  Cc: avi, avi, gleb, corbet, bruce.richardson, mst, linux-kernel,
	alexander.duyck, gleb, stephen, vladz, iommu, hjk, gregkh

Recent patches for UIO have attempted to add MSI/X support, which
unfortunately implies DMA support; users have been enabling DMA
anyway, but it was never intended for UIO.  VFIO, on the other hand,
expects an IOMMU to provide isolation of devices, but offers a much
more complete device interface, one that already includes full MSI/X
support.  There's really no way to support userspace drivers for
DMA-capable devices without an IOMMU to protect the host, but we can
at least think about doing it in a way that properly taints the
kernel and avoids adding new code that duplicates existing code with
a supportable use case.

The diffstat is only so large because I moved vfio.c to vfio_core.c
so I could more easily keep the module named vfio.ko while keeping
the bulk of the no-iommu support in a separate file that can be
optionally compiled.  We're really looking at a couple hundred lines
of mostly stub code.  The VFIO_NOIOMMU_IOMMU could certainly be
expanded to do page pinning and virt_to_bus() translation, but I
didn't want to complicate anything yet.

I've only compiled this and tested loading the module with the new
no-iommu mode enabled, I haven't actually tried to port a DPDK
driver to it, though it ought to be a pretty obvious mix of the
existing UIO and VFIO versions (set the IOMMU, but avoid using it
for mapping; obtain bus translations however UIO does).  The core
vfio device file is still /dev/vfio/vfio, but all the groups become
/dev/vfio-noiommu/$GROUP.

It should be obvious, but I always feel obligated to state that this
does not and will not ever enable device assignment to virtual
machines on non-IOMMU capable platforms.

I'm curious what IOMMU folks think of this.  This hack is really
only possible because we don't use iommu_ops for regular DMA, so we
can hijack it fairly safely.  I believe that's intended to change
though, so this may not be practical long term.  Thanks,

Alex

---

Alex Williamson (2):
      vfio: Move vfio.c vfio_core.c
      vfio: Include no-iommu mode


 drivers/vfio/Kconfig        |   15 
 drivers/vfio/Makefile       |    4 
 drivers/vfio/vfio.c         | 1640 ------------------------------------------
 drivers/vfio/vfio_core.c    | 1680 +++++++++++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio_noiommu.c |  185 +++++
 drivers/vfio/vfio_private.h |   31 +
 include/uapi/linux/vfio.h   |    2 
 7 files changed, 1917 insertions(+), 1640 deletions(-)
 delete mode 100644 drivers/vfio/vfio.c
 create mode 100644 drivers/vfio/vfio_core.c
 create mode 100644 drivers/vfio/vfio_noiommu.c
 create mode 100644 drivers/vfio/vfio_private.h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH 1/2] vfio: Move vfio.c vfio_core.c
  2015-10-09 18:40 [RFC PATCH 0/2] VFIO no-iommu Alex Williamson
@ 2015-10-09 18:41 ` Alex Williamson
  2015-10-09 19:21   ` Greg KH
  2015-10-09 18:41 ` [RFC PATCH 2/2] vfio: Include no-iommu mode Alex Williamson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2015-10-09 18:41 UTC (permalink / raw)
  To: alex.williamson
  Cc: avi, avi, gleb, corbet, bruce.richardson, mst, linux-kernel,
	alexander.duyck, gleb, stephen, vladz, iommu, hjk, gregkh

This allows us to more easily create a "vfio" module that includes
multiple files.  No code change, rename and Makefile update only.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/Makefile    |    1 
 drivers/vfio/vfio.c      | 1640 ----------------------------------------------
 drivers/vfio/vfio_core.c | 1640 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1641 insertions(+), 1640 deletions(-)
 delete mode 100644 drivers/vfio/vfio.c
 create mode 100644 drivers/vfio/vfio_core.c

diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f..e8fa248 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,4 +1,5 @@
 vfio_virqfd-y := virqfd.o
+vfio-y := vfio_core.o
 
 obj-$(CONFIG_VFIO) += vfio.o
 obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
deleted file mode 100644
index 563c510..0000000
--- a/drivers/vfio/vfio.c
+++ /dev/null
@@ -1,1640 +0,0 @@
-/*
- * VFIO core
- *
- * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
- *     Author: Alex Williamson <alex.williamson@redhat.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- * Derived from original vfio:
- * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
- * Author: Tom Lyon, pugs@cisco.com
- */
-
-#include <linux/cdev.h>
-#include <linux/compat.h>
-#include <linux/device.h>
-#include <linux/file.h>
-#include <linux/anon_inodes.h>
-#include <linux/fs.h>
-#include <linux/idr.h>
-#include <linux/iommu.h>
-#include <linux/list.h>
-#include <linux/miscdevice.h>
-#include <linux/module.h>
-#include <linux/mutex.h>
-#include <linux/rwsem.h>
-#include <linux/sched.h>
-#include <linux/slab.h>
-#include <linux/stat.h>
-#include <linux/string.h>
-#include <linux/uaccess.h>
-#include <linux/vfio.h>
-#include <linux/wait.h>
-
-#define DRIVER_VERSION	"0.3"
-#define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
-#define DRIVER_DESC	"VFIO - User Level meta-driver"
-
-static struct vfio {
-	struct class			*class;
-	struct list_head		iommu_drivers_list;
-	struct mutex			iommu_drivers_lock;
-	struct list_head		group_list;
-	struct idr			group_idr;
-	struct mutex			group_lock;
-	struct cdev			group_cdev;
-	dev_t				group_devt;
-	wait_queue_head_t		release_q;
-} vfio;
-
-struct vfio_iommu_driver {
-	const struct vfio_iommu_driver_ops	*ops;
-	struct list_head			vfio_next;
-};
-
-struct vfio_container {
-	struct kref			kref;
-	struct list_head		group_list;
-	struct rw_semaphore		group_lock;
-	struct vfio_iommu_driver	*iommu_driver;
-	void				*iommu_data;
-};
-
-struct vfio_unbound_dev {
-	struct device			*dev;
-	struct list_head		unbound_next;
-};
-
-struct vfio_group {
-	struct kref			kref;
-	int				minor;
-	atomic_t			container_users;
-	struct iommu_group		*iommu_group;
-	struct vfio_container		*container;
-	struct list_head		device_list;
-	struct mutex			device_lock;
-	struct device			*dev;
-	struct notifier_block		nb;
-	struct list_head		vfio_next;
-	struct list_head		container_next;
-	struct list_head		unbound_list;
-	struct mutex			unbound_lock;
-	atomic_t			opened;
-};
-
-struct vfio_device {
-	struct kref			kref;
-	struct device			*dev;
-	const struct vfio_device_ops	*ops;
-	struct vfio_group		*group;
-	struct list_head		group_next;
-	void				*device_data;
-};
-
-/**
- * IOMMU driver registration
- */
-int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops)
-{
-	struct vfio_iommu_driver *driver, *tmp;
-
-	driver = kzalloc(sizeof(*driver), GFP_KERNEL);
-	if (!driver)
-		return -ENOMEM;
-
-	driver->ops = ops;
-
-	mutex_lock(&vfio.iommu_drivers_lock);
-
-	/* Check for duplicates */
-	list_for_each_entry(tmp, &vfio.iommu_drivers_list, vfio_next) {
-		if (tmp->ops == ops) {
-			mutex_unlock(&vfio.iommu_drivers_lock);
-			kfree(driver);
-			return -EINVAL;
-		}
-	}
-
-	list_add(&driver->vfio_next, &vfio.iommu_drivers_list);
-
-	mutex_unlock(&vfio.iommu_drivers_lock);
-
-	return 0;
-}
-EXPORT_SYMBOL_GPL(vfio_register_iommu_driver);
-
-void vfio_unregister_iommu_driver(const struct vfio_iommu_driver_ops *ops)
-{
-	struct vfio_iommu_driver *driver;
-
-	mutex_lock(&vfio.iommu_drivers_lock);
-	list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
-		if (driver->ops == ops) {
-			list_del(&driver->vfio_next);
-			mutex_unlock(&vfio.iommu_drivers_lock);
-			kfree(driver);
-			return;
-		}
-	}
-	mutex_unlock(&vfio.iommu_drivers_lock);
-}
-EXPORT_SYMBOL_GPL(vfio_unregister_iommu_driver);
-
-/**
- * Group minor allocation/free - both called with vfio.group_lock held
- */
-static int vfio_alloc_group_minor(struct vfio_group *group)
-{
-	return idr_alloc(&vfio.group_idr, group, 0, MINORMASK + 1, GFP_KERNEL);
-}
-
-static void vfio_free_group_minor(int minor)
-{
-	idr_remove(&vfio.group_idr, minor);
-}
-
-static int vfio_iommu_group_notifier(struct notifier_block *nb,
-				     unsigned long action, void *data);
-static void vfio_group_get(struct vfio_group *group);
-
-/**
- * Container objects - containers are created when /dev/vfio/vfio is
- * opened, but their lifecycle extends until the last user is done, so
- * it's freed via kref.  Must support container/group/device being
- * closed in any order.
- */
-static void vfio_container_get(struct vfio_container *container)
-{
-	kref_get(&container->kref);
-}
-
-static void vfio_container_release(struct kref *kref)
-{
-	struct vfio_container *container;
-	container = container_of(kref, struct vfio_container, kref);
-
-	kfree(container);
-}
-
-static void vfio_container_put(struct vfio_container *container)
-{
-	kref_put(&container->kref, vfio_container_release);
-}
-
-static void vfio_group_unlock_and_free(struct vfio_group *group)
-{
-	mutex_unlock(&vfio.group_lock);
-	/*
-	 * Unregister outside of lock.  A spurious callback is harmless now
-	 * that the group is no longer in vfio.group_list.
-	 */
-	iommu_group_unregister_notifier(group->iommu_group, &group->nb);
-	kfree(group);
-}
-
-/**
- * Group objects - create, release, get, put, search
- */
-static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group)
-{
-	struct vfio_group *group, *tmp;
-	struct device *dev;
-	int ret, minor;
-
-	group = kzalloc(sizeof(*group), GFP_KERNEL);
-	if (!group)
-		return ERR_PTR(-ENOMEM);
-
-	kref_init(&group->kref);
-	INIT_LIST_HEAD(&group->device_list);
-	mutex_init(&group->device_lock);
-	INIT_LIST_HEAD(&group->unbound_list);
-	mutex_init(&group->unbound_lock);
-	atomic_set(&group->container_users, 0);
-	atomic_set(&group->opened, 0);
-	group->iommu_group = iommu_group;
-
-	group->nb.notifier_call = vfio_iommu_group_notifier;
-
-	/*
-	 * blocking notifiers acquire a rwsem around registering and hold
-	 * it around callback.  Therefore, need to register outside of
-	 * vfio.group_lock to avoid A-B/B-A contention.  Our callback won't
-	 * do anything unless it can find the group in vfio.group_list, so
-	 * no harm in registering early.
-	 */
-	ret = iommu_group_register_notifier(iommu_group, &group->nb);
-	if (ret) {
-		kfree(group);
-		return ERR_PTR(ret);
-	}
-
-	mutex_lock(&vfio.group_lock);
-
-	/* Did we race creating this group? */
-	list_for_each_entry(tmp, &vfio.group_list, vfio_next) {
-		if (tmp->iommu_group == iommu_group) {
-			vfio_group_get(tmp);
-			vfio_group_unlock_and_free(group);
-			return tmp;
-		}
-	}
-
-	minor = vfio_alloc_group_minor(group);
-	if (minor < 0) {
-		vfio_group_unlock_and_free(group);
-		return ERR_PTR(minor);
-	}
-
-	dev = device_create(vfio.class, NULL,
-			    MKDEV(MAJOR(vfio.group_devt), minor),
-			    group, "%d", iommu_group_id(iommu_group));
-	if (IS_ERR(dev)) {
-		vfio_free_group_minor(minor);
-		vfio_group_unlock_and_free(group);
-		return (struct vfio_group *)dev; /* ERR_PTR */
-	}
-
-	group->minor = minor;
-	group->dev = dev;
-
-	list_add(&group->vfio_next, &vfio.group_list);
-
-	mutex_unlock(&vfio.group_lock);
-
-	return group;
-}
-
-/* called with vfio.group_lock held */
-static void vfio_group_release(struct kref *kref)
-{
-	struct vfio_group *group = container_of(kref, struct vfio_group, kref);
-	struct vfio_unbound_dev *unbound, *tmp;
-	struct iommu_group *iommu_group = group->iommu_group;
-
-	WARN_ON(!list_empty(&group->device_list));
-
-	list_for_each_entry_safe(unbound, tmp,
-				 &group->unbound_list, unbound_next) {
-		list_del(&unbound->unbound_next);
-		kfree(unbound);
-	}
-
-	device_destroy(vfio.class, MKDEV(MAJOR(vfio.group_devt), group->minor));
-	list_del(&group->vfio_next);
-	vfio_free_group_minor(group->minor);
-	vfio_group_unlock_and_free(group);
-	iommu_group_put(iommu_group);
-}
-
-static void vfio_group_put(struct vfio_group *group)
-{
-	kref_put_mutex(&group->kref, vfio_group_release, &vfio.group_lock);
-}
-
-/* Assume group_lock or group reference is held */
-static void vfio_group_get(struct vfio_group *group)
-{
-	kref_get(&group->kref);
-}
-
-/*
- * Not really a try as we will sleep for mutex, but we need to make
- * sure the group pointer is valid under lock and get a reference.
- */
-static struct vfio_group *vfio_group_try_get(struct vfio_group *group)
-{
-	struct vfio_group *target = group;
-
-	mutex_lock(&vfio.group_lock);
-	list_for_each_entry(group, &vfio.group_list, vfio_next) {
-		if (group == target) {
-			vfio_group_get(group);
-			mutex_unlock(&vfio.group_lock);
-			return group;
-		}
-	}
-	mutex_unlock(&vfio.group_lock);
-
-	return NULL;
-}
-
-static
-struct vfio_group *vfio_group_get_from_iommu(struct iommu_group *iommu_group)
-{
-	struct vfio_group *group;
-
-	mutex_lock(&vfio.group_lock);
-	list_for_each_entry(group, &vfio.group_list, vfio_next) {
-		if (group->iommu_group == iommu_group) {
-			vfio_group_get(group);
-			mutex_unlock(&vfio.group_lock);
-			return group;
-		}
-	}
-	mutex_unlock(&vfio.group_lock);
-
-	return NULL;
-}
-
-static struct vfio_group *vfio_group_get_from_minor(int minor)
-{
-	struct vfio_group *group;
-
-	mutex_lock(&vfio.group_lock);
-	group = idr_find(&vfio.group_idr, minor);
-	if (!group) {
-		mutex_unlock(&vfio.group_lock);
-		return NULL;
-	}
-	vfio_group_get(group);
-	mutex_unlock(&vfio.group_lock);
-
-	return group;
-}
-
-/**
- * Device objects - create, release, get, put, search
- */
-static
-struct vfio_device *vfio_group_create_device(struct vfio_group *group,
-					     struct device *dev,
-					     const struct vfio_device_ops *ops,
-					     void *device_data)
-{
-	struct vfio_device *device;
-
-	device = kzalloc(sizeof(*device), GFP_KERNEL);
-	if (!device)
-		return ERR_PTR(-ENOMEM);
-
-	kref_init(&device->kref);
-	device->dev = dev;
-	device->group = group;
-	device->ops = ops;
-	device->device_data = device_data;
-	dev_set_drvdata(dev, device);
-
-	/* No need to get group_lock, caller has group reference */
-	vfio_group_get(group);
-
-	mutex_lock(&group->device_lock);
-	list_add(&device->group_next, &group->device_list);
-	mutex_unlock(&group->device_lock);
-
-	return device;
-}
-
-static void vfio_device_release(struct kref *kref)
-{
-	struct vfio_device *device = container_of(kref,
-						  struct vfio_device, kref);
-	struct vfio_group *group = device->group;
-
-	list_del(&device->group_next);
-	mutex_unlock(&group->device_lock);
-
-	dev_set_drvdata(device->dev, NULL);
-
-	kfree(device);
-
-	/* vfio_del_group_dev may be waiting for this device */
-	wake_up(&vfio.release_q);
-}
-
-/* Device reference always implies a group reference */
-void vfio_device_put(struct vfio_device *device)
-{
-	struct vfio_group *group = device->group;
-	kref_put_mutex(&device->kref, vfio_device_release, &group->device_lock);
-	vfio_group_put(group);
-}
-EXPORT_SYMBOL_GPL(vfio_device_put);
-
-static void vfio_device_get(struct vfio_device *device)
-{
-	vfio_group_get(device->group);
-	kref_get(&device->kref);
-}
-
-static struct vfio_device *vfio_group_get_device(struct vfio_group *group,
-						 struct device *dev)
-{
-	struct vfio_device *device;
-
-	mutex_lock(&group->device_lock);
-	list_for_each_entry(device, &group->device_list, group_next) {
-		if (device->dev == dev) {
-			vfio_device_get(device);
-			mutex_unlock(&group->device_lock);
-			return device;
-		}
-	}
-	mutex_unlock(&group->device_lock);
-	return NULL;
-}
-
-/*
- * Whitelist some drivers that we know are safe (no dma) or just sit on
- * a device.  It's not always practical to leave a device within a group
- * driverless as it could get re-bound to something unsafe.
- */
-static const char * const vfio_driver_whitelist[] = { "pci-stub", "pcieport" };
-
-static bool vfio_whitelisted_driver(struct device_driver *drv)
-{
-	int i;
-
-	for (i = 0; i < ARRAY_SIZE(vfio_driver_whitelist); i++) {
-		if (!strcmp(drv->name, vfio_driver_whitelist[i]))
-			return true;
-	}
-
-	return false;
-}
-
-/*
- * A vfio group is viable for use by userspace if all devices are in
- * one of the following states:
- *  - driver-less
- *  - bound to a vfio driver
- *  - bound to a whitelisted driver
- *
- * We use two methods to determine whether a device is bound to a vfio
- * driver.  The first is to test whether the device exists in the vfio
- * group.  The second is to test if the device exists on the group
- * unbound_list, indicating it's in the middle of transitioning from
- * a vfio driver to driver-less.
- */
-static int vfio_dev_viable(struct device *dev, void *data)
-{
-	struct vfio_group *group = data;
-	struct vfio_device *device;
-	struct device_driver *drv = ACCESS_ONCE(dev->driver);
-	struct vfio_unbound_dev *unbound;
-	int ret = -EINVAL;
-
-	mutex_lock(&group->unbound_lock);
-	list_for_each_entry(unbound, &group->unbound_list, unbound_next) {
-		if (dev == unbound->dev) {
-			ret = 0;
-			break;
-		}
-	}
-	mutex_unlock(&group->unbound_lock);
-
-	if (!ret || !drv || vfio_whitelisted_driver(drv))
-		return 0;
-
-	device = vfio_group_get_device(group, dev);
-	if (device) {
-		vfio_device_put(device);
-		return 0;
-	}
-
-	return ret;
-}
-
-/**
- * Async device support
- */
-static int vfio_group_nb_add_dev(struct vfio_group *group, struct device *dev)
-{
-	struct vfio_device *device;
-
-	/* Do we already know about it?  We shouldn't */
-	device = vfio_group_get_device(group, dev);
-	if (WARN_ON_ONCE(device)) {
-		vfio_device_put(device);
-		return 0;
-	}
-
-	/* Nothing to do for idle groups */
-	if (!atomic_read(&group->container_users))
-		return 0;
-
-	/* TODO Prevent device auto probing */
-	WARN("Device %s added to live group %d!\n", dev_name(dev),
-	     iommu_group_id(group->iommu_group));
-
-	return 0;
-}
-
-static int vfio_group_nb_verify(struct vfio_group *group, struct device *dev)
-{
-	/* We don't care what happens when the group isn't in use */
-	if (!atomic_read(&group->container_users))
-		return 0;
-
-	return vfio_dev_viable(dev, group);
-}
-
-static int vfio_iommu_group_notifier(struct notifier_block *nb,
-				     unsigned long action, void *data)
-{
-	struct vfio_group *group = container_of(nb, struct vfio_group, nb);
-	struct device *dev = data;
-	struct vfio_unbound_dev *unbound;
-
-	/*
-	 * Need to go through a group_lock lookup to get a reference or we
-	 * risk racing a group being removed.  Ignore spurious notifies.
-	 */
-	group = vfio_group_try_get(group);
-	if (!group)
-		return NOTIFY_OK;
-
-	switch (action) {
-	case IOMMU_GROUP_NOTIFY_ADD_DEVICE:
-		vfio_group_nb_add_dev(group, dev);
-		break;
-	case IOMMU_GROUP_NOTIFY_DEL_DEVICE:
-		/*
-		 * Nothing to do here.  If the device is in use, then the
-		 * vfio sub-driver should block the remove callback until
-		 * it is unused.  If the device is unused or attached to a
-		 * stub driver, then it should be released and we don't
-		 * care that it will be going away.
-		 */
-		break;
-	case IOMMU_GROUP_NOTIFY_BIND_DRIVER:
-		pr_debug("%s: Device %s, group %d binding to driver\n",
-			 __func__, dev_name(dev),
-			 iommu_group_id(group->iommu_group));
-		break;
-	case IOMMU_GROUP_NOTIFY_BOUND_DRIVER:
-		pr_debug("%s: Device %s, group %d bound to driver %s\n",
-			 __func__, dev_name(dev),
-			 iommu_group_id(group->iommu_group), dev->driver->name);
-		BUG_ON(vfio_group_nb_verify(group, dev));
-		break;
-	case IOMMU_GROUP_NOTIFY_UNBIND_DRIVER:
-		pr_debug("%s: Device %s, group %d unbinding from driver %s\n",
-			 __func__, dev_name(dev),
-			 iommu_group_id(group->iommu_group), dev->driver->name);
-		break;
-	case IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER:
-		pr_debug("%s: Device %s, group %d unbound from driver\n",
-			 __func__, dev_name(dev),
-			 iommu_group_id(group->iommu_group));
-		/*
-		 * XXX An unbound device in a live group is ok, but we'd
-		 * really like to avoid the above BUG_ON by preventing other
-		 * drivers from binding to it.  Once that occurs, we have to
-		 * stop the system to maintain isolation.  At a minimum, we'd
-		 * want a toggle to disable driver auto probe for this device.
-		 */
-
-		mutex_lock(&group->unbound_lock);
-		list_for_each_entry(unbound,
-				    &group->unbound_list, unbound_next) {
-			if (dev == unbound->dev) {
-				list_del(&unbound->unbound_next);
-				kfree(unbound);
-				break;
-			}
-		}
-		mutex_unlock(&group->unbound_lock);
-		break;
-	}
-
-	vfio_group_put(group);
-	return NOTIFY_OK;
-}
-
-/**
- * VFIO driver API
- */
-int vfio_add_group_dev(struct device *dev,
-		       const struct vfio_device_ops *ops, void *device_data)
-{
-	struct iommu_group *iommu_group;
-	struct vfio_group *group;
-	struct vfio_device *device;
-
-	iommu_group = iommu_group_get(dev);
-	if (!iommu_group)
-		return -EINVAL;
-
-	group = vfio_group_get_from_iommu(iommu_group);
-	if (!group) {
-		group = vfio_create_group(iommu_group);
-		if (IS_ERR(group)) {
-			iommu_group_put(iommu_group);
-			return PTR_ERR(group);
-		}
-	} else {
-		/*
-		 * A found vfio_group already holds a reference to the
-		 * iommu_group.  A created vfio_group keeps the reference.
-		 */
-		iommu_group_put(iommu_group);
-	}
-
-	device = vfio_group_get_device(group, dev);
-	if (device) {
-		WARN(1, "Device %s already exists on group %d\n",
-		     dev_name(dev), iommu_group_id(iommu_group));
-		vfio_device_put(device);
-		vfio_group_put(group);
-		return -EBUSY;
-	}
-
-	device = vfio_group_create_device(group, dev, ops, device_data);
-	if (IS_ERR(device)) {
-		vfio_group_put(group);
-		return PTR_ERR(device);
-	}
-
-	/*
-	 * Drop all but the vfio_device reference.  The vfio_device holds
-	 * a reference to the vfio_group, which holds a reference to the
-	 * iommu_group.
-	 */
-	vfio_group_put(group);
-
-	return 0;
-}
-EXPORT_SYMBOL_GPL(vfio_add_group_dev);
-
-/**
- * Get a reference to the vfio_device for a device.  Even if the
- * caller thinks they own the device, they could be racing with a
- * release call path, so we can't trust drvdata for the shortcut.
- * Go the long way around, from the iommu_group to the vfio_group
- * to the vfio_device.
- */
-struct vfio_device *vfio_device_get_from_dev(struct device *dev)
-{
-	struct iommu_group *iommu_group;
-	struct vfio_group *group;
-	struct vfio_device *device;
-
-	iommu_group = iommu_group_get(dev);
-	if (!iommu_group)
-		return NULL;
-
-	group = vfio_group_get_from_iommu(iommu_group);
-	iommu_group_put(iommu_group);
-	if (!group)
-		return NULL;
-
-	device = vfio_group_get_device(group, dev);
-	vfio_group_put(group);
-
-	return device;
-}
-EXPORT_SYMBOL_GPL(vfio_device_get_from_dev);
-
-static struct vfio_device *vfio_device_get_from_name(struct vfio_group *group,
-						     char *buf)
-{
-	struct vfio_device *device;
-
-	mutex_lock(&group->device_lock);
-	list_for_each_entry(device, &group->device_list, group_next) {
-		if (!strcmp(dev_name(device->dev), buf)) {
-			vfio_device_get(device);
-			break;
-		}
-	}
-	mutex_unlock(&group->device_lock);
-
-	return device;
-}
-
-/*
- * Caller must hold a reference to the vfio_device
- */
-void *vfio_device_data(struct vfio_device *device)
-{
-	return device->device_data;
-}
-EXPORT_SYMBOL_GPL(vfio_device_data);
-
-/* Given a referenced group, check if it contains the device */
-static bool vfio_dev_present(struct vfio_group *group, struct device *dev)
-{
-	struct vfio_device *device;
-
-	device = vfio_group_get_device(group, dev);
-	if (!device)
-		return false;
-
-	vfio_device_put(device);
-	return true;
-}
-
-/*
- * Decrement the device reference count and wait for the device to be
- * removed.  Open file descriptors for the device... */
-void *vfio_del_group_dev(struct device *dev)
-{
-	struct vfio_device *device = dev_get_drvdata(dev);
-	struct vfio_group *group = device->group;
-	void *device_data = device->device_data;
-	struct vfio_unbound_dev *unbound;
-	unsigned int i = 0;
-	long ret;
-	bool interrupted = false;
-
-	/*
-	 * The group exists so long as we have a device reference.  Get
-	 * a group reference and use it to scan for the device going away.
-	 */
-	vfio_group_get(group);
-
-	/*
-	 * When the device is removed from the group, the group suddenly
-	 * becomes non-viable; the device has a driver (until the unbind
-	 * completes), but it's not present in the group.  This is bad news
-	 * for any external users that need to re-acquire a group reference
-	 * in order to match and release their existing reference.  To
-	 * solve this, we track such devices on the unbound_list to bridge
-	 * the gap until they're fully unbound.
-	 */
-	unbound = kzalloc(sizeof(*unbound), GFP_KERNEL);
-	if (unbound) {
-		unbound->dev = dev;
-		mutex_lock(&group->unbound_lock);
-		list_add(&unbound->unbound_next, &group->unbound_list);
-		mutex_unlock(&group->unbound_lock);
-	}
-	WARN_ON(!unbound);
-
-	vfio_device_put(device);
-
-	/*
-	 * If the device is still present in the group after the above
-	 * 'put', then it is in use and we need to request it from the
-	 * bus driver.  The driver may in turn need to request the
-	 * device from the user.  We send the request on an arbitrary
-	 * interval with counter to allow the driver to take escalating
-	 * measures to release the device if it has the ability to do so.
-	 */
-	do {
-		device = vfio_group_get_device(group, dev);
-		if (!device)
-			break;
-
-		if (device->ops->request)
-			device->ops->request(device_data, i++);
-
-		vfio_device_put(device);
-
-		if (interrupted) {
-			ret = wait_event_timeout(vfio.release_q,
-					!vfio_dev_present(group, dev), HZ * 10);
-		} else {
-			ret = wait_event_interruptible_timeout(vfio.release_q,
-					!vfio_dev_present(group, dev), HZ * 10);
-			if (ret == -ERESTARTSYS) {
-				interrupted = true;
-				dev_warn(dev,
-					 "Device is currently in use, task"
-					 " \"%s\" (%d) "
-					 "blocked until device is released",
-					 current->comm, task_pid_nr(current));
-			}
-		}
-	} while (ret <= 0);
-
-	vfio_group_put(group);
-
-	return device_data;
-}
-EXPORT_SYMBOL_GPL(vfio_del_group_dev);
-
-/**
- * VFIO base fd, /dev/vfio/vfio
- */
-static long vfio_ioctl_check_extension(struct vfio_container *container,
-				       unsigned long arg)
-{
-	struct vfio_iommu_driver *driver;
-	long ret = 0;
-
-	down_read(&container->group_lock);
-
-	driver = container->iommu_driver;
-
-	switch (arg) {
-		/* No base extensions yet */
-	default:
-		/*
-		 * If no driver is set, poll all registered drivers for
-		 * extensions and return the first positive result.  If
-		 * a driver is already set, further queries will be passed
-		 * only to that driver.
-		 */
-		if (!driver) {
-			mutex_lock(&vfio.iommu_drivers_lock);
-			list_for_each_entry(driver, &vfio.iommu_drivers_list,
-					    vfio_next) {
-				if (!try_module_get(driver->ops->owner))
-					continue;
-
-				ret = driver->ops->ioctl(NULL,
-							 VFIO_CHECK_EXTENSION,
-							 arg);
-				module_put(driver->ops->owner);
-				if (ret > 0)
-					break;
-			}
-			mutex_unlock(&vfio.iommu_drivers_lock);
-		} else
-			ret = driver->ops->ioctl(container->iommu_data,
-						 VFIO_CHECK_EXTENSION, arg);
-	}
-
-	up_read(&container->group_lock);
-
-	return ret;
-}
-
-/* hold write lock on container->group_lock */
-static int __vfio_container_attach_groups(struct vfio_container *container,
-					  struct vfio_iommu_driver *driver,
-					  void *data)
-{
-	struct vfio_group *group;
-	int ret = -ENODEV;
-
-	list_for_each_entry(group, &container->group_list, container_next) {
-		ret = driver->ops->attach_group(data, group->iommu_group);
-		if (ret)
-			goto unwind;
-	}
-
-	return ret;
-
-unwind:
-	list_for_each_entry_continue_reverse(group, &container->group_list,
-					     container_next) {
-		driver->ops->detach_group(data, group->iommu_group);
-	}
-
-	return ret;
-}
-
-static long vfio_ioctl_set_iommu(struct vfio_container *container,
-				 unsigned long arg)
-{
-	struct vfio_iommu_driver *driver;
-	long ret = -ENODEV;
-
-	down_write(&container->group_lock);
-
-	/*
-	 * The container is designed to be an unprivileged interface while
-	 * the group can be assigned to specific users.  Therefore, only by
-	 * adding a group to a container does the user get the privilege of
-	 * enabling the iommu, which may allocate finite resources.  There
-	 * is no unset_iommu, but by removing all the groups from a container,
-	 * the container is deprivileged and returns to an unset state.
-	 */
-	if (list_empty(&container->group_list) || container->iommu_driver) {
-		up_write(&container->group_lock);
-		return -EINVAL;
-	}
-
-	mutex_lock(&vfio.iommu_drivers_lock);
-	list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
-		void *data;
-
-		if (!try_module_get(driver->ops->owner))
-			continue;
-
-		/*
-		 * The arg magic for SET_IOMMU is the same as CHECK_EXTENSION,
-		 * so test which iommu driver reported support for this
-		 * extension and call open on them.  We also pass them the
-		 * magic, allowing a single driver to support multiple
-		 * interfaces if they'd like.
-		 */
-		if (driver->ops->ioctl(NULL, VFIO_CHECK_EXTENSION, arg) <= 0) {
-			module_put(driver->ops->owner);
-			continue;
-		}
-
-		/* module reference holds the driver we're working on */
-		mutex_unlock(&vfio.iommu_drivers_lock);
-
-		data = driver->ops->open(arg);
-		if (IS_ERR(data)) {
-			ret = PTR_ERR(data);
-			module_put(driver->ops->owner);
-			goto skip_drivers_unlock;
-		}
-
-		ret = __vfio_container_attach_groups(container, driver, data);
-		if (!ret) {
-			container->iommu_driver = driver;
-			container->iommu_data = data;
-		} else {
-			driver->ops->release(data);
-			module_put(driver->ops->owner);
-		}
-
-		goto skip_drivers_unlock;
-	}
-
-	mutex_unlock(&vfio.iommu_drivers_lock);
-skip_drivers_unlock:
-	up_write(&container->group_lock);
-
-	return ret;
-}
-
-static long vfio_fops_unl_ioctl(struct file *filep,
-				unsigned int cmd, unsigned long arg)
-{
-	struct vfio_container *container = filep->private_data;
-	struct vfio_iommu_driver *driver;
-	void *data;
-	long ret = -EINVAL;
-
-	if (!container)
-		return ret;
-
-	switch (cmd) {
-	case VFIO_GET_API_VERSION:
-		ret = VFIO_API_VERSION;
-		break;
-	case VFIO_CHECK_EXTENSION:
-		ret = vfio_ioctl_check_extension(container, arg);
-		break;
-	case VFIO_SET_IOMMU:
-		ret = vfio_ioctl_set_iommu(container, arg);
-		break;
-	default:
-		down_read(&container->group_lock);
-
-		driver = container->iommu_driver;
-		data = container->iommu_data;
-
-		if (driver) /* passthrough all unrecognized ioctls */
-			ret = driver->ops->ioctl(data, cmd, arg);
-
-		up_read(&container->group_lock);
-	}
-
-	return ret;
-}
-
-#ifdef CONFIG_COMPAT
-static long vfio_fops_compat_ioctl(struct file *filep,
-				   unsigned int cmd, unsigned long arg)
-{
-	arg = (unsigned long)compat_ptr(arg);
-	return vfio_fops_unl_ioctl(filep, cmd, arg);
-}
-#endif	/* CONFIG_COMPAT */
-
-static int vfio_fops_open(struct inode *inode, struct file *filep)
-{
-	struct vfio_container *container;
-
-	container = kzalloc(sizeof(*container), GFP_KERNEL);
-	if (!container)
-		return -ENOMEM;
-
-	INIT_LIST_HEAD(&container->group_list);
-	init_rwsem(&container->group_lock);
-	kref_init(&container->kref);
-
-	filep->private_data = container;
-
-	return 0;
-}
-
-static int vfio_fops_release(struct inode *inode, struct file *filep)
-{
-	struct vfio_container *container = filep->private_data;
-
-	filep->private_data = NULL;
-
-	vfio_container_put(container);
-
-	return 0;
-}
-
-/*
- * Once an iommu driver is set, we optionally pass read/write/mmap
- * on to the driver, allowing management interfaces beyond ioctl.
- */
-static ssize_t vfio_fops_read(struct file *filep, char __user *buf,
-			      size_t count, loff_t *ppos)
-{
-	struct vfio_container *container = filep->private_data;
-	struct vfio_iommu_driver *driver;
-	ssize_t ret = -EINVAL;
-
-	down_read(&container->group_lock);
-
-	driver = container->iommu_driver;
-	if (likely(driver && driver->ops->read))
-		ret = driver->ops->read(container->iommu_data,
-					buf, count, ppos);
-
-	up_read(&container->group_lock);
-
-	return ret;
-}
-
-static ssize_t vfio_fops_write(struct file *filep, const char __user *buf,
-			       size_t count, loff_t *ppos)
-{
-	struct vfio_container *container = filep->private_data;
-	struct vfio_iommu_driver *driver;
-	ssize_t ret = -EINVAL;
-
-	down_read(&container->group_lock);
-
-	driver = container->iommu_driver;
-	if (likely(driver && driver->ops->write))
-		ret = driver->ops->write(container->iommu_data,
-					 buf, count, ppos);
-
-	up_read(&container->group_lock);
-
-	return ret;
-}
-
-static int vfio_fops_mmap(struct file *filep, struct vm_area_struct *vma)
-{
-	struct vfio_container *container = filep->private_data;
-	struct vfio_iommu_driver *driver;
-	int ret = -EINVAL;
-
-	down_read(&container->group_lock);
-
-	driver = container->iommu_driver;
-	if (likely(driver && driver->ops->mmap))
-		ret = driver->ops->mmap(container->iommu_data, vma);
-
-	up_read(&container->group_lock);
-
-	return ret;
-}
-
-static const struct file_operations vfio_fops = {
-	.owner		= THIS_MODULE,
-	.open		= vfio_fops_open,
-	.release	= vfio_fops_release,
-	.read		= vfio_fops_read,
-	.write		= vfio_fops_write,
-	.unlocked_ioctl	= vfio_fops_unl_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= vfio_fops_compat_ioctl,
-#endif
-	.mmap		= vfio_fops_mmap,
-};
-
-/**
- * VFIO Group fd, /dev/vfio/$GROUP
- */
-static void __vfio_group_unset_container(struct vfio_group *group)
-{
-	struct vfio_container *container = group->container;
-	struct vfio_iommu_driver *driver;
-
-	down_write(&container->group_lock);
-
-	driver = container->iommu_driver;
-	if (driver)
-		driver->ops->detach_group(container->iommu_data,
-					  group->iommu_group);
-
-	group->container = NULL;
-	list_del(&group->container_next);
-
-	/* Detaching the last group deprivileges a container, remove iommu */
-	if (driver && list_empty(&container->group_list)) {
-		driver->ops->release(container->iommu_data);
-		module_put(driver->ops->owner);
-		container->iommu_driver = NULL;
-		container->iommu_data = NULL;
-	}
-
-	up_write(&container->group_lock);
-
-	vfio_container_put(container);
-}
-
-/*
- * VFIO_GROUP_UNSET_CONTAINER should fail if there are other users or
- * if there was no container to unset.  Since the ioctl is called on
- * the group, we know that still exists, therefore the only valid
- * transition here is 1->0.
- */
-static int vfio_group_unset_container(struct vfio_group *group)
-{
-	int users = atomic_cmpxchg(&group->container_users, 1, 0);
-
-	if (!users)
-		return -EINVAL;
-	if (users != 1)
-		return -EBUSY;
-
-	__vfio_group_unset_container(group);
-
-	return 0;
-}
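The `atomic_cmpxchg(&group->container_users, 1, 0)` above encodes both failure modes in a single primitive: zero users means there was no container to unset, more than one means device fds still hold references. A userspace sketch of the same decision with C11 atomics (names here are illustrative, not kernel API):

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* The 1 -> 0 transition from vfio_group_unset_container(): only a group
 * with exactly one remaining container user may unset its container. */
static int try_unset(atomic_int *users)
{
	int expected = 1;

	if (atomic_compare_exchange_strong(users, &expected, 0))
		return 0;

	/* On failure the CAS wrote the observed value into 'expected' */
	return expected == 0 ? -EINVAL : -EBUSY;
}
```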
-
-/*
- * When removing container users, anything that removes the last user
- * implicitly removes the group from the container.  That is, if the
- * group file descriptor is closed, as well as any device file descriptors,
- * the group is free.
- */
-static void vfio_group_try_dissolve_container(struct vfio_group *group)
-{
-	if (0 == atomic_dec_if_positive(&group->container_users))
-		__vfio_group_unset_container(group);
-}
-
-static int vfio_group_set_container(struct vfio_group *group, int container_fd)
-{
-	struct fd f;
-	struct vfio_container *container;
-	struct vfio_iommu_driver *driver;
-	int ret = 0;
-
-	if (atomic_read(&group->container_users))
-		return -EINVAL;
-
-	f = fdget(container_fd);
-	if (!f.file)
-		return -EBADF;
-
-	/* Sanity check, is this really our fd? */
-	if (f.file->f_op != &vfio_fops) {
-		fdput(f);
-		return -EINVAL;
-	}
-
-	container = f.file->private_data;
-	WARN_ON(!container); /* fget ensures we don't race vfio_release */
-
-	down_write(&container->group_lock);
-
-	driver = container->iommu_driver;
-	if (driver) {
-		ret = driver->ops->attach_group(container->iommu_data,
-						group->iommu_group);
-		if (ret)
-			goto unlock_out;
-	}
-
-	group->container = container;
-	list_add(&group->container_next, &container->group_list);
-
-	/* Get a reference on the container and mark a user within the group */
-	vfio_container_get(container);
-	atomic_inc(&group->container_users);
-
-unlock_out:
-	up_write(&container->group_lock);
-	fdput(f);
-	return ret;
-}
-
-static bool vfio_group_viable(struct vfio_group *group)
-{
-	return (iommu_group_for_each_dev(group->iommu_group,
-					 group, vfio_dev_viable) == 0);
-}
-
-static const struct file_operations vfio_device_fops;
-
-static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
-{
-	struct vfio_device *device;
-	struct file *filep;
-	int ret;
-
-	if (0 == atomic_read(&group->container_users) ||
-	    !group->container->iommu_driver || !vfio_group_viable(group))
-		return -EINVAL;
-
-	device = vfio_device_get_from_name(group, buf);
-	if (!device)
-		return -ENODEV;
-
-	ret = device->ops->open(device->device_data);
-	if (ret) {
-		vfio_device_put(device);
-		return ret;
-	}
-
-	/*
-	 * We can't use anon_inode_getfd() because we need to modify
-	 * the f_mode flags directly to allow more than just ioctls
-	 */
-	ret = get_unused_fd_flags(O_CLOEXEC);
-	if (ret < 0) {
-		device->ops->release(device->device_data);
-		vfio_device_put(device);
-		return ret;
-	}
-
-	filep = anon_inode_getfile("[vfio-device]", &vfio_device_fops,
-				   device, O_RDWR);
-	if (IS_ERR(filep)) {
-		put_unused_fd(ret);
-		ret = PTR_ERR(filep);
-		device->ops->release(device->device_data);
-		vfio_device_put(device);
-		return ret;
-	}
-
-	/*
-	 * TODO: add an anon_inode interface to do this.
-	 * Appears to be missing by lack of need rather than
-	 * explicitly prevented.  Now there's need.
-	 */
-	filep->f_mode |= (FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
-
-	atomic_inc(&group->container_users);
-
-	fd_install(ret, filep);
-
-	return ret;
-}
-
-static long vfio_group_fops_unl_ioctl(struct file *filep,
-				      unsigned int cmd, unsigned long arg)
-{
-	struct vfio_group *group = filep->private_data;
-	long ret = -ENOTTY;
-
-	switch (cmd) {
-	case VFIO_GROUP_GET_STATUS:
-	{
-		struct vfio_group_status status;
-		unsigned long minsz;
-
-		minsz = offsetofend(struct vfio_group_status, flags);
-
-		if (copy_from_user(&status, (void __user *)arg, minsz))
-			return -EFAULT;
-
-		if (status.argsz < minsz)
-			return -EINVAL;
-
-		status.flags = 0;
-
-		if (vfio_group_viable(group))
-			status.flags |= VFIO_GROUP_FLAGS_VIABLE;
-
-		if (group->container)
-			status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET;
-
-		if (copy_to_user((void __user *)arg, &status, minsz))
-			return -EFAULT;
-
-		ret = 0;
-		break;
-	}
-	case VFIO_GROUP_SET_CONTAINER:
-	{
-		int fd;
-
-		if (get_user(fd, (int __user *)arg))
-			return -EFAULT;
-
-		if (fd < 0)
-			return -EINVAL;
-
-		ret = vfio_group_set_container(group, fd);
-		break;
-	}
-	case VFIO_GROUP_UNSET_CONTAINER:
-		ret = vfio_group_unset_container(group);
-		break;
-	case VFIO_GROUP_GET_DEVICE_FD:
-	{
-		char *buf;
-
-		buf = strndup_user((const char __user *)arg, PAGE_SIZE);
-		if (IS_ERR(buf))
-			return PTR_ERR(buf);
-
-		ret = vfio_group_get_device_fd(group, buf);
-		kfree(buf);
-		break;
-	}
-	}
-
-	return ret;
-}
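The GET_STATUS path above follows the standard extensible-ioctl convention used throughout VFIO: userspace sets `argsz` to the size of the structure it allocated, and the kernel rejects anything smaller than the fields it requires (`minsz`, computed with `offsetofend`). The check can be exercised in isolation; the struct layout below is a simplified stand-in, not the real uapi header:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* offsetofend() as used by the kernel: offset one past the end of a field */
#define offsetofend(type, field) \
	(offsetof(type, field) + sizeof(((type *)0)->field))

/* Illustrative mirror of the uapi struct: userspace fills argsz with the
 * size of the buffer it allocated, the kernel fills flags. */
struct group_status {
	uint32_t argsz;
	uint32_t flags;
};

/* The minsz check from VFIO_GROUP_GET_STATUS: reject buffers smaller
 * than the fields this kernel version needs */
static int check_argsz(uint32_t argsz)
{
	size_t minsz = offsetofend(struct group_status, flags);

	return argsz < minsz ? -22 /* -EINVAL */ : 0;
}
```

Because only `minsz` bytes are copied in either direction, a newer userspace with a larger struct still works against an older kernel, and vice versa.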
-
-#ifdef CONFIG_COMPAT
-static long vfio_group_fops_compat_ioctl(struct file *filep,
-					 unsigned int cmd, unsigned long arg)
-{
-	arg = (unsigned long)compat_ptr(arg);
-	return vfio_group_fops_unl_ioctl(filep, cmd, arg);
-}
-#endif	/* CONFIG_COMPAT */
-
-static int vfio_group_fops_open(struct inode *inode, struct file *filep)
-{
-	struct vfio_group *group;
-	int opened;
-
-	group = vfio_group_get_from_minor(iminor(inode));
-	if (!group)
-		return -ENODEV;
-
-	/* Do we need multiple instances of the group open?  Seems not. */
-	opened = atomic_cmpxchg(&group->opened, 0, 1);
-	if (opened) {
-		vfio_group_put(group);
-		return -EBUSY;
-	}
-
-	/* Is something still in use from a previous open? */
-	if (group->container) {
-		atomic_dec(&group->opened);
-		vfio_group_put(group);
-		return -EBUSY;
-	}
-
-	filep->private_data = group;
-
-	return 0;
-}
-
-static int vfio_group_fops_release(struct inode *inode, struct file *filep)
-{
-	struct vfio_group *group = filep->private_data;
-
-	filep->private_data = NULL;
-
-	vfio_group_try_dissolve_container(group);
-
-	atomic_dec(&group->opened);
-
-	vfio_group_put(group);
-
-	return 0;
-}
-
-static const struct file_operations vfio_group_fops = {
-	.owner		= THIS_MODULE,
-	.unlocked_ioctl	= vfio_group_fops_unl_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= vfio_group_fops_compat_ioctl,
-#endif
-	.open		= vfio_group_fops_open,
-	.release	= vfio_group_fops_release,
-};
-
-/**
- * VFIO Device fd
- */
-static int vfio_device_fops_release(struct inode *inode, struct file *filep)
-{
-	struct vfio_device *device = filep->private_data;
-
-	device->ops->release(device->device_data);
-
-	vfio_group_try_dissolve_container(device->group);
-
-	vfio_device_put(device);
-
-	return 0;
-}
-
-static long vfio_device_fops_unl_ioctl(struct file *filep,
-				       unsigned int cmd, unsigned long arg)
-{
-	struct vfio_device *device = filep->private_data;
-
-	if (unlikely(!device->ops->ioctl))
-		return -EINVAL;
-
-	return device->ops->ioctl(device->device_data, cmd, arg);
-}
-
-static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
-				     size_t count, loff_t *ppos)
-{
-	struct vfio_device *device = filep->private_data;
-
-	if (unlikely(!device->ops->read))
-		return -EINVAL;
-
-	return device->ops->read(device->device_data, buf, count, ppos);
-}
-
-static ssize_t vfio_device_fops_write(struct file *filep,
-				      const char __user *buf,
-				      size_t count, loff_t *ppos)
-{
-	struct vfio_device *device = filep->private_data;
-
-	if (unlikely(!device->ops->write))
-		return -EINVAL;
-
-	return device->ops->write(device->device_data, buf, count, ppos);
-}
-
-static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
-{
-	struct vfio_device *device = filep->private_data;
-
-	if (unlikely(!device->ops->mmap))
-		return -EINVAL;
-
-	return device->ops->mmap(device->device_data, vma);
-}
-
-#ifdef CONFIG_COMPAT
-static long vfio_device_fops_compat_ioctl(struct file *filep,
-					  unsigned int cmd, unsigned long arg)
-{
-	arg = (unsigned long)compat_ptr(arg);
-	return vfio_device_fops_unl_ioctl(filep, cmd, arg);
-}
-#endif	/* CONFIG_COMPAT */
-
-static const struct file_operations vfio_device_fops = {
-	.owner		= THIS_MODULE,
-	.release	= vfio_device_fops_release,
-	.read		= vfio_device_fops_read,
-	.write		= vfio_device_fops_write,
-	.unlocked_ioctl	= vfio_device_fops_unl_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= vfio_device_fops_compat_ioctl,
-#endif
-	.mmap		= vfio_device_fops_mmap,
-};
-
-/**
- * External user API, exported by symbols to be linked dynamically.
- *
- * The protocol includes:
- *  1. do normal VFIO init operation:
- *	- opening a new container;
- *	- attaching group(s) to it;
- *	- setting an IOMMU driver for a container.
- * When IOMMU is set for a container, all groups in it are
- * considered ready to use by an external user.
- *
- * 2. User space passes a group fd to an external user.
- * The external user calls vfio_group_get_external_user()
- * to verify that:
- *	- the group is initialized;
- *	- IOMMU is set for it.
- * If both checks passed, vfio_group_get_external_user()
- * increments the container user counter to prevent
- * the VFIO group from disposal before KVM exits.
- *
- * 3. The external user calls vfio_external_user_iommu_id()
- * to know an IOMMU ID.
- *
- * 4. When the external KVM finishes, it calls
- * vfio_group_put_external_user() to release the VFIO group.
- * This call decrements the container user counter.
- */
-struct vfio_group *vfio_group_get_external_user(struct file *filep)
-{
-	struct vfio_group *group = filep->private_data;
-
-	if (filep->f_op != &vfio_group_fops)
-		return ERR_PTR(-EINVAL);
-
-	if (!atomic_inc_not_zero(&group->container_users))
-		return ERR_PTR(-EINVAL);
-
-	if (!group->container->iommu_driver ||
-			!vfio_group_viable(group)) {
-		atomic_dec(&group->container_users);
-		return ERR_PTR(-EINVAL);
-	}
-
-	vfio_group_get(group);
-
-	return group;
-}
-EXPORT_SYMBOL_GPL(vfio_group_get_external_user);
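The gate in `vfio_group_get_external_user()` is `atomic_inc_not_zero()`: an external user such as KVM may only take a reference on a group whose container already has at least one user, so a dying group can never be revived. A C11 sketch of that primitive (illustrative, not the kernel implementation):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* atomic_inc_not_zero(): take a reference only if at least one already
 * exists, so a count that has reached zero stays at zero. */
static bool inc_not_zero(atomic_int *v)
{
	int old = atomic_load(v);

	while (old != 0) {
		/* On failure the CAS reloads 'old'; loop and retry */
		if (atomic_compare_exchange_weak(v, &old, old + 1))
			return true;
	}
	return false;
}
```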
-
-void vfio_group_put_external_user(struct vfio_group *group)
-{
-	vfio_group_put(group);
-	vfio_group_try_dissolve_container(group);
-}
-EXPORT_SYMBOL_GPL(vfio_group_put_external_user);
-
-int vfio_external_user_iommu_id(struct vfio_group *group)
-{
-	return iommu_group_id(group->iommu_group);
-}
-EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id);
-
-long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
-{
-	return vfio_ioctl_check_extension(group->container, arg);
-}
-EXPORT_SYMBOL_GPL(vfio_external_check_extension);
-
-/**
- * Module/class support
- */
-static char *vfio_devnode(struct device *dev, umode_t *mode)
-{
-	return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
-}
-
-static struct miscdevice vfio_dev = {
-	.minor = VFIO_MINOR,
-	.name = "vfio",
-	.fops = &vfio_fops,
-	.nodename = "vfio/vfio",
-	.mode = S_IRUGO | S_IWUGO,
-};
-
-static int __init vfio_init(void)
-{
-	int ret;
-
-	idr_init(&vfio.group_idr);
-	mutex_init(&vfio.group_lock);
-	mutex_init(&vfio.iommu_drivers_lock);
-	INIT_LIST_HEAD(&vfio.group_list);
-	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
-	init_waitqueue_head(&vfio.release_q);
-
-	ret = misc_register(&vfio_dev);
-	if (ret) {
-		pr_err("vfio: misc device register failed\n");
-		return ret;
-	}
-
-	/* /dev/vfio/$GROUP */
-	vfio.class = class_create(THIS_MODULE, "vfio");
-	if (IS_ERR(vfio.class)) {
-		ret = PTR_ERR(vfio.class);
-		goto err_class;
-	}
-
-	vfio.class->devnode = vfio_devnode;
-
-	ret = alloc_chrdev_region(&vfio.group_devt, 0, MINORMASK, "vfio");
-	if (ret)
-		goto err_alloc_chrdev;
-
-	cdev_init(&vfio.group_cdev, &vfio_group_fops);
-	ret = cdev_add(&vfio.group_cdev, vfio.group_devt, MINORMASK);
-	if (ret)
-		goto err_cdev_add;
-
-	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
-
-	/*
-	 * Attempt to load known iommu-drivers.  This gives us a working
-	 * environment without the user needing to explicitly load iommu
-	 * drivers.
-	 */
-	request_module_nowait("vfio_iommu_type1");
-	request_module_nowait("vfio_iommu_spapr_tce");
-
-	return 0;
-
-err_cdev_add:
-	unregister_chrdev_region(vfio.group_devt, MINORMASK);
-err_alloc_chrdev:
-	class_destroy(vfio.class);
-	vfio.class = NULL;
-err_class:
-	misc_deregister(&vfio_dev);
-	return ret;
-}
-
-static void __exit vfio_cleanup(void)
-{
-	WARN_ON(!list_empty(&vfio.group_list));
-
-	idr_destroy(&vfio.group_idr);
-	cdev_del(&vfio.group_cdev);
-	unregister_chrdev_region(vfio.group_devt, MINORMASK);
-	class_destroy(vfio.class);
-	vfio.class = NULL;
-	misc_deregister(&vfio_dev);
-}
-
-module_init(vfio_init);
-module_exit(vfio_cleanup);
-
-MODULE_VERSION(DRIVER_VERSION);
-MODULE_LICENSE("GPL v2");
-MODULE_AUTHOR(DRIVER_AUTHOR);
-MODULE_DESCRIPTION(DRIVER_DESC);
-MODULE_ALIAS_MISCDEV(VFIO_MINOR);
-MODULE_ALIAS("devname:vfio/vfio");
diff --git a/drivers/vfio/vfio_core.c b/drivers/vfio/vfio_core.c
new file mode 100644
index 0000000..563c510
--- /dev/null
+++ b/drivers/vfio/vfio_core.c
@@ -0,0 +1,1640 @@
+/*
+ * VFIO core
+ *
+ * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/cdev.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/fs.h>
+#include <linux/idr.h>
+#include <linux/iommu.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/stat.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/wait.h>
+
+#define DRIVER_VERSION	"0.3"
+#define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC	"VFIO - User Level meta-driver"
+
+static struct vfio {
+	struct class			*class;
+	struct list_head		iommu_drivers_list;
+	struct mutex			iommu_drivers_lock;
+	struct list_head		group_list;
+	struct idr			group_idr;
+	struct mutex			group_lock;
+	struct cdev			group_cdev;
+	dev_t				group_devt;
+	wait_queue_head_t		release_q;
+} vfio;
+
+struct vfio_iommu_driver {
+	const struct vfio_iommu_driver_ops	*ops;
+	struct list_head			vfio_next;
+};
+
+struct vfio_container {
+	struct kref			kref;
+	struct list_head		group_list;
+	struct rw_semaphore		group_lock;
+	struct vfio_iommu_driver	*iommu_driver;
+	void				*iommu_data;
+};
+
+struct vfio_unbound_dev {
+	struct device			*dev;
+	struct list_head		unbound_next;
+};
+
+struct vfio_group {
+	struct kref			kref;
+	int				minor;
+	atomic_t			container_users;
+	struct iommu_group		*iommu_group;
+	struct vfio_container		*container;
+	struct list_head		device_list;
+	struct mutex			device_lock;
+	struct device			*dev;
+	struct notifier_block		nb;
+	struct list_head		vfio_next;
+	struct list_head		container_next;
+	struct list_head		unbound_list;
+	struct mutex			unbound_lock;
+	atomic_t			opened;
+};
+
+struct vfio_device {
+	struct kref			kref;
+	struct device			*dev;
+	const struct vfio_device_ops	*ops;
+	struct vfio_group		*group;
+	struct list_head		group_next;
+	void				*device_data;
+};
+
+/**
+ * IOMMU driver registration
+ */
+int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops)
+{
+	struct vfio_iommu_driver *driver, *tmp;
+
+	driver = kzalloc(sizeof(*driver), GFP_KERNEL);
+	if (!driver)
+		return -ENOMEM;
+
+	driver->ops = ops;
+
+	mutex_lock(&vfio.iommu_drivers_lock);
+
+	/* Check for duplicates */
+	list_for_each_entry(tmp, &vfio.iommu_drivers_list, vfio_next) {
+		if (tmp->ops == ops) {
+			mutex_unlock(&vfio.iommu_drivers_lock);
+			kfree(driver);
+			return -EINVAL;
+		}
+	}
+
+	list_add(&driver->vfio_next, &vfio.iommu_drivers_list);
+
+	mutex_unlock(&vfio.iommu_drivers_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_register_iommu_driver);
+
+void vfio_unregister_iommu_driver(const struct vfio_iommu_driver_ops *ops)
+{
+	struct vfio_iommu_driver *driver;
+
+	mutex_lock(&vfio.iommu_drivers_lock);
+	list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
+		if (driver->ops == ops) {
+			list_del(&driver->vfio_next);
+			mutex_unlock(&vfio.iommu_drivers_lock);
+			kfree(driver);
+			return;
+		}
+	}
+	mutex_unlock(&vfio.iommu_drivers_lock);
+}
+EXPORT_SYMBOL_GPL(vfio_unregister_iommu_driver);
+
+/**
+ * Group minor allocation/free - both called with vfio.group_lock held
+ */
+static int vfio_alloc_group_minor(struct vfio_group *group)
+{
+	return idr_alloc(&vfio.group_idr, group, 0, MINORMASK + 1, GFP_KERNEL);
+}
+
+static void vfio_free_group_minor(int minor)
+{
+	idr_remove(&vfio.group_idr, minor);
+}
+
+static int vfio_iommu_group_notifier(struct notifier_block *nb,
+				     unsigned long action, void *data);
+static void vfio_group_get(struct vfio_group *group);
+
+/**
+ * Container objects - containers are created when /dev/vfio/vfio is
+ * opened, but their lifecycle extends until the last user is done, so
+ * it's freed via kref.  Must support container/group/device being
+ * closed in any order.
+ */
+static void vfio_container_get(struct vfio_container *container)
+{
+	kref_get(&container->kref);
+}
+
+static void vfio_container_release(struct kref *kref)
+{
+	struct vfio_container *container;
+	container = container_of(kref, struct vfio_container, kref);
+
+	kfree(container);
+}
+
+static void vfio_container_put(struct vfio_container *container)
+{
+	kref_put(&container->kref, vfio_container_release);
+}
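The container lifetime described in the comment above is plain kref semantics: one reference at open, and the object survives until the last of the container, group, and device users drops it, in any order. A minimal userspace sketch of that pattern (the `released` flag stands in for `kfree()`; names are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>

/* kref-style lifetime of a vfio_container: created at open with one
 * reference, freed when the last holder drops it. */
struct container {
	atomic_int kref;
	int released;	/* test hook standing in for kfree() */
};

static void container_get(struct container *c)
{
	atomic_fetch_add(&c->kref, 1);
}

static void container_put(struct container *c)
{
	/* fetch_sub returns the old value: 1 means we were the last user */
	if (atomic_fetch_sub(&c->kref, 1) == 1)
		c->released = 1;
}
```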
+
+static void vfio_group_unlock_and_free(struct vfio_group *group)
+{
+	mutex_unlock(&vfio.group_lock);
+	/*
+	 * Unregister outside of lock.  A spurious callback is harmless now
+	 * that the group is no longer in vfio.group_list.
+	 */
+	iommu_group_unregister_notifier(group->iommu_group, &group->nb);
+	kfree(group);
+}
+
+/**
+ * Group objects - create, release, get, put, search
+ */
+static struct vfio_group *vfio_create_group(struct iommu_group *iommu_group)
+{
+	struct vfio_group *group, *tmp;
+	struct device *dev;
+	int ret, minor;
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&group->kref);
+	INIT_LIST_HEAD(&group->device_list);
+	mutex_init(&group->device_lock);
+	INIT_LIST_HEAD(&group->unbound_list);
+	mutex_init(&group->unbound_lock);
+	atomic_set(&group->container_users, 0);
+	atomic_set(&group->opened, 0);
+	group->iommu_group = iommu_group;
+
+	group->nb.notifier_call = vfio_iommu_group_notifier;
+
+	/*
+	 * blocking notifiers acquire a rwsem around registering and hold
+	 * it around callback.  Therefore, need to register outside of
+	 * vfio.group_lock to avoid A-B/B-A contention.  Our callback won't
+	 * do anything unless it can find the group in vfio.group_list, so
+	 * no harm in registering early.
+	 */
+	ret = iommu_group_register_notifier(iommu_group, &group->nb);
+	if (ret) {
+		kfree(group);
+		return ERR_PTR(ret);
+	}
+
+	mutex_lock(&vfio.group_lock);
+
+	/* Did we race creating this group? */
+	list_for_each_entry(tmp, &vfio.group_list, vfio_next) {
+		if (tmp->iommu_group == iommu_group) {
+			vfio_group_get(tmp);
+			vfio_group_unlock_and_free(group);
+			return tmp;
+		}
+	}
+
+	minor = vfio_alloc_group_minor(group);
+	if (minor < 0) {
+		vfio_group_unlock_and_free(group);
+		return ERR_PTR(minor);
+	}
+
+	dev = device_create(vfio.class, NULL,
+			    MKDEV(MAJOR(vfio.group_devt), minor),
+			    group, "%d", iommu_group_id(iommu_group));
+	if (IS_ERR(dev)) {
+		vfio_free_group_minor(minor);
+		vfio_group_unlock_and_free(group);
+		return (struct vfio_group *)dev; /* ERR_PTR */
+	}
+
+	group->minor = minor;
+	group->dev = dev;
+
+	list_add(&group->vfio_next, &vfio.group_list);
+
+	mutex_unlock(&vfio.group_lock);
+
+	return group;
+}
+
+/* called with vfio.group_lock held */
+static void vfio_group_release(struct kref *kref)
+{
+	struct vfio_group *group = container_of(kref, struct vfio_group, kref);
+	struct vfio_unbound_dev *unbound, *tmp;
+	struct iommu_group *iommu_group = group->iommu_group;
+
+	WARN_ON(!list_empty(&group->device_list));
+
+	list_for_each_entry_safe(unbound, tmp,
+				 &group->unbound_list, unbound_next) {
+		list_del(&unbound->unbound_next);
+		kfree(unbound);
+	}
+
+	device_destroy(vfio.class, MKDEV(MAJOR(vfio.group_devt), group->minor));
+	list_del(&group->vfio_next);
+	vfio_free_group_minor(group->minor);
+	vfio_group_unlock_and_free(group);
+	iommu_group_put(iommu_group);
+}
+
+static void vfio_group_put(struct vfio_group *group)
+{
+	kref_put_mutex(&group->kref, vfio_group_release, &vfio.group_lock);
+}
+
+/* Assume group_lock or group reference is held */
+static void vfio_group_get(struct vfio_group *group)
+{
+	kref_get(&group->kref);
+}
+
+/*
+ * Not really a try as we will sleep for mutex, but we need to make
+ * sure the group pointer is valid under lock and get a reference.
+ */
+static struct vfio_group *vfio_group_try_get(struct vfio_group *group)
+{
+	struct vfio_group *target = group;
+
+	mutex_lock(&vfio.group_lock);
+	list_for_each_entry(group, &vfio.group_list, vfio_next) {
+		if (group == target) {
+			vfio_group_get(group);
+			mutex_unlock(&vfio.group_lock);
+			return group;
+		}
+	}
+	mutex_unlock(&vfio.group_lock);
+
+	return NULL;
+}
+
+static
+struct vfio_group *vfio_group_get_from_iommu(struct iommu_group *iommu_group)
+{
+	struct vfio_group *group;
+
+	mutex_lock(&vfio.group_lock);
+	list_for_each_entry(group, &vfio.group_list, vfio_next) {
+		if (group->iommu_group == iommu_group) {
+			vfio_group_get(group);
+			mutex_unlock(&vfio.group_lock);
+			return group;
+		}
+	}
+	mutex_unlock(&vfio.group_lock);
+
+	return NULL;
+}
+
+static struct vfio_group *vfio_group_get_from_minor(int minor)
+{
+	struct vfio_group *group;
+
+	mutex_lock(&vfio.group_lock);
+	group = idr_find(&vfio.group_idr, minor);
+	if (!group) {
+		mutex_unlock(&vfio.group_lock);
+		return NULL;
+	}
+	vfio_group_get(group);
+	mutex_unlock(&vfio.group_lock);
+
+	return group;
+}
+
+/**
+ * Device objects - create, release, get, put, search
+ */
+static
+struct vfio_device *vfio_group_create_device(struct vfio_group *group,
+					     struct device *dev,
+					     const struct vfio_device_ops *ops,
+					     void *device_data)
+{
+	struct vfio_device *device;
+
+	device = kzalloc(sizeof(*device), GFP_KERNEL);
+	if (!device)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&device->kref);
+	device->dev = dev;
+	device->group = group;
+	device->ops = ops;
+	device->device_data = device_data;
+	dev_set_drvdata(dev, device);
+
+	/* No need to get group_lock, caller has group reference */
+	vfio_group_get(group);
+
+	mutex_lock(&group->device_lock);
+	list_add(&device->group_next, &group->device_list);
+	mutex_unlock(&group->device_lock);
+
+	return device;
+}
+
+static void vfio_device_release(struct kref *kref)
+{
+	struct vfio_device *device = container_of(kref,
+						  struct vfio_device, kref);
+	struct vfio_group *group = device->group;
+
+	list_del(&device->group_next);
+	mutex_unlock(&group->device_lock);
+
+	dev_set_drvdata(device->dev, NULL);
+
+	kfree(device);
+
+	/* vfio_del_group_dev may be waiting for this device */
+	wake_up(&vfio.release_q);
+}
+
+/* Device reference always implies a group reference */
+void vfio_device_put(struct vfio_device *device)
+{
+	struct vfio_group *group = device->group;
+	kref_put_mutex(&device->kref, vfio_device_release, &group->device_lock);
+	vfio_group_put(group);
+}
+EXPORT_SYMBOL_GPL(vfio_device_put);
+
+static void vfio_device_get(struct vfio_device *device)
+{
+	vfio_group_get(device->group);
+	kref_get(&device->kref);
+}
+
+static struct vfio_device *vfio_group_get_device(struct vfio_group *group,
+						 struct device *dev)
+{
+	struct vfio_device *device;
+
+	mutex_lock(&group->device_lock);
+	list_for_each_entry(device, &group->device_list, group_next) {
+		if (device->dev == dev) {
+			vfio_device_get(device);
+			mutex_unlock(&group->device_lock);
+			return device;
+		}
+	}
+	mutex_unlock(&group->device_lock);
+	return NULL;
+}
+
+/*
+ * Whitelist some drivers that we know are safe (no dma) or just sit on
+ * a device.  It's not always practical to leave a device within a group
+ * driverless as it could get re-bound to something unsafe.
+ */
+static const char * const vfio_driver_whitelist[] = { "pci-stub", "pcieport" };
+
+static bool vfio_whitelisted_driver(struct device_driver *drv)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(vfio_driver_whitelist); i++) {
+		if (!strcmp(drv->name, vfio_driver_whitelist[i]))
+			return true;
+	}
+
+	return false;
+}
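The whitelist test is an exact-match scan over a short static list; substring or glob matching would be too permissive for an isolation decision. A standalone sketch of the same shape:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Same shape as vfio_whitelisted_driver(): exact-match against a short
 * static list of drivers known not to initiate DMA */
static const char *const whitelist[] = { "pci-stub", "pcieport" };

static bool whitelisted(const char *name)
{
	size_t i;

	for (i = 0; i < ARRAY_SIZE(whitelist); i++)
		if (!strcmp(name, whitelist[i]))
			return true;
	return false;
}
```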
+
+/*
+ * A vfio group is viable for use by userspace if all devices are in
+ * one of the following states:
+ *  - driver-less
+ *  - bound to a vfio driver
+ *  - bound to a whitelisted driver
+ *
+ * We use two methods to determine whether a device is bound to a vfio
+ * driver.  The first is to test whether the device exists in the vfio
+ * group.  The second is to test if the device exists on the group
+ * unbound_list, indicating it's in the middle of transitioning from
+ * a vfio driver to driver-less.
+ */
+static int vfio_dev_viable(struct device *dev, void *data)
+{
+	struct vfio_group *group = data;
+	struct vfio_device *device;
+	struct device_driver *drv = ACCESS_ONCE(dev->driver);
+	struct vfio_unbound_dev *unbound;
+	int ret = -EINVAL;
+
+	mutex_lock(&group->unbound_lock);
+	list_for_each_entry(unbound, &group->unbound_list, unbound_next) {
+		if (dev == unbound->dev) {
+			ret = 0;
+			break;
+		}
+	}
+	mutex_unlock(&group->unbound_lock);
+
+	if (!ret || !drv || vfio_whitelisted_driver(drv))
+		return 0;
+
+	device = vfio_group_get_device(group, dev);
+	if (device) {
+		vfio_device_put(device);
+		return 0;
+	}
+
+	return ret;
+}
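Putting the states from the comment block together, the per-device decision in `vfio_dev_viable()` reduces to: viable if the device is mid-unbind (on the unbound list), driverless, bound to a whitelisted driver, or already a vfio device; otherwise the group is not viable. That decision core, lifted out of the locking and list walks (a sketch, not the kernel function):

```c
#include <assert.h>
#include <stdbool.h>

/* Decision core of vfio_dev_viable(): a device leaves the group viable
 * iff it is mid-unbind, driverless, whitelisted, or bound to vfio */
static int dev_viable(bool on_unbound_list, bool has_driver,
		      bool whitelisted, bool vfio_bound)
{
	if (on_unbound_list || !has_driver || whitelisted)
		return 0;

	return vfio_bound ? 0 : -22;	/* -EINVAL: group not viable */
}
```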
+
+/**
+ * Async device support
+ */
+static int vfio_group_nb_add_dev(struct vfio_group *group, struct device *dev)
+{
+	struct vfio_device *device;
+
+	/* Do we already know about it?  We shouldn't */
+	device = vfio_group_get_device(group, dev);
+	if (WARN_ON_ONCE(device)) {
+		vfio_device_put(device);
+		return 0;
+	}
+
+	/* Nothing to do for idle groups */
+	if (!atomic_read(&group->container_users))
+		return 0;
+
+	/* TODO Prevent device auto probing */
+	WARN("Device %s added to live group %d!\n", dev_name(dev),
+	     iommu_group_id(group->iommu_group));
+
+	return 0;
+}
+
+static int vfio_group_nb_verify(struct vfio_group *group, struct device *dev)
+{
+	/* We don't care what happens when the group isn't in use */
+	if (!atomic_read(&group->container_users))
+		return 0;
+
+	return vfio_dev_viable(dev, group);
+}
+
+static int vfio_iommu_group_notifier(struct notifier_block *nb,
+				     unsigned long action, void *data)
+{
+	struct vfio_group *group = container_of(nb, struct vfio_group, nb);
+	struct device *dev = data;
+	struct vfio_unbound_dev *unbound;
+
+	/*
+	 * Need to go through a group_lock lookup to get a reference or we
+	 * risk racing a group being removed.  Ignore spurious notifies.
+	 */
+	group = vfio_group_try_get(group);
+	if (!group)
+		return NOTIFY_OK;
+
+	switch (action) {
+	case IOMMU_GROUP_NOTIFY_ADD_DEVICE:
+		vfio_group_nb_add_dev(group, dev);
+		break;
+	case IOMMU_GROUP_NOTIFY_DEL_DEVICE:
+		/*
+		 * Nothing to do here.  If the device is in use, then the
+		 * vfio sub-driver should block the remove callback until
+		 * it is unused.  If the device is unused or attached to a
+		 * stub driver, then it should be released and we don't
+		 * care that it will be going away.
+		 */
+		break;
+	case IOMMU_GROUP_NOTIFY_BIND_DRIVER:
+		pr_debug("%s: Device %s, group %d binding to driver\n",
+			 __func__, dev_name(dev),
+			 iommu_group_id(group->iommu_group));
+		break;
+	case IOMMU_GROUP_NOTIFY_BOUND_DRIVER:
+		pr_debug("%s: Device %s, group %d bound to driver %s\n",
+			 __func__, dev_name(dev),
+			 iommu_group_id(group->iommu_group), dev->driver->name);
+		BUG_ON(vfio_group_nb_verify(group, dev));
+		break;
+	case IOMMU_GROUP_NOTIFY_UNBIND_DRIVER:
+		pr_debug("%s: Device %s, group %d unbinding from driver %s\n",
+			 __func__, dev_name(dev),
+			 iommu_group_id(group->iommu_group), dev->driver->name);
+		break;
+	case IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER:
+		pr_debug("%s: Device %s, group %d unbound from driver\n",
+			 __func__, dev_name(dev),
+			 iommu_group_id(group->iommu_group));
+		/*
+		 * XXX An unbound device in a live group is ok, but we'd
+		 * really like to avoid the above BUG_ON by preventing other
+		 * drivers from binding to it.  Once that occurs, we have to
+		 * stop the system to maintain isolation.  At a minimum, we'd
+		 * want a toggle to disable driver auto probe for this device.
+		 */
+
+		mutex_lock(&group->unbound_lock);
+		list_for_each_entry(unbound,
+				    &group->unbound_list, unbound_next) {
+			if (dev == unbound->dev) {
+				list_del(&unbound->unbound_next);
+				kfree(unbound);
+				break;
+			}
+		}
+		mutex_unlock(&group->unbound_lock);
+		break;
+	}
+
+	vfio_group_put(group);
+	return NOTIFY_OK;
+}
+
+/**
+ * VFIO driver API
+ */
+int vfio_add_group_dev(struct device *dev,
+		       const struct vfio_device_ops *ops, void *device_data)
+{
+	struct iommu_group *iommu_group;
+	struct vfio_group *group;
+	struct vfio_device *device;
+
+	iommu_group = iommu_group_get(dev);
+	if (!iommu_group)
+		return -EINVAL;
+
+	group = vfio_group_get_from_iommu(iommu_group);
+	if (!group) {
+		group = vfio_create_group(iommu_group);
+		if (IS_ERR(group)) {
+			iommu_group_put(iommu_group);
+			return PTR_ERR(group);
+		}
+	} else {
+		/*
+		 * A found vfio_group already holds a reference to the
+		 * iommu_group.  A created vfio_group keeps the reference.
+		 */
+		iommu_group_put(iommu_group);
+	}
+
+	device = vfio_group_get_device(group, dev);
+	if (device) {
+		WARN(1, "Device %s already exists on group %d\n",
+		     dev_name(dev), iommu_group_id(iommu_group));
+		vfio_device_put(device);
+		vfio_group_put(group);
+		return -EBUSY;
+	}
+
+	device = vfio_group_create_device(group, dev, ops, device_data);
+	if (IS_ERR(device)) {
+		vfio_group_put(group);
+		return PTR_ERR(device);
+	}
+
+	/*
+	 * Drop all but the vfio_device reference.  The vfio_device holds
+	 * a reference to the vfio_group, which holds a reference to the
+	 * iommu_group.
+	 */
+	vfio_group_put(group);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_add_group_dev);
+
+/**
+ * Get a reference to the vfio_device for a device.  Even if the
+ * caller thinks they own the device, they could be racing with a
+ * release call path, so we can't trust drvdata for the shortcut.
+ * Go the long way around, from the iommu_group to the vfio_group
+ * to the vfio_device.
+ */
+struct vfio_device *vfio_device_get_from_dev(struct device *dev)
+{
+	struct iommu_group *iommu_group;
+	struct vfio_group *group;
+	struct vfio_device *device;
+
+	iommu_group = iommu_group_get(dev);
+	if (!iommu_group)
+		return NULL;
+
+	group = vfio_group_get_from_iommu(iommu_group);
+	iommu_group_put(iommu_group);
+	if (!group)
+		return NULL;
+
+	device = vfio_group_get_device(group, dev);
+	vfio_group_put(group);
+
+	return device;
+}
+EXPORT_SYMBOL_GPL(vfio_device_get_from_dev);
+
+static struct vfio_device *vfio_device_get_from_name(struct vfio_group *group,
+						     char *buf)
+{
+	struct vfio_device *it, *device = NULL;
+
+	mutex_lock(&group->device_lock);
+	list_for_each_entry(it, &group->device_list, group_next) {
+		if (!strcmp(dev_name(it->dev), buf)) {
+			device = it;
+			vfio_device_get(device);
+			break;
+		}
+	}
+	mutex_unlock(&group->device_lock);
+
+	return device;
+}
+
+/*
+ * Caller must hold a reference to the vfio_device
+ */
+void *vfio_device_data(struct vfio_device *device)
+{
+	return device->device_data;
+}
+EXPORT_SYMBOL_GPL(vfio_device_data);
+
+/* Given a referenced group, check if it contains the device */
+static bool vfio_dev_present(struct vfio_group *group, struct device *dev)
+{
+	struct vfio_device *device;
+
+	device = vfio_group_get_device(group, dev);
+	if (!device)
+		return false;
+
+	vfio_device_put(device);
+	return true;
+}
+
+/*
+ * Decrement the device reference count and wait for the device to be
+ * removed.  Open file descriptors for the device...
+ */
+void *vfio_del_group_dev(struct device *dev)
+{
+	struct vfio_device *device = dev_get_drvdata(dev);
+	struct vfio_group *group = device->group;
+	void *device_data = device->device_data;
+	struct vfio_unbound_dev *unbound;
+	unsigned int i = 0;
+	long ret;
+	bool interrupted = false;
+
+	/*
+	 * The group exists so long as we have a device reference.  Get
+	 * a group reference and use it to scan for the device going away.
+	 */
+	vfio_group_get(group);
+
+	/*
+	 * When the device is removed from the group, the group suddenly
+	 * becomes non-viable; the device has a driver (until the unbind
+	 * completes), but it's not present in the group.  This is bad news
+	 * for any external users that need to re-acquire a group reference
+	 * in order to match and release their existing reference.  To
+	 * solve this, we track such devices on the unbound_list to bridge
+	 * the gap until they're fully unbound.
+	 */
+	unbound = kzalloc(sizeof(*unbound), GFP_KERNEL);
+	if (unbound) {
+		unbound->dev = dev;
+		mutex_lock(&group->unbound_lock);
+		list_add(&unbound->unbound_next, &group->unbound_list);
+		mutex_unlock(&group->unbound_lock);
+	}
+	WARN_ON(!unbound);
+
+	vfio_device_put(device);
+
+	/*
+	 * If the device is still present in the group after the above
+	 * 'put', then it is in use and we need to request it from the
+	 * bus driver.  The driver may in turn need to request the
+	 * device from the user.  We send the request on an arbitrary
+	 * interval with counter to allow the driver to take escalating
+	 * measures to release the device if it has the ability to do so.
+	 */
+	do {
+		device = vfio_group_get_device(group, dev);
+		if (!device)
+			break;
+
+		if (device->ops->request)
+			device->ops->request(device_data, i++);
+
+		vfio_device_put(device);
+
+		if (interrupted) {
+			ret = wait_event_timeout(vfio.release_q,
+					!vfio_dev_present(group, dev), HZ * 10);
+		} else {
+			ret = wait_event_interruptible_timeout(vfio.release_q,
+					!vfio_dev_present(group, dev), HZ * 10);
+			if (ret == -ERESTARTSYS) {
+				interrupted = true;
+				dev_warn(dev,
+					 "Device is currently in use, task \"%s\" (%d) blocked until device is released",
+					 current->comm, task_pid_nr(current));
+			}
+		}
+	} while (ret <= 0);
+
+	vfio_group_put(group);
+
+	return device_data;
+}
+EXPORT_SYMBOL_GPL(vfio_del_group_dev);
+
+/**
+ * VFIO base fd, /dev/vfio/vfio
+ */
+static long vfio_ioctl_check_extension(struct vfio_container *container,
+				       unsigned long arg)
+{
+	struct vfio_iommu_driver *driver;
+	long ret = 0;
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+
+	switch (arg) {
+		/* No base extensions yet */
+	default:
+		/*
+		 * If no driver is set, poll all registered drivers for
+		 * extensions and return the first positive result.  If
+		 * a driver is already set, further queries will be passed
+		 * only to that driver.
+		 */
+		if (!driver) {
+			mutex_lock(&vfio.iommu_drivers_lock);
+			list_for_each_entry(driver, &vfio.iommu_drivers_list,
+					    vfio_next) {
+				if (!try_module_get(driver->ops->owner))
+					continue;
+
+				ret = driver->ops->ioctl(NULL,
+							 VFIO_CHECK_EXTENSION,
+							 arg);
+				module_put(driver->ops->owner);
+				if (ret > 0)
+					break;
+			}
+			mutex_unlock(&vfio.iommu_drivers_lock);
+		} else {
+			ret = driver->ops->ioctl(container->iommu_data,
+						 VFIO_CHECK_EXTENSION, arg);
+		}
+	}
+
+	up_read(&container->group_lock);
+
+	return ret;
+}
+
+/* hold write lock on container->group_lock */
+static int __vfio_container_attach_groups(struct vfio_container *container,
+					  struct vfio_iommu_driver *driver,
+					  void *data)
+{
+	struct vfio_group *group;
+	int ret = -ENODEV;
+
+	list_for_each_entry(group, &container->group_list, container_next) {
+		ret = driver->ops->attach_group(data, group->iommu_group);
+		if (ret)
+			goto unwind;
+	}
+
+	return ret;
+
+unwind:
+	list_for_each_entry_continue_reverse(group, &container->group_list,
+					     container_next) {
+		driver->ops->detach_group(data, group->iommu_group);
+	}
+
+	return ret;
+}
+
+static long vfio_ioctl_set_iommu(struct vfio_container *container,
+				 unsigned long arg)
+{
+	struct vfio_iommu_driver *driver;
+	long ret = -ENODEV;
+
+	down_write(&container->group_lock);
+
+	/*
+	 * The container is designed to be an unprivileged interface while
+	 * the group can be assigned to specific users.  Therefore, only by
+	 * adding a group to a container does the user get the privilege of
+	 * enabling the iommu, which may allocate finite resources.  There
+	 * is no unset_iommu, but by removing all the groups from a container,
+	 * the container is deprivileged and returns to an unset state.
+	 */
+	if (list_empty(&container->group_list) || container->iommu_driver) {
+		up_write(&container->group_lock);
+		return -EINVAL;
+	}
+
+	mutex_lock(&vfio.iommu_drivers_lock);
+	list_for_each_entry(driver, &vfio.iommu_drivers_list, vfio_next) {
+		void *data;
+
+		if (!try_module_get(driver->ops->owner))
+			continue;
+
+		/*
+		 * The arg magic for SET_IOMMU is the same as CHECK_EXTENSION,
+		 * so test which iommu driver reported support for this
+		 * extension and call open on them.  We also pass them the
+		 * magic, allowing a single driver to support multiple
+		 * interfaces if they'd like.
+		 */
+		if (driver->ops->ioctl(NULL, VFIO_CHECK_EXTENSION, arg) <= 0) {
+			module_put(driver->ops->owner);
+			continue;
+		}
+
+		/* module reference holds the driver we're working on */
+		mutex_unlock(&vfio.iommu_drivers_lock);
+
+		data = driver->ops->open(arg);
+		if (IS_ERR(data)) {
+			ret = PTR_ERR(data);
+			module_put(driver->ops->owner);
+			goto skip_drivers_unlock;
+		}
+
+		ret = __vfio_container_attach_groups(container, driver, data);
+		if (!ret) {
+			container->iommu_driver = driver;
+			container->iommu_data = data;
+		} else {
+			driver->ops->release(data);
+			module_put(driver->ops->owner);
+		}
+
+		goto skip_drivers_unlock;
+	}
+
+	mutex_unlock(&vfio.iommu_drivers_lock);
+skip_drivers_unlock:
+	up_write(&container->group_lock);
+
+	return ret;
+}
+
+static long vfio_fops_unl_ioctl(struct file *filep,
+				unsigned int cmd, unsigned long arg)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver;
+	void *data;
+	long ret = -EINVAL;
+
+	if (!container)
+		return ret;
+
+	switch (cmd) {
+	case VFIO_GET_API_VERSION:
+		ret = VFIO_API_VERSION;
+		break;
+	case VFIO_CHECK_EXTENSION:
+		ret = vfio_ioctl_check_extension(container, arg);
+		break;
+	case VFIO_SET_IOMMU:
+		ret = vfio_ioctl_set_iommu(container, arg);
+		break;
+	default:
+		down_read(&container->group_lock);
+
+		driver = container->iommu_driver;
+		data = container->iommu_data;
+
+		if (driver) /* passthrough all unrecognized ioctls */
+			ret = driver->ops->ioctl(data, cmd, arg);
+
+		up_read(&container->group_lock);
+	}
+
+	return ret;
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_fops_compat_ioctl(struct file *filep,
+				   unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+static int vfio_fops_open(struct inode *inode, struct file *filep)
+{
+	struct vfio_container *container;
+
+	container = kzalloc(sizeof(*container), GFP_KERNEL);
+	if (!container)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&container->group_list);
+	init_rwsem(&container->group_lock);
+	kref_init(&container->kref);
+
+	filep->private_data = container;
+
+	return 0;
+}
+
+static int vfio_fops_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_container *container = filep->private_data;
+
+	filep->private_data = NULL;
+
+	vfio_container_put(container);
+
+	return 0;
+}
+
+/*
+ * Once an iommu driver is set, we optionally pass read/write/mmap
+ * on to the driver, allowing management interfaces beyond ioctl.
+ */
+static ssize_t vfio_fops_read(struct file *filep, char __user *buf,
+			      size_t count, loff_t *ppos)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver;
+	ssize_t ret = -EINVAL;
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->read))
+		ret = driver->ops->read(container->iommu_data,
+					buf, count, ppos);
+
+	up_read(&container->group_lock);
+
+	return ret;
+}
+
+static ssize_t vfio_fops_write(struct file *filep, const char __user *buf,
+			       size_t count, loff_t *ppos)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver;
+	ssize_t ret = -EINVAL;
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->write))
+		ret = driver->ops->write(container->iommu_data,
+					 buf, count, ppos);
+
+	up_read(&container->group_lock);
+
+	return ret;
+}
+
+static int vfio_fops_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver;
+	int ret = -EINVAL;
+
+	down_read(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (likely(driver && driver->ops->mmap))
+		ret = driver->ops->mmap(container->iommu_data, vma);
+
+	up_read(&container->group_lock);
+
+	return ret;
+}
+
+static const struct file_operations vfio_fops = {
+	.owner		= THIS_MODULE,
+	.open		= vfio_fops_open,
+	.release	= vfio_fops_release,
+	.read		= vfio_fops_read,
+	.write		= vfio_fops_write,
+	.unlocked_ioctl	= vfio_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_fops_compat_ioctl,
+#endif
+	.mmap		= vfio_fops_mmap,
+};
+
+/**
+ * VFIO Group fd, /dev/vfio/$GROUP
+ */
+static void __vfio_group_unset_container(struct vfio_group *group)
+{
+	struct vfio_container *container = group->container;
+	struct vfio_iommu_driver *driver;
+
+	down_write(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (driver)
+		driver->ops->detach_group(container->iommu_data,
+					  group->iommu_group);
+
+	group->container = NULL;
+	list_del(&group->container_next);
+
+	/* Detaching the last group deprivileges a container, remove iommu */
+	if (driver && list_empty(&container->group_list)) {
+		driver->ops->release(container->iommu_data);
+		module_put(driver->ops->owner);
+		container->iommu_driver = NULL;
+		container->iommu_data = NULL;
+	}
+
+	up_write(&container->group_lock);
+
+	vfio_container_put(container);
+}
+
+/*
+ * VFIO_GROUP_UNSET_CONTAINER should fail if there are other users or
+ * if there was no container to unset.  Since the ioctl is called on
+ * the group, we know that still exists, therefore the only valid
+ * transition here is 1->0.
+ */
+static int vfio_group_unset_container(struct vfio_group *group)
+{
+	int users = atomic_cmpxchg(&group->container_users, 1, 0);
+
+	if (!users)
+		return -EINVAL;
+	if (users != 1)
+		return -EBUSY;
+
+	__vfio_group_unset_container(group);
+
+	return 0;
+}
+
+/*
+ * When removing container users, anything that removes the last user
+ * implicitly removes the group from the container.  That is, if the
+ * group file descriptor is closed, as well as any device file descriptors,
+ * the group is free.
+ */
+static void vfio_group_try_dissolve_container(struct vfio_group *group)
+{
+	if (atomic_dec_if_positive(&group->container_users) == 0)
+		__vfio_group_unset_container(group);
+}
+
+static int vfio_group_set_container(struct vfio_group *group, int container_fd)
+{
+	struct fd f;
+	struct vfio_container *container;
+	struct vfio_iommu_driver *driver;
+	int ret = 0;
+
+	if (atomic_read(&group->container_users))
+		return -EINVAL;
+
+	f = fdget(container_fd);
+	if (!f.file)
+		return -EBADF;
+
+	/* Sanity check, is this really our fd? */
+	if (f.file->f_op != &vfio_fops) {
+		fdput(f);
+		return -EINVAL;
+	}
+
+	container = f.file->private_data;
+	WARN_ON(!container); /* fget ensures we don't race vfio_release */
+
+	down_write(&container->group_lock);
+
+	driver = container->iommu_driver;
+	if (driver) {
+		ret = driver->ops->attach_group(container->iommu_data,
+						group->iommu_group);
+		if (ret)
+			goto unlock_out;
+	}
+
+	group->container = container;
+	list_add(&group->container_next, &container->group_list);
+
+	/* Get a reference on the container and mark a user within the group */
+	vfio_container_get(container);
+	atomic_inc(&group->container_users);
+
+unlock_out:
+	up_write(&container->group_lock);
+	fdput(f);
+	return ret;
+}
+
+static bool vfio_group_viable(struct vfio_group *group)
+{
+	return (iommu_group_for_each_dev(group->iommu_group,
+					 group, vfio_dev_viable) == 0);
+}
+
+static const struct file_operations vfio_device_fops;
+
+static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
+{
+	struct vfio_device *device;
+	struct file *filep;
+	int ret;
+
+	if (atomic_read(&group->container_users) == 0 ||
+	    !group->container->iommu_driver || !vfio_group_viable(group))
+		return -EINVAL;
+
+	device = vfio_device_get_from_name(group, buf);
+	if (!device)
+		return -ENODEV;
+
+	ret = device->ops->open(device->device_data);
+	if (ret) {
+		vfio_device_put(device);
+		return ret;
+	}
+
+	/*
+	 * We can't use anon_inode_getfd() because we need to modify
+	 * the f_mode flags directly to allow more than just ioctls
+	 */
+	ret = get_unused_fd_flags(O_CLOEXEC);
+	if (ret < 0) {
+		device->ops->release(device->device_data);
+		vfio_device_put(device);
+		return ret;
+	}
+
+	filep = anon_inode_getfile("[vfio-device]", &vfio_device_fops,
+				   device, O_RDWR);
+	if (IS_ERR(filep)) {
+		put_unused_fd(ret);
+		ret = PTR_ERR(filep);
+		device->ops->release(device->device_data);
+		vfio_device_put(device);
+		return ret;
+	}
+
+	/*
+	 * TODO: add an anon_inode interface to do this.
+	 * Appears to be missing by lack of need rather than
+	 * explicitly prevented.  Now there's need.
+	 */
+	filep->f_mode |= (FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE);
+
+	atomic_inc(&group->container_users);
+
+	fd_install(ret, filep);
+
+	return ret;
+}
+
+static long vfio_group_fops_unl_ioctl(struct file *filep,
+				      unsigned int cmd, unsigned long arg)
+{
+	struct vfio_group *group = filep->private_data;
+	long ret = -ENOTTY;
+
+	switch (cmd) {
+	case VFIO_GROUP_GET_STATUS:
+	{
+		struct vfio_group_status status;
+		unsigned long minsz;
+
+		minsz = offsetofend(struct vfio_group_status, flags);
+
+		if (copy_from_user(&status, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (status.argsz < minsz)
+			return -EINVAL;
+
+		status.flags = 0;
+
+		if (vfio_group_viable(group))
+			status.flags |= VFIO_GROUP_FLAGS_VIABLE;
+
+		if (group->container)
+			status.flags |= VFIO_GROUP_FLAGS_CONTAINER_SET;
+
+		if (copy_to_user((void __user *)arg, &status, minsz))
+			return -EFAULT;
+
+		ret = 0;
+		break;
+	}
+	case VFIO_GROUP_SET_CONTAINER:
+	{
+		int fd;
+
+		if (get_user(fd, (int __user *)arg))
+			return -EFAULT;
+
+		if (fd < 0)
+			return -EINVAL;
+
+		ret = vfio_group_set_container(group, fd);
+		break;
+	}
+	case VFIO_GROUP_UNSET_CONTAINER:
+		ret = vfio_group_unset_container(group);
+		break;
+	case VFIO_GROUP_GET_DEVICE_FD:
+	{
+		char *buf;
+
+		buf = strndup_user((const char __user *)arg, PAGE_SIZE);
+		if (IS_ERR(buf))
+			return PTR_ERR(buf);
+
+		ret = vfio_group_get_device_fd(group, buf);
+		kfree(buf);
+		break;
+	}
+	}
+
+	return ret;
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_group_fops_compat_ioctl(struct file *filep,
+					 unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_group_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+static int vfio_group_fops_open(struct inode *inode, struct file *filep)
+{
+	struct vfio_group *group;
+	int opened;
+
+	group = vfio_group_get_from_minor(iminor(inode));
+	if (!group)
+		return -ENODEV;
+
+	/* Do we need multiple instances of the group open?  Seems not. */
+	opened = atomic_cmpxchg(&group->opened, 0, 1);
+	if (opened) {
+		vfio_group_put(group);
+		return -EBUSY;
+	}
+
+	/* Is something still in use from a previous open? */
+	if (group->container) {
+		atomic_dec(&group->opened);
+		vfio_group_put(group);
+		return -EBUSY;
+	}
+
+	filep->private_data = group;
+
+	return 0;
+}
+
+static int vfio_group_fops_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_group *group = filep->private_data;
+
+	filep->private_data = NULL;
+
+	vfio_group_try_dissolve_container(group);
+
+	atomic_dec(&group->opened);
+
+	vfio_group_put(group);
+
+	return 0;
+}
+
+static const struct file_operations vfio_group_fops = {
+	.owner		= THIS_MODULE,
+	.unlocked_ioctl	= vfio_group_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_group_fops_compat_ioctl,
+#endif
+	.open		= vfio_group_fops_open,
+	.release	= vfio_group_fops_release,
+};
+
+/**
+ * VFIO Device fd
+ */
+static int vfio_device_fops_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_device *device = filep->private_data;
+
+	device->ops->release(device->device_data);
+
+	vfio_group_try_dissolve_container(device->group);
+
+	vfio_device_put(device);
+
+	return 0;
+}
+
+static long vfio_device_fops_unl_ioctl(struct file *filep,
+				       unsigned int cmd, unsigned long arg)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->ioctl))
+		return -EINVAL;
+
+	return device->ops->ioctl(device->device_data, cmd, arg);
+}
+
+static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
+				     size_t count, loff_t *ppos)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->read))
+		return -EINVAL;
+
+	return device->ops->read(device->device_data, buf, count, ppos);
+}
+
+static ssize_t vfio_device_fops_write(struct file *filep,
+				      const char __user *buf,
+				      size_t count, loff_t *ppos)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->write))
+		return -EINVAL;
+
+	return device->ops->write(device->device_data, buf, count, ppos);
+}
+
+static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct vfio_device *device = filep->private_data;
+
+	if (unlikely(!device->ops->mmap))
+		return -EINVAL;
+
+	return device->ops->mmap(device->device_data, vma);
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_device_fops_compat_ioctl(struct file *filep,
+					  unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_device_fops_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+static const struct file_operations vfio_device_fops = {
+	.owner		= THIS_MODULE,
+	.release	= vfio_device_fops_release,
+	.read		= vfio_device_fops_read,
+	.write		= vfio_device_fops_write,
+	.unlocked_ioctl	= vfio_device_fops_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_device_fops_compat_ioctl,
+#endif
+	.mmap		= vfio_device_fops_mmap,
+};
+
+/**
+ * External user API, exported by symbols to be linked dynamically.
+ *
+ * The protocol includes:
+ *  1. do normal VFIO init operation:
+ *	- opening a new container;
+ *	- attaching group(s) to it;
+ *	- setting an IOMMU driver for a container.
+ * When IOMMU is set for a container, all groups in it are
+ * considered ready to use by an external user.
+ *
+ * 2. User space passes a group fd to an external user.
+ * The external user calls vfio_group_get_external_user()
+ * to verify that:
+ *	- the group is initialized;
+ *	- IOMMU is set for it.
+ * If both checks passed, vfio_group_get_external_user()
+ * increments the container user counter to prevent
+ * the VFIO group from disposal before KVM exits.
+ *
+ * 3. The external user calls vfio_external_user_iommu_id()
+ * to know an IOMMU ID.
+ *
+ * 4. When the external KVM finishes, it calls
+ * vfio_group_put_external_user() to release the VFIO group.
+ * This call decrements the container user counter.
+ */
+struct vfio_group *vfio_group_get_external_user(struct file *filep)
+{
+	struct vfio_group *group = filep->private_data;
+
+	if (filep->f_op != &vfio_group_fops)
+		return ERR_PTR(-EINVAL);
+
+	if (!atomic_inc_not_zero(&group->container_users))
+		return ERR_PTR(-EINVAL);
+
+	if (!group->container->iommu_driver ||
+	    !vfio_group_viable(group)) {
+		atomic_dec(&group->container_users);
+		return ERR_PTR(-EINVAL);
+	}
+
+	vfio_group_get(group);
+
+	return group;
+}
+EXPORT_SYMBOL_GPL(vfio_group_get_external_user);
+
+void vfio_group_put_external_user(struct vfio_group *group)
+{
+	vfio_group_try_dissolve_container(group);
+	vfio_group_put(group);
+}
+EXPORT_SYMBOL_GPL(vfio_group_put_external_user);
+
+int vfio_external_user_iommu_id(struct vfio_group *group)
+{
+	return iommu_group_id(group->iommu_group);
+}
+EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id);
+
+long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
+{
+	return vfio_ioctl_check_extension(group->container, arg);
+}
+EXPORT_SYMBOL_GPL(vfio_external_check_extension);
+
+/**
+ * Module/class support
+ */
+static char *vfio_devnode(struct device *dev, umode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
+}
+
+static struct miscdevice vfio_dev = {
+	.minor = VFIO_MINOR,
+	.name = "vfio",
+	.fops = &vfio_fops,
+	.nodename = "vfio/vfio",
+	.mode = S_IRUGO | S_IWUGO,
+};
+
+static int __init vfio_init(void)
+{
+	int ret;
+
+	idr_init(&vfio.group_idr);
+	mutex_init(&vfio.group_lock);
+	mutex_init(&vfio.iommu_drivers_lock);
+	INIT_LIST_HEAD(&vfio.group_list);
+	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
+	init_waitqueue_head(&vfio.release_q);
+
+	ret = misc_register(&vfio_dev);
+	if (ret) {
+		pr_err("vfio: misc device register failed\n");
+		return ret;
+	}
+
+	/* /dev/vfio/$GROUP */
+	vfio.class = class_create(THIS_MODULE, "vfio");
+	if (IS_ERR(vfio.class)) {
+		ret = PTR_ERR(vfio.class);
+		goto err_class;
+	}
+
+	vfio.class->devnode = vfio_devnode;
+
+	ret = alloc_chrdev_region(&vfio.group_devt, 0, MINORMASK, "vfio");
+	if (ret)
+		goto err_alloc_chrdev;
+
+	cdev_init(&vfio.group_cdev, &vfio_group_fops);
+	ret = cdev_add(&vfio.group_cdev, vfio.group_devt, MINORMASK);
+	if (ret)
+		goto err_cdev_add;
+
+	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
+
+	/*
+	 * Attempt to load known iommu-drivers.  This gives us a working
+	 * environment without the user needing to explicitly load iommu
+	 * drivers.
+	 */
+	request_module_nowait("vfio_iommu_type1");
+	request_module_nowait("vfio_iommu_spapr_tce");
+
+	return 0;
+
+err_cdev_add:
+	unregister_chrdev_region(vfio.group_devt, MINORMASK);
+err_alloc_chrdev:
+	class_destroy(vfio.class);
+	vfio.class = NULL;
+err_class:
+	misc_deregister(&vfio_dev);
+	return ret;
+}
+
+static void __exit vfio_cleanup(void)
+{
+	WARN_ON(!list_empty(&vfio.group_list));
+
+	idr_destroy(&vfio.group_idr);
+	cdev_del(&vfio.group_cdev);
+	unregister_chrdev_region(vfio.group_devt, MINORMASK);
+	class_destroy(vfio.class);
+	vfio.class = NULL;
+	misc_deregister(&vfio_dev);
+}
+
+module_init(vfio_init);
+module_exit(vfio_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
+MODULE_ALIAS_MISCDEV(VFIO_MINOR);
+MODULE_ALIAS("devname:vfio/vfio");



* [RFC PATCH 2/2] vfio: Include no-iommu mode
From: Alex Williamson @ 2015-10-09 18:41 UTC (permalink / raw)
  To: alex.williamson
  Cc: avi, avi, gleb, corbet, bruce.richardson, mst, linux-kernel,
	alexander.duyck, gleb, stephen, vladz, iommu, hjk, gregkh

There is really no way to safely give a user full access to a DMA
capable PCI device without an IOMMU to protect the host from errant
DMA.  There is also no way to provide DMA translation, which is needed
for use cases such as device assignment to virtual machines.  However,
there are still users who want userspace drivers under those
conditions.  The UIO driver exists for this use case, but does not
provide the degree of device access and programming that VFIO has.  In
an effort to avoid code duplication, this introduces a No-IOMMU mode
for VFIO.

This mode requires enabling CONFIG_VFIO_NOIOMMU and loading the vfio
module with the option "enable_unsafe_pci_noiommu_mode".  This should
make it very clear that this mode is not safe.  In this mode, there is
no support for unprivileged users; CAP_SYS_ADMIN is required for
access to the necessary dev files.  Mixing no-iommu and secure VFIO is
also unsupported, as are any VFIO IOMMU backends other than the
vfio-noiommu backend.  Furthermore, unsafe group files are relocated
to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
kernel is tainted due to the dummy IOMMU put in place.  Unloading of
the module in this mode is also unsupported and will BUG due to the
lack of support for unregistering an IOMMU for a bus type.
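
In concrete terms, the opt-in described above would look roughly like
the following (a sketch based purely on the text of this patch;
CONFIG_VFIO=m is assumed so the parameter can be passed at load time):

```
# Build-time opt-in
CONFIG_VFIO=m
CONFIG_VFIO_NOIOMMU=y

# Load-time opt-in; deliberately verbose, and taints the kernel
modprobe vfio enable_unsafe_pci_noiommu_mode=1

# Group files appear under /dev/vfio-noiommu/ rather than /dev/vfio/,
# and opening them requires CAP_SYS_ADMIN
ls /dev/vfio-noiommu/
```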

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/Kconfig        |   15 +++
 drivers/vfio/Makefile       |    3 +
 drivers/vfio/vfio_core.c    |   40 +++++++++
 drivers/vfio/vfio_noiommu.c |  185 +++++++++++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio_private.h |   31 +++++++
 include/uapi/linux/vfio.h   |    2 
 6 files changed, 276 insertions(+)
 create mode 100644 drivers/vfio/vfio_noiommu.c
 create mode 100644 drivers/vfio/vfio_private.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 4540179..929de0d 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -31,5 +31,20 @@ menuconfig VFIO
 
 	  If you don't know what to do here, say N.
 
+menuconfig VFIO_NOIOMMU
+	bool "VFIO No IOMMU support"
+	depends on VFIO
+	help
+	  VFIO is built on the ability to isolate devices using the IOMMU.
+	  Only with an IOMMU can userspace access to DMA capable devices be
+	  considered secure.  VFIO No IOMMU support enables a dummy IOMMU
+	  for the purpose of re-using the VFIO infrastructure in a non-secure
+	  mode.  Use of this mode will result in an unsupportable kernel and
+	  will therefore taint the kernel.  Device assignment to virtual
+	  machines is also not possible with this mode since the dummy IOMMU
+	  cannot provide DMA translation.
+
+	  If you don't know what to do here, say N.
+
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index e8fa248..9736449 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,5 +1,8 @@
 vfio_virqfd-y := virqfd.o
 vfio-y := vfio_core.o
+ifdef CONFIG_VFIO_NOIOMMU
+vfio-y += vfio_noiommu.o
+endif
 
 obj-$(CONFIG_VFIO) += vfio.o
 obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
diff --git a/drivers/vfio/vfio_core.c b/drivers/vfio/vfio_core.c
index 563c510..495a490 100644
--- a/drivers/vfio/vfio_core.c
+++ b/drivers/vfio/vfio_core.c
@@ -34,10 +34,18 @@
 #include <linux/vfio.h>
 #include <linux/wait.h>
 
+#include "vfio_private.h"
+
 #define DRIVER_VERSION	"0.3"
 #define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
 #define DRIVER_DESC	"VFIO - User Level meta-driver"
 
+static bool pci_noiommu __read_mostly;
+#ifdef CONFIG_VFIO_NOIOMMU
+module_param_named(enable_unsafe_pci_noiommu_mode, pci_noiommu, bool, S_IRUGO);
+MODULE_PARM_DESC(enable_unsafe_pci_noiommu_mode, "Enable UNSAFE, no-IOMMU mode for PCI.  This mode provides no device isolation, no DMA translation, no host kernel protection, cannot be used for device assignment to virtual machines, requires root permissions, and will taint the kernel.  If you do not know what this is for, step away. (default: false)");
+#endif
+
 static struct vfio {
 	struct class			*class;
 	struct list_head		iommu_drivers_list;
@@ -101,6 +109,9 @@ int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops)
 {
 	struct vfio_iommu_driver *driver, *tmp;
 
+	if (pci_noiommu && strcmp(ops->name, "vfio-noiommu"))
+		return -EPERM;
+
 	driver = kzalloc(sizeof(*driver), GFP_KERNEL);
 	if (!driver)
 		return -ENOMEM;
@@ -998,6 +1009,9 @@ static int vfio_fops_open(struct inode *inode, struct file *filep)
 {
 	struct vfio_container *container;
 
+	if (pci_noiommu && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	container = kzalloc(sizeof(*container), GFP_KERNEL);
 	if (!container)
 		return -ENOMEM;
@@ -1221,6 +1235,9 @@ static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
 	    !group->container->iommu_driver || !vfio_group_viable(group))
 		return -EINVAL;
 
+	if (pci_noiommu && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	device = vfio_device_get_from_name(group, buf);
 	if (!device)
 		return -ENODEV;
@@ -1347,6 +1364,9 @@ static int vfio_group_fops_open(struct inode *inode, struct file *filep)
 	struct vfio_group *group;
 	int opened;
 
+	if (pci_noiommu && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
 	group = vfio_group_get_from_minor(iminor(inode));
 	if (!group)
 		return -ENODEV;
@@ -1507,6 +1527,9 @@ struct vfio_group *vfio_group_get_external_user(struct file *filep)
 {
 	struct vfio_group *group = filep->private_data;
 
+	if (pci_noiommu)
+		return ERR_PTR(-EPERM);
+
 	if (filep->f_op != &vfio_group_fops)
 		return ERR_PTR(-EINVAL);
 
@@ -1549,6 +1572,9 @@ EXPORT_SYMBOL_GPL(vfio_external_check_extension);
  */
 static char *vfio_devnode(struct device *dev, umode_t *mode)
 {
+	if (pci_noiommu)
+		return kasprintf(GFP_KERNEL, "vfio-noiommu/%s", dev_name(dev));
+
 	return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
 }
 
@@ -1564,6 +1590,12 @@ static int __init vfio_init(void)
 {
 	int ret;
 
+	if (pci_noiommu) {
+		ret = vfio_noiommu_iommu_init(&pci_bus_type);
+		if (ret)
+			return ret;
+	}
+
 	idr_init(&vfio.group_idr);
 	mutex_init(&vfio.group_lock);
 	mutex_init(&vfio.iommu_drivers_lock);
@@ -1605,6 +1637,9 @@ static int __init vfio_init(void)
 	request_module_nowait("vfio_iommu_type1");
 	request_module_nowait("vfio_iommu_spapr_tce");
 
+	if (pci_noiommu)
+		vfio_noiommu_init();
+
 	return 0;
 
 err_cdev_add:
@@ -1621,6 +1656,11 @@ static void __exit vfio_cleanup(void)
 {
 	WARN_ON(!list_empty(&vfio.group_list));
 
+	if (pci_noiommu) {
+		vfio_noiommu_cleanup();
+		vfio_noiommu_iommu_cleanup(&pci_bus_type);
+	}
+
 	idr_destroy(&vfio.group_idr);
 	cdev_del(&vfio.group_cdev);
 	unregister_chrdev_region(vfio.group_devt, MINORMASK);
diff --git a/drivers/vfio/vfio_noiommu.c b/drivers/vfio/vfio_noiommu.c
new file mode 100644
index 0000000..b12bb290
--- /dev/null
+++ b/drivers/vfio/vfio_noiommu.c
@@ -0,0 +1,185 @@
+/*
+ * VFIO - No IOMMU support
+ *
+ * Copyright (C) 2015 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/device.h>
+#include <linux/iommu.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+/*
+ * VFIO is fundamentally built on IOMMU groups but we have a lot of
+ * infrastructure for exposing devices to userspace.  There is no way to
+ * make userspace device access safe without IOMMU protection, but some
+ * users want to do it anyway.  Allow, but make it very, very clear that
+ * there's nothing safe about this.
+ */
+
+struct vfio_noiommu_domain {
+	struct iommu_domain domain;
+	/* mutex to avoid attach/detach race? */
+	struct device *dev;
+};
+
+static struct iommu_domain *vfio_noiommu_domain_alloc(unsigned type)
+{
+	if (type != IOMMU_DOMAIN_UNMANAGED)
+		return NULL;
+
+	return kzalloc(sizeof(struct vfio_noiommu_domain), GFP_KERNEL);
+}
+
+static void vfio_noiommu_domain_free(struct iommu_domain *domain)
+{
+	kfree(domain);
+}
+
+static int vfio_noiommu_attach_dev(struct iommu_domain *domain,
+				   struct device *dev)
+{
+	struct vfio_noiommu_domain *vdomain;
+
+	vdomain = container_of(domain, struct vfio_noiommu_domain, domain);
+
+	if (vdomain->dev)
+		return -EBUSY;
+
+	vdomain->dev = dev;
+	return 0;
+}
+
+static void vfio_noiommu_detach_dev(struct iommu_domain *domain,
+				    struct device *dev)
+{
+	struct vfio_noiommu_domain *vdomain;
+
+	vdomain = container_of(domain, struct vfio_noiommu_domain, domain);
+
+	vdomain->dev = NULL;
+}
+
+static int vfio_noiommu_add_device(struct device *dev)
+{
+	struct iommu_group *group = iommu_group_alloc();
+	int ret;
+
+	if (IS_ERR(group))
+		return PTR_ERR(group);
+
+	ret = iommu_group_add_device(group, dev);
+	iommu_group_put(group);
+	return ret;
+}
+
+static void vfio_noiommu_remove_device(struct device *dev)
+{
+	iommu_group_remove_device(dev);
+}
+
+static struct iommu_ops vfio_noiommu = {
+	.domain_alloc = vfio_noiommu_domain_alloc,
+	.domain_free = vfio_noiommu_domain_free,
+	.attach_dev = vfio_noiommu_attach_dev,
+	.detach_dev = vfio_noiommu_detach_dev,
+	.add_device = vfio_noiommu_add_device,
+	.remove_device = vfio_noiommu_remove_device,
+};
+
+int vfio_noiommu_iommu_init(struct bus_type *bus)
+{
+	int ret;
+
+	if (iommu_present(bus)) {
+		pr_warn("IOMMU present on bus %s, "
+			"cannot register vfio-noiommu\n", bus->name);
+		return -EBUSY;
+	}
+
+	ret = bus_set_iommu(bus, &vfio_noiommu);
+	if (!ret) {
+		add_taint(TAINT_USER, LOCKDEP_STILL_OK);
+		pr_warn("vfio-noiommu IOMMU driver registered on bus %s, "
+			"kernel tainted\n", bus->name);
+	}
+
+	return ret;
+}
+
+void vfio_noiommu_iommu_cleanup(struct bus_type *bus)
+{
+	BUG(); /* Unsetting an IOMMU from a bus is not supported */
+}
+
+static void *vfio_noiommu_open(unsigned long arg)
+{
+	if (arg != VFIO_NOIOMMU_IOMMU)
+		return ERR_PTR(-EINVAL);
+	if (!capable(CAP_SYS_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	/* TODO: Should probably at least track pinnings */
+
+	return NULL;
+}
+
+static void vfio_noiommu_release(void *iommu_data)
+{
+	/* TODO: Track stuff and unpin */
+}
+
+static int vfio_noiommu_attach_group(void *iommu_data,
+				     struct iommu_group *iommu_group)
+{
+	return 0;
+}
+
+static void vfio_noiommu_detach_group(void *iommu_data,
+				      struct iommu_group *iommu_group)
+{
+}
+
+static long vfio_noiommu_ioctl(void *iommu_data,
+			       unsigned int cmd, unsigned long arg)
+{
+	if (cmd == VFIO_CHECK_EXTENSION) {
+		switch (arg) {
+		case VFIO_NOIOMMU_IOMMU:
+			return 1;
+		default:
+			return 0;
+		}
+	}
+
+	return -ENOTTY;
+}
+
+static struct vfio_iommu_driver_ops vfio_noiommu_ops = {
+	.name = "vfio-noiommu",
+	.owner = THIS_MODULE,
+	.open = vfio_noiommu_open,
+	.release = vfio_noiommu_release,
+	.ioctl = vfio_noiommu_ioctl,
+	.attach_group = vfio_noiommu_attach_group,
+	.detach_group = vfio_noiommu_detach_group,
+};
+
+int vfio_noiommu_init(void)
+{
+	return vfio_register_iommu_driver(&vfio_noiommu_ops);
+}
+
+void vfio_noiommu_cleanup(void)
+{
+	vfio_unregister_iommu_driver(&vfio_noiommu_ops);
+}
diff --git a/drivers/vfio/vfio_private.h b/drivers/vfio/vfio_private.h
new file mode 100644
index 0000000..6fe073d
--- /dev/null
+++ b/drivers/vfio/vfio_private.h
@@ -0,0 +1,31 @@
+/*
+ * Copyright (C) 2015 Red Hat, Inc.  All rights reserved
+ * Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#ifndef VFIO_PRIVATE_H
+#define VFIO_PRIVATE_H
+
+#ifdef CONFIG_VFIO_NOIOMMU
+int vfio_noiommu_iommu_init(struct bus_type *bus);
+void vfio_noiommu_iommu_cleanup(struct bus_type *bus);
+int vfio_noiommu_init(void);
+void vfio_noiommu_cleanup(void);
+#else
+static inline int vfio_noiommu_iommu_init(struct bus_type *bus)
+{
+	return -ENODEV;
+}
+static inline void vfio_noiommu_iommu_cleanup(struct bus_type *bus) { }
+static inline int vfio_noiommu_init(void)
+{
+	return -ENODEV;
+}
+static inline void vfio_noiommu_cleanup(void) { }
+#endif
+#endif /* VFIO_PRIVATE_H */
+
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9fd7b5d..c221614 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -38,6 +38,8 @@
 
 #define VFIO_SPAPR_TCE_v2_IOMMU		7
 
+#define VFIO_NOIOMMU_IOMMU		8
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between



* Re: [RFC PATCH 1/2] vfio: Move vfio.c vfio_core.c
  2015-10-09 18:41 ` [RFC PATCH 1/2] vfio: Move vfio.c vfio_core.c Alex Williamson
@ 2015-10-09 19:21   ` Greg KH
  0 siblings, 0 replies; 21+ messages in thread
From: Greg KH @ 2015-10-09 19:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: avi, avi, gleb, corbet, bruce.richardson, mst, linux-kernel,
	alexander.duyck, gleb, stephen, vladz, iommu, hjk

On Fri, Oct 09, 2015 at 12:41:03PM -0600, Alex Williamson wrote:
> This allows us to more easily create a "vfio" module that includes
> multiple files.  No code change, rename and Makefile update only.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  drivers/vfio/Makefile    |    1 
>  drivers/vfio/vfio.c      | 1640 ----------------------------------------------
>  drivers/vfio/vfio_core.c | 1640 ++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 1641 insertions(+), 1640 deletions(-)
>  delete mode 100644 drivers/vfio/vfio.c
>  create mode 100644 drivers/vfio/vfio_core.c

Use -M to git format-patch so we can see the rename, not a delete/add
patch please.

thanks,

greg k-h


* Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
  2015-10-09 18:41 ` [RFC PATCH 2/2] vfio: Include no-iommu mode Alex Williamson
@ 2015-10-11  8:12   ` Avi Kivity
  2015-10-11  8:57     ` Michael S. Tsirkin
  2015-10-11 21:16     ` Alex Williamson
  2015-10-12 15:56   ` Stephen Hemminger
  1 sibling, 2 replies; 21+ messages in thread
From: Avi Kivity @ 2015-10-11  8:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: avi, gleb, corbet, bruce.richardson, mst, linux-kernel,
	alexander.duyck, gleb, stephen, vladz, iommu, hjk, gregkh



On 10/09/2015 09:41 PM, Alex Williamson wrote:
> There is really no way to safely give a user full access to a PCI
> device without an IOMMU to protect the host from errant DMA.  There is
> also no way to provide DMA translation, for use cases such as device
> assignment to virtual machines.  However, there are still those users
> that want userspace drivers under those conditions.  The UIO driver
> exists for this use case, but does not provide the degree of device
> access and programming that VFIO has.  In an effort to avoid code
> duplication, this introduces a No-IOMMU mode for VFIO.
>
> This mode requires enabling CONFIG_VFIO_NOIOMMU and loading the vfio
> module with the option "enable_unsafe_pci_noiommu_mode".  This should
> make it very clear that this mode is not safe.  In this mode, there is
> no support for unprivileged users, CAP_SYS_ADMIN is required for
> access to the necessary dev files.

CAP_SYS_RAWIO seems a better match (in particular, it allows access to 
/dev/mem, which is the same thing).

>    Mixing no-iommu and secure VFIO is
> also unsupported, as are any VFIO IOMMU backends other than the
> vfio-noiommu backend.  Furthermore, unsafe group files are relocated
> to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
> kernel is tainted due to the dummy IOMMU put in place.  Unloading of
> the module in this mode is also unsupported and will BUG due to the
> lack of support for unregistering an IOMMU for a bus type.

I did not see an API for detecting whether memory translation is 
provided or not.  We can have the caller guess this by looking at the 
device name, or by requiring the user to specify this, but I think it's 
cleaner to provide programmatic access to this attribute.



* Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
  2015-10-11  8:12   ` Avi Kivity
@ 2015-10-11  8:57     ` Michael S. Tsirkin
  2015-10-11  9:03       ` Avi Kivity
  2015-10-11 21:16     ` Alex Williamson
  1 sibling, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2015-10-11  8:57 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alex Williamson, avi, gleb, corbet, bruce.richardson,
	linux-kernel, alexander.duyck, gleb, stephen, vladz, iommu, hjk,
	gregkh

On Sun, Oct 11, 2015 at 11:12:14AM +0300, Avi Kivity wrote:
> >   Mixing no-iommu and secure VFIO is
> >also unsupported, as are any VFIO IOMMU backends other than the
> >vfio-noiommu backend.  Furthermore, unsafe group files are relocated
> >to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
> >kernel is tainted due to the dummy IOMMU put in place.  Unloading of
> >the module in this mode is also unsupported and will BUG due to the
> >lack of support for unregistering an IOMMU for a bus type.
> 
> I did not see an API for detecting whether memory translation is provided or
> not.  We can have the caller guess this by looking at the device name, or by
> requiring the user to specify this, but I think it's cleaner to provide
> programmatic access to this attribute.

It seems that caller can just check for VFIO_NOIOMMU_IOMMU.

Isn't this why it's there?

VFIO_IOMMU_MAP_DMA, VFIO_IOMMU_ENABLE and VFIO_IOMMU_DISABLE
will probably also fail ...

-- 
MST


* Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
  2015-10-11  8:57     ` Michael S. Tsirkin
@ 2015-10-11  9:03       ` Avi Kivity
  2015-10-11  9:19         ` Michael S. Tsirkin
  0 siblings, 1 reply; 21+ messages in thread
From: Avi Kivity @ 2015-10-11  9:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Alex Williamson, avi, gleb, corbet, bruce.richardson,
	linux-kernel, alexander.duyck, gleb, stephen, vladz, iommu, hjk,
	gregkh



On 10/11/2015 11:57 AM, Michael S. Tsirkin wrote:
> On Sun, Oct 11, 2015 at 11:12:14AM +0300, Avi Kivity wrote:
>>>    Mixing no-iommu and secure VFIO is
>>> also unsupported, as are any VFIO IOMMU backends other than the
>>> vfio-noiommu backend.  Furthermore, unsafe group files are relocated
>>> to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
>>> kernel is tainted due to the dummy IOMMU put in place.  Unloading of
>>> the module in this mode is also unsupported and will BUG due to the
>>> lack of support for unregistering an IOMMU for a bus type.
>> I did not see an API for detecting whether memory translation is provided or
>> not.  We can have the caller guess this by looking at the device name, or by
>> requiring the user to specify this, but I think it's cleaner to provide
>> programmatic access to this attribute.
> It seems that caller can just check for VFIO_NOIOMMU_IOMMU.
>
> Isn't this why it's there?

That just means the capability is there, not that it's active.

But since you must pass the same value to open(), you already know that 
you're using noiommu.

> VFIO_IOMMU_MAP_DMA, VFIO_IOMMU_ENABLE and VFIO_IOMMU_DISABLE
> will probably also fail ...
>

Don't you have to call MAP_DMA to pin the memory?



* Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
  2015-10-11  9:03       ` Avi Kivity
@ 2015-10-11  9:19         ` Michael S. Tsirkin
  2015-10-11  9:23           ` Gleb Natapov
  0 siblings, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2015-10-11  9:19 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Alex Williamson, avi, gleb, corbet, bruce.richardson,
	linux-kernel, alexander.duyck, gleb, stephen, vladz, iommu, hjk,
	gregkh

On Sun, Oct 11, 2015 at 12:03:17PM +0300, Avi Kivity wrote:
> 
> 
> On 10/11/2015 11:57 AM, Michael S. Tsirkin wrote:
> >On Sun, Oct 11, 2015 at 11:12:14AM +0300, Avi Kivity wrote:
> >>>   Mixing no-iommu and secure VFIO is
> >>>also unsupported, as are any VFIO IOMMU backends other than the
> >>>vfio-noiommu backend.  Furthermore, unsafe group files are relocated
> >>>to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
> >>>kernel is tainted due to the dummy IOMMU put in place.  Unloading of
> >>>the module in this mode is also unsupported and will BUG due to the
> >>>lack of support for unregistering an IOMMU for a bus type.
> >>I did not see an API for detecting whether memory translation is provided or
> >>not.  We can have the caller guess this by looking at the device name, or by
> >>requiring the user to specify this, but I think it's cleaner to provide
> >>programmatic access to this attribute.
> >It seems that caller can just check for VFIO_NOIOMMU_IOMMU.
> >
> >Isn't this why it's there?
> 
> That just means the capability is there, not that it's active.

Well it's currently exactly the same.
I guess you can check the return value of VFIO_SET_IOMMU as well.

> But since you must pass the same value to open(), you already know that
> you're using noiommu.
> 
> >VFIO_IOMMU_MAP_DMA, VFIO_IOMMU_ENABLE and VFIO_IOMMU_DISABLE
> >will probably also fail ...
> >
> 
> Don't you have to call MAP_DMA to pin the memory?

Well check it out - the patch in question doesn't implement this ioctl.
In fact it doesn't implement anything except CHECK_EXTENSION.

And this makes sense to me: MAP_DMA
maps a virtual address to an IO address, and that doesn't
work for the dummy IOMMU.

You can pin memory in many other ways, including
mlock and hugetlbfs.

-- 
MST


* Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
  2015-10-11  9:19         ` Michael S. Tsirkin
@ 2015-10-11  9:23           ` Gleb Natapov
  0 siblings, 0 replies; 21+ messages in thread
From: Gleb Natapov @ 2015-10-11  9:23 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Alex Williamson, avi, corbet, bruce.richardson,
	linux-kernel, alexander.duyck, gleb, stephen, vladz, iommu, hjk,
	gregkh

On Sun, Oct 11, 2015 at 12:19:54PM +0300, Michael S. Tsirkin wrote:
> > But since you must pass the same value to open(), you already know that
> > you're using noiommu.
> > 
> > >VFIO_IOMMU_MAP_DMA, VFIO_IOMMU_ENABLE and VFIO_IOMMU_DISABLE
> > >will probably also fail ...
> > >
> > 
> > Don't you have to call MAP_DMA to pin the memory?
> 
> Well check it out - the patch in question doesn't implement this ioctl.
> In fact it doesn't implement anything except CHECK_EXTENSION.
> 
> And this makes sense to me: MAP_DMA
> maps a virtual address to io address, and that doesn't
> work for the dummy iommu.
> 
> You can pin memory in many other ways, including
> mlock and hugetlbfs.
> 
mlock() does not pin memory.

--
			Gleb.


* RE: [RFC PATCH 0/2] VFIO no-iommu
  2015-10-09 18:40 [RFC PATCH 0/2] VFIO no-iommu Alex Williamson
  2015-10-09 18:41 ` [RFC PATCH 1/2] vfio: Move vfio.c vfio_core.c Alex Williamson
  2015-10-09 18:41 ` [RFC PATCH 2/2] vfio: Include no-iommu mode Alex Williamson
@ 2015-10-11 17:29 ` Varun Sethi
  2015-10-11 18:23   ` Alex Williamson
  2015-10-11 18:28 ` Michael S. Tsirkin
  3 siblings, 1 reply; 21+ messages in thread
From: Varun Sethi @ 2015-10-11 17:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: avi, avi, gleb, mst, bruce.richardson, corbet, linux-kernel,
	alexander.duyck, gleb, stephen, vladz, iommu, hjk, gregkh

Hi Alex,
Thanks for the patch. This would also require support in QEMU to expose the physical address to the VM. Are you looking at that part as well?

Regards
Varun

-----Original Message-----
From: iommu-bounces@lists.linux-foundation.org [mailto:iommu-bounces@lists.linux-foundation.org] On Behalf Of Alex Williamson
Sent: Saturday, October 10, 2015 12:11 AM
To: alex.williamson@redhat.com
Cc: avi@scylladb.com; avi@cloudius-systems.com; gleb@scylladb.com; mst@redhat.com; bruce.richardson@intel.com; corbet@lwn.net; linux-kernel@vger.kernel.org; alexander.duyck@gmail.com; gleb@cloudius-systems.com; stephen@networkplumber.org; vladz@cloudius-systems.com; iommu@lists.linux-foundation.org; hjk@hansjkoch.de; gregkh@linuxfoundation.org
Subject: [RFC PATCH 0/2] VFIO no-iommu

Recent patches for UIO have been attempting to add MSI/X support, which unfortunately implies DMA support, which users have been enabling anyway, but was never intended for UIO.  VFIO on the other hand expects an IOMMU to provide isolation of devices, but provides a much more complete device interface, which already supports full MSI/X support.  There's really no way to support userspace drivers with DMA capable devices without an IOMMU to protect the host, but we can at least think about doing it in a way that properly taints the kernel and avoids creating new code duplicating existing code, that does have a supportable use case.

The diffstat is only so large because I moved vfio.c to vfio_core.c so I could more easily keep the module named vfio.ko while keeping the bulk of the no-iommu support in a separate file that can be optionally compiled.  We're really looking at a couple hundred lines of mostly stub code.  The VFIO_NOIOMMU_IOMMU could certainly be expanded to do page pinning and virt_to_bus() translation, but I didn't want to complicate anything yet.

I've only compiled this and tested loading the module with the new no-iommu mode enabled, I haven't actually tried to port a DPDK driver to it, though it ought to be a pretty obvious mix of the existing UIO and VFIO versions (set the IOMMU, but avoid using it for mapping, use however bus translations are done w/ UIO).  The core vfio device file is still /dev/vfio/vfio, but all the groups become /dev/vfio-noiommu/$GROUP.

It should be obvious, but I always feel obligated to state that this does not and will not ever enable device assignment to virtual machines on non-IOMMU capable platforms.

I'm curious what IOMMU folks think of this.  This hack is really only possible because we don't use iommu_ops for regular DMA, so we can hijack it fairly safely.  I believe that's intended to change though, so this may not be practical long term.  Thanks,

Alex

---

Alex Williamson (2):
      vfio: Move vfio.c vfio_core.c
      vfio: Include no-iommu mode


 drivers/vfio/Kconfig        |   15 
 drivers/vfio/Makefile       |    4 
 drivers/vfio/vfio.c         | 1640 ------------------------------------------
 drivers/vfio/vfio_core.c    | 1680 +++++++++++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio_noiommu.c |  185 +++++
 drivers/vfio/vfio_private.h |   31 +
 include/uapi/linux/vfio.h   |    2 
 7 files changed, 1917 insertions(+), 1640 deletions(-)
 delete mode 100644 drivers/vfio/vfio.c
 create mode 100644 drivers/vfio/vfio_core.c
 create mode 100644 drivers/vfio/vfio_noiommu.c
 create mode 100644 drivers/vfio/vfio_private.h
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


* Re: [RFC PATCH 0/2] VFIO no-iommu
  2015-10-11 17:29 ` [RFC PATCH 0/2] VFIO no-iommu Varun Sethi
@ 2015-10-11 18:23   ` Alex Williamson
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Williamson @ 2015-10-11 18:23 UTC (permalink / raw)
  To: Varun Sethi
  Cc: avi, avi, gleb, mst, bruce.richardson, corbet, linux-kernel,
	alexander.duyck, gleb, stephen, vladz, iommu, hjk, gregkh

On Sun, 2015-10-11 at 17:29 +0000, Varun Sethi wrote:
> Hi Alex,
> Thanks for the patch. This would also require support in QEMU to expose the physical address to the VM. Are you looking at that part as well?

Quoting from below:

        It should be obvious, but I always feel obligated to state that
        this does not and will not ever enable device assignment to
        virtual machines on non-IOMMU capable platforms.


I have no intention whatsoever, nor would I encourage or support use of
this mode for assignment to a virtual machine.  VFIO has uses beyond
QEMU.  Thanks,

Alex

> -----Original Message-----
> From: iommu-bounces@lists.linux-foundation.org [mailto:iommu-bounces@lists.linux-foundation.org] On Behalf Of Alex Williamson
> Sent: Saturday, October 10, 2015 12:11 AM
> To: alex.williamson@redhat.com
> Cc: avi@scylladb.com; avi@cloudius-systems.com; gleb@scylladb.com; mst@redhat.com; bruce.richardson@intel.com; corbet@lwn.net; linux-kernel@vger.kernel.org; alexander.duyck@gmail.com; gleb@cloudius-systems.com; stephen@networkplumber.org; vladz@cloudius-systems.com; iommu@lists.linux-foundation.org; hjk@hansjkoch.de; gregkh@linuxfoundation.org
> Subject: [RFC PATCH 0/2] VFIO no-iommu
> 
> Recent patches for UIO have been attempting to add MSI/X support, which unfortunately implies DMA support, which users have been enabling anyway, but was never intended for UIO.  VFIO on the other hand expects an IOMMU to provide isolation of devices, but provides a much more complete device interface, which already supports full MSI/X support.  There's really no way to support userspace drivers with DMA capable devices without an IOMMU to protect the host, but we can at least think about doing it in a way that properly taints the kernel and avoids creating new code duplicating existing code, that does have a supportable use case.
> 
> The diffstat is only so large because I moved vfio.c to vfio_core.c so I could more easily keep the module named vfio.ko while keeping the bulk of the no-iommu support in a separate file that can be optionally compiled.  We're really looking at a couple hundred lines of mostly stub code.  The VFIO_NOIOMMU_IOMMU could certainly be expanded to do page pinning and virt_to_bus() translation, but I didn't want to complicate anything yet.
> 
> I've only compiled this and tested loading the module with the new no-iommu mode enabled, I haven't actually tried to port a DPDK driver to it, though it ought to be a pretty obvious mix of the existing UIO and VFIO versions (set the IOMMU, but avoid using it for mapping, use however bus translations are done w/ UIO).  The core vfio device file is still /dev/vfio/vfio, but all the groups become /dev/vfio-noiommu/$GROUP.
> 
> It should be obvious, but I always feel obligated to state that this does not and will not ever enable device assignment to virtual machines on non-IOMMU capable platforms.
> 
> I'm curious what IOMMU folks think of this.  This hack is really only possible because we don't use iommu_ops for regular DMA, so we can hijack it fairly safely.  I believe that's intended to change though, so this may not be practical long term.  Thanks,
> 
> Alex
> 
> ---
> 
> Alex Williamson (2):
>       vfio: Move vfio.c vfio_core.c
>       vfio: Include no-iommu mode
> 
> 
>  drivers/vfio/Kconfig        |   15 
>  drivers/vfio/Makefile       |    4 
>  drivers/vfio/vfio.c         | 1640 ------------------------------------------
>  drivers/vfio/vfio_core.c    | 1680 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_noiommu.c |  185 +++++
>  drivers/vfio/vfio_private.h |   31 +
>  include/uapi/linux/vfio.h   |    2 
>  7 files changed, 1917 insertions(+), 1640 deletions(-)
>  delete mode 100644 drivers/vfio/vfio.c
>  create mode 100644 drivers/vfio/vfio_core.c
>  create mode 100644 drivers/vfio/vfio_noiommu.c
>  create mode 100644 drivers/vfio/vfio_private.h
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu





* Re: [RFC PATCH 0/2] VFIO no-iommu
  2015-10-09 18:40 [RFC PATCH 0/2] VFIO no-iommu Alex Williamson
                   ` (2 preceding siblings ...)
  2015-10-11 17:29 ` [RFC PATCH 0/2] VFIO no-iommu Varun Sethi
@ 2015-10-11 18:28 ` Michael S. Tsirkin
  2015-10-11 18:29   ` Michael S. Tsirkin
  3 siblings, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2015-10-11 18:28 UTC (permalink / raw)
  To: Alex Williamson
  Cc: avi, avi, gleb, corbet, bruce.richardson, linux-kernel,
	alexander.duyck, gleb, stephen, vladz, iommu, hjk, gregkh

On Fri, Oct 09, 2015 at 12:40:56PM -0600, Alex Williamson wrote:
> Recent patches for UIO have been attempting to add MSI/X support,
> which unfortunately implies DMA support, which users have been
> enabling anyway, but was never intended for UIO.  VFIO on the other
> hand expects an IOMMU to provide isolation of devices, but provides
> a much more complete device interface, which already supports full
> MSI/X support.  There's really no way to support userspace drivers
> with DMA capable devices without an IOMMU to protect the host, but
> we can at least think about doing it in a way that properly taints
> the kernel and avoids creating new code duplicating existing code,
> that does have a supportable use case.
> 
> The diffstat is only so large because I moved vfio.c to vfio_core.c
> so I could more easily keep the module named vfio.ko while keeping
> the bulk of the no-iommu support in a separate file that can be
> optionally compiled.  We're really looking at a couple hundred lines
> of mostly stub code.  The VFIO_NOIOMMU_IOMMU could certainly be
> expanded to do page pinning and virt_to_bus() translation, but I
> didn't want to complicate anything yet.

I think it's already useful like this, since all current users
seem happy enough to just use hugetlbfs to do pinning, and
ignore translation.

> I've only compiled this and tested loading the module with the new
> no-iommu mode enabled, I haven't actually tried to port a DPDK
> driver to it, though it ought to be a pretty obvious mix of the
> existing UIO and VFIO versions (set the IOMMU, but avoid using it
> for mapping, use however bus translations are done w/ UIO).  The core
> vfio device file is still /dev/vfio/vfio, but all the groups become
> /dev/vfio-noiommu/$GROUP.
> 
> It should be obvious, but I always feel obligated to state that this
> does not and will not ever enable device assignment to virtual
> machines on non-IOMMU capable platforms.

In theory, it's kind of possible using paravirtualization.

Within guest, you'd make map_page retrieve the io address from the host
and return that as dma_addr_t.  The only question would be APIs that
> require more than one contiguous page in IO space (e.g. I think alloc
coherent is like this?).
Not a problem if host is using hugetlbfs, but if not, I guess we could
add a hypercall and some Linux API on the host to trigger compaction
on the host aggressively. MADV_CONTIGIOUS?


> I'm curious what IOMMU folks think of this.  This hack is really
> only possible because we don't use iommu_ops for regular DMA, so we
> can hijack it fairly safely.  I believe that's intended to change
> though, so this may not be practical long term.  Thanks,
> 
> Alex
> 
> ---
> 
> Alex Williamson (2):
>       vfio: Move vfio.c vfio_core.c
>       vfio: Include no-iommu mode
> 
> 
>  drivers/vfio/Kconfig        |   15 
>  drivers/vfio/Makefile       |    4 
>  drivers/vfio/vfio.c         | 1640 ------------------------------------------
>  drivers/vfio/vfio_core.c    | 1680 +++++++++++++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_noiommu.c |  185 +++++
>  drivers/vfio/vfio_private.h |   31 +
>  include/uapi/linux/vfio.h   |    2 
>  7 files changed, 1917 insertions(+), 1640 deletions(-)
>  delete mode 100644 drivers/vfio/vfio.c
>  create mode 100644 drivers/vfio/vfio_core.c
>  create mode 100644 drivers/vfio/vfio_noiommu.c
>  create mode 100644 drivers/vfio/vfio_private.h


* Re: [RFC PATCH 0/2] VFIO no-iommu
  2015-10-11 18:28 ` Michael S. Tsirkin
@ 2015-10-11 18:29   ` Michael S. Tsirkin
  2015-10-11 19:25     ` Alex Williamson
  0 siblings, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2015-10-11 18:29 UTC (permalink / raw)
  To: Alex Williamson
  Cc: avi, avi, gleb, corbet, bruce.richardson, linux-kernel,
	alexander.duyck, gleb, stephen, vladz, iommu, hjk, gregkh

On Sun, Oct 11, 2015 at 09:28:09PM +0300, Michael S. Tsirkin wrote:
> On Fri, Oct 09, 2015 at 12:40:56PM -0600, Alex Williamson wrote:
> > Recent patches for UIO have been attempting to add MSI/X support,
> > which unfortunately implies DMA support, which users have been
> > enabling anyway, but was never intended for UIO.  VFIO on the other
> > hand expects an IOMMU to provide isolation of devices, but provides
> > a much more complete device interface, which already supports full
> > MSI/X support.  There's really no way to support userspace drivers
> > with DMA capable devices without an IOMMU to protect the host, but
> > we can at least think about doing it in a way that properly taints
> > the kernel and avoids creating new code duplicating existing code,
> > that does have a supportable use case.
> > 
> > The diffstat is only so large because I moved vfio.c to vfio_core.c
> > so I could more easily keep the module named vfio.ko while keeping
> > the bulk of the no-iommu support in a separate file that can be
> > optionally compiled.  We're really looking at a couple hundred lines
> > of mostly stub code.  The VFIO_NOIOMMU_IOMMU could certainly be
> > expanded to do page pinning and virt_to_bus() translation, but I
> > didn't want to complicate anything yet.
> 
> I think it's already useful like this, since all current users
> seem happy enough to just use hugetlbfs to do pinning, and
> ignore translation.
> 
> > I've only compiled this and tested loading the module with the new
> > no-iommu mode enabled, I haven't actually tried to port a DPDK
> > driver to it, though it ought to be a pretty obvious mix of the
> > existing UIO and VFIO versions (set the IOMMU, but avoid using it
> > for mapping, use however bus translations are done w/ UIO).  The core
> > vfio device file is still /dev/vfio/vfio, but all the groups become
> > /dev/vfio-noiommu/$GROUP.
> > 
> > It should be obvious, but I always feel obligated to state that this
> > does not and will not ever enable device assignment to virtual
> > machines on non-IOMMU capable platforms.
> 
> In theory, it's kind of possible using paravirtualization.
> 
> Within guest, you'd make map_page retrieve the io address from the host
> and return that as dma_addr_t.  The only question would be APIs that
> require more than one contiguous page in IO space (e.g. I think alloc
> coherent is like this?).
> Not a problem if host is using hugetlbfs, but if not, I guess we could
> add a hypercall and some Linux API on the host to trigger compaction
> on the host aggressively. MADV_CONTIGUOUS?

Not that I see a good reason for that.
Just use an iommu.


> 
> > I'm curious what IOMMU folks think of this.  This hack is really
> > only possible because we don't use iommu_ops for regular DMA, so we
> > can hijack it fairly safely.  I believe that's intended to change
> > though, so this may not be practical long term.  Thanks,
> > 
> > Alex
> > 
> > ---
> > 
> > Alex Williamson (2):
> >       vfio: Move vfio.c vfio_core.c
> >       vfio: Include no-iommu mode
> > 
> > 
> >  drivers/vfio/Kconfig        |   15 
> >  drivers/vfio/Makefile       |    4 
> >  drivers/vfio/vfio.c         | 1640 ------------------------------------------
> >  drivers/vfio/vfio_core.c    | 1680 +++++++++++++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_noiommu.c |  185 +++++
> >  drivers/vfio/vfio_private.h |   31 +
> >  include/uapi/linux/vfio.h   |    2 
> >  7 files changed, 1917 insertions(+), 1640 deletions(-)
> >  delete mode 100644 drivers/vfio/vfio.c
> >  create mode 100644 drivers/vfio/vfio_core.c
> >  create mode 100644 drivers/vfio/vfio_noiommu.c
> >  create mode 100644 drivers/vfio/vfio_private.h

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 0/2] VFIO no-iommu
  2015-10-11 18:29   ` Michael S. Tsirkin
@ 2015-10-11 19:25     ` Alex Williamson
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Williamson @ 2015-10-11 19:25 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: avi, avi, gleb, corbet, bruce.richardson, linux-kernel,
	alexander.duyck, gleb, stephen, vladz, iommu, hjk, gregkh

On Sun, 2015-10-11 at 21:29 +0300, Michael S. Tsirkin wrote:
> On Sun, Oct 11, 2015 at 09:28:09PM +0300, Michael S. Tsirkin wrote:
> > On Fri, Oct 09, 2015 at 12:40:56PM -0600, Alex Williamson wrote:
> > > Recent patches for UIO have been attempting to add MSI/X support,
> > > which unfortunately implies DMA support, which users have been
> > > enabling anyway, but was never intended for UIO.  VFIO on the other
> > > hand expects an IOMMU to provide isolation of devices, but provides
> > > a much more complete device interface, which already supports full
> > > MSI/X support.  There's really no way to support userspace drivers
> > > with DMA capable devices without an IOMMU to protect the host, but
> > > we can at least think about doing it in a way that properly taints
> > > the kernel and avoids creating new code duplicating existing code,
> > > that does have a supportable use case.
> > > 
> > > The diffstat is only so large because I moved vfio.c to vfio_core.c
> > > so I could more easily keep the module named vfio.ko while keeping
> > > the bulk of the no-iommu support in a separate file that can be
> > > optionally compiled.  We're really looking at a couple hundred lines
> > > of mostly stub code.  The VFIO_NOIOMMU_IOMMU could certainly be
> > > expanded to do page pinning and virt_to_bus() translation, but I
> > > didn't want to complicate anything yet.
> > 
> > I think it's already useful like this, since all current users
> > seem happy enough to just use hugetlbfs to do pinning, and
> > ignore translation.

That was sort of my thought too...
 
> > > I've only compiled this and tested loading the module with the new
> > > no-iommu mode enabled, I haven't actually tried to port a DPDK
> > > driver to it, though it ought to be a pretty obvious mix of the
> > > existing UIO and VFIO versions (set the IOMMU, but avoid using it
> > > for mapping, use however bus translations are done w/ UIO).  The core
> > > vfio device file is still /dev/vfio/vfio, but all the groups become
> > > /dev/vfio-noiommu/$GROUP.
> > > 
> > > It should be obvious, but I always feel obligated to state that this
> > > does not and will not ever enable device assignment to virtual
> > > machines on non-IOMMU capable platforms.
> > 
> > In theory, it's kind of possible using paravirtualization.
> > 
> > Within guest, you'd make map_page retrieve the io address from the host
> > and return that as dma_addr_t.  The only question would be APIs that
> > require more than one contiguous page in IO space (e.g. I think alloc
> > coherent is like this?).
> > Not a problem if host is using hugetlbfs, but if not, I guess we could
> > add a hypercall and some Linux API on the host to trigger compaction
> > on the host aggressively. MADV_CONTIGUOUS?
> 
> Not that I see a good reason for that.
> Just use an iommu.

Right, I think it boils down to how much code are we willing to maintain
for an interface that we consider so dangerous and unsupportable that we
immediately taint the kernel.  This is partially why I stopped short of
expanding the no-iommu interface to do page pinning or virt-to-bus
translation.  A few hundred lines of boilerplate stubs to enable re-use
of code that is maintained for supportable interfaces is one thing.
Building onto that with paravirtual IOMMU interfaces to a VM for
something that ultimately cannot be supported is not something I want to
participate in.  Get an IOMMU.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
  2015-10-11  8:12   ` Avi Kivity
  2015-10-11  8:57     ` Michael S. Tsirkin
@ 2015-10-11 21:16     ` Alex Williamson
  1 sibling, 0 replies; 21+ messages in thread
From: Alex Williamson @ 2015-10-11 21:16 UTC (permalink / raw)
  To: Avi Kivity
  Cc: avi, gleb, corbet, bruce.richardson, mst, linux-kernel,
	alexander.duyck, gleb, stephen, vladz, iommu, hjk, gregkh

On Sun, 2015-10-11 at 11:12 +0300, Avi Kivity wrote:
> 
> On 10/09/2015 09:41 PM, Alex Williamson wrote:
> > There is really no way to safely give a user full access to a PCI
> > device without an IOMMU to protect the host from errant DMA.  There
> > is also no way to provide DMA translation, for use cases such as
> > device assignment to virtual machines.  However, there are still those users
> > that want userspace drivers under those conditions.  The UIO driver
> > exists for this use case, but does not provide the degree of device
> > access and programming that VFIO has.  In an effort to avoid code
> > duplication, this introduces a No-IOMMU mode for VFIO.
> >
> > This mode requires enabling CONFIG_VFIO_NOIOMMU and loading the vfio
> > module with the option "enable_unsafe_pci_noiommu_mode".  This should
> > make it very clear that this mode is not safe.  In this mode, there is
> > no support for unprivileged users, CAP_SYS_ADMIN is required for
> > access to the necessary dev files.
> 
> CAP_SYS_RAWIO seems a better match (in particular, it allows access to 
> /dev/mem, which is the same thing).

Sure, that seems reasonable.

> >    Mixing no-iommu and secure VFIO is
> > also unsupported, as are any VFIO IOMMU backends other than the
> > vfio-noiommu backend.  Furthermore, unsafe group files are relocated
> > to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
> > kernel is tainted due to the dummy IOMMU put in place.  Unloading of
> > the module in this mode is also unsupported and will BUG due to the
> > lack of support for unregistering an IOMMU for a bus type.
> 
> I did not see an API for detecting whether memory translation is 
> provided or not.  We can have the caller guess this by looking at the 
> device name, or by requiring the user to specify this, but I think it's 
> cleaner to provide programmatic access to this attribute.

The VFIO user can probe and needs to set the IOMMU model in use before
they can access a device file descriptor.  In this mode, the
VFIO_NOIOMMU_IOMMU is the only model available, which as proposed here
provides no translation, and in fact no mapping ioctls.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
  2015-10-09 18:41 ` [RFC PATCH 2/2] vfio: Include no-iommu mode Alex Williamson
  2015-10-11  8:12   ` Avi Kivity
@ 2015-10-12 15:56   ` Stephen Hemminger
  2015-10-12 16:23     ` Alex Williamson
  2015-10-12 16:27     ` Michael S. Tsirkin
  1 sibling, 2 replies; 21+ messages in thread
From: Stephen Hemminger @ 2015-10-12 15:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: avi, avi, gleb, corbet, bruce.richardson, mst, linux-kernel,
	alexander.duyck, gleb, vladz, iommu, hjk, gregkh

On Fri, 09 Oct 2015 12:41:10 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> There is really no way to safely give a user full access to a PCI
> device without an IOMMU to protect the host from errant DMA.  There
> is also no way to provide DMA translation, for use cases such as
> device assignment to virtual machines.  However, there are still those users
> that want userspace drivers under those conditions.  The UIO driver
> exists for this use case, but does not provide the degree of device
> access and programming that VFIO has.  In an effort to avoid code
> duplication, this introduces a No-IOMMU mode for VFIO.
> 
> This mode requires enabling CONFIG_VFIO_NOIOMMU and loading the vfio
> module with the option "enable_unsafe_pci_noiommu_mode".  This should
> make it very clear that this mode is not safe.  In this mode, there is
> no support for unprivileged users, CAP_SYS_ADMIN is required for
> access to the necessary dev files.  Mixing no-iommu and secure VFIO is
> also unsupported, as are any VFIO IOMMU backends other than the
> vfio-noiommu backend.  Furthermore, unsafe group files are relocated
> to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
> kernel is tainted due to the dummy IOMMU put in place.  Unloading of
> the module in this mode is also unsupported and will BUG due to the
> lack of support for unregistering an IOMMU for a bus type.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Will this work for distros where changing kernel command line options
is really not that practical?  We need to boot with one command line
and then decide whether to use the IOMMU later on, during the service
startup of the dataplane application.  Recent experience is that IOMMUs
are broken on many platforms, so the only way to make a DPDK application
work is to write a test program that can be used to check whether
VFIO+IOMMU works first.

Also, although you think the long option will set the bar high
enough, it probably will not satisfy anyone.  It is annoying enough
that I would just carry a patch to remove the silly requirement.
And the people who believe all user mode DMA is evil won't be
satisfied either.

But I really like having the same consistent API for handling device
access with IOMMU and when IOMMU will/won't work.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
  2015-10-12 15:56   ` Stephen Hemminger
@ 2015-10-12 16:23     ` Alex Williamson
  2015-10-12 16:31       ` Avi Kivity
  2015-10-12 16:27     ` Michael S. Tsirkin
  1 sibling, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2015-10-12 16:23 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: avi, avi, gleb, corbet, bruce.richardson, mst, linux-kernel,
	alexander.duyck, gleb, vladz, iommu, hjk, gregkh

On Mon, 2015-10-12 at 08:56 -0700, Stephen Hemminger wrote:
> On Fri, 09 Oct 2015 12:41:10 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > There is really no way to safely give a user full access to a PCI
> > device without an IOMMU to protect the host from errant DMA.  There
> > is also no way to provide DMA translation, for use cases such as
> > device assignment to virtual machines.  However, there are still those users
> > that want userspace drivers under those conditions.  The UIO driver
> > exists for this use case, but does not provide the degree of device
> > access and programming that VFIO has.  In an effort to avoid code
> > duplication, this introduces a No-IOMMU mode for VFIO.
> > 
> > This mode requires enabling CONFIG_VFIO_NOIOMMU and loading the vfio
> > module with the option "enable_unsafe_pci_noiommu_mode".  This should
> > make it very clear that this mode is not safe.  In this mode, there is
> > no support for unprivileged users, CAP_SYS_ADMIN is required for
> > access to the necessary dev files.  Mixing no-iommu and secure VFIO is
> > also unsupported, as are any VFIO IOMMU backends other than the
> > vfio-noiommu backend.  Furthermore, unsafe group files are relocated
> > to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
> > kernel is tainted due to the dummy IOMMU put in place.  Unloading of
> > the module in this mode is also unsupported and will BUG due to the
> > lack of support for unregistering an IOMMU for a bus type.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> 
> Will this work for distros where changing kernel command line options
> is really not that practical?  We need to boot with one command line
> and then decide whether to use the IOMMU later on, during the service
> startup of the dataplane application.  Recent experience is that IOMMUs
> are broken on many platforms, so the only way to make a DPDK application
> work is to write a test program that can be used to check whether
> VFIO+IOMMU works first.

There's no kernel command line dependency, if you find that VFIO+IOMMU
doesn't work, unload the vfio modules and reload with the no-iommu
option and try again.  VFIO itself cannot simply fall back to no-iommu,
that definitely needs to be a user opt-in.  The userspace app though
should easily be able to tell when the type1 model is not available and
fall back to no-iommu.

> Also, although you think the long option will set the bar high
> enough, it probably will not satisfy anyone.  It is annoying enough
> that I would just carry a patch to remove the silly requirement.
> And the people who believe all user mode DMA is evil won't be
> satisfied either.

I find that many users blindly follow howtos and only sometimes do they
question the options if they sound scary enough.  So yeah, I would
intend to make the option upstream sound scary enough for people to
think twice about using it and maybe even read the description.  That
still doesn't prevent pasting it into modprobe.d and forgetting about
it.

I don't see non-IOMMU protected usermode DMA as evil, it's just
unsupportable and I want to be able to immediately know that it's a
possibility if I'm looking at a kernel bug.

> But I really like having the same consistent API for handling device
> access with IOMMU and when IOMMU will/won't work.

We still need to hear from IOMMU folks, AIUI there are plans to
implement dma_iops via iommu_ops, which would make squatting on
iommu_ops much more difficult for a hack like this.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
  2015-10-12 15:56   ` Stephen Hemminger
  2015-10-12 16:23     ` Alex Williamson
@ 2015-10-12 16:27     ` Michael S. Tsirkin
  2015-10-12 17:46       ` Alex Williamson
  1 sibling, 1 reply; 21+ messages in thread
From: Michael S. Tsirkin @ 2015-10-12 16:27 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Alex Williamson, avi, avi, gleb, corbet, bruce.richardson,
	linux-kernel, alexander.duyck, gleb, vladz, iommu, hjk, gregkh

On Mon, Oct 12, 2015 at 08:56:07AM -0700, Stephen Hemminger wrote:
> On Fri, 09 Oct 2015 12:41:10 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > There is really no way to safely give a user full access to a PCI
> > device without an IOMMU to protect the host from errant DMA.  There
> > is also no way to provide DMA translation, for use cases such as
> > device assignment to virtual machines.  However, there are still those users
> > that want userspace drivers under those conditions.  The UIO driver
> > exists for this use case, but does not provide the degree of device
> > access and programming that VFIO has.  In an effort to avoid code
> > duplication, this introduces a No-IOMMU mode for VFIO.
> > 
> > This mode requires enabling CONFIG_VFIO_NOIOMMU and loading the vfio
> > module with the option "enable_unsafe_pci_noiommu_mode".  This should
> > make it very clear that this mode is not safe.  In this mode, there is
> > no support for unprivileged users, CAP_SYS_ADMIN is required for
> > access to the necessary dev files.  Mixing no-iommu and secure VFIO is
> > also unsupported, as are any VFIO IOMMU backends other than the
> > vfio-noiommu backend.  Furthermore, unsafe group files are relocated
> > to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
> > kernel is tainted due to the dummy IOMMU put in place.  Unloading of
> > the module in this mode is also unsupported and will BUG due to the
> > lack of support for unregistering an IOMMU for a bus type.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> 
> Will this work for distros where changing kernel command line options
> is really not that practical?  We need to boot with one command line
> and then decide whether to use the IOMMU later on, during the service
> startup of the dataplane application.

On open? That's too late in my opinion. But maybe the flag can be
tweaked so that it will probe for an iommu and, if one is there, do the
right thing, but if that fails, enable the dummy one.
And maybe defer tainting until device open.

Won't address the "old IOMMUs add performance overhead"
use case, but I'm not very impressed by that in any case.

> Recent experience is that IOMMUs
> are broken on many platforms, so the only way to make a DPDK application
> work is to write a test program that can be used to check whether
> VFIO+IOMMU works first.

In userspace? Well that's just piling up work-arounds.  And assuming
hardware is broken, who knows what's going on security-wise.  These
broken systems need to be identified and blacklisted in the kernel.

> Also, although you think the long option will set the bar high
> enough, it probably will not satisfy anyone.  It is annoying enough
> that I would just carry a patch to remove the silly requirement.

That sounds reasonable. Anyone who can carry a kernel patch
does not need the warning.

> And the people who believe
> all user mode DMA is evil won't be satisfied either.
> But I really like having the same consistent API for handling device
> access with IOMMU and when IOMMU will/won't work.

I agree that's good. Makes it easier to migrate applications to
the safe configuration down the road. Thanks Alex!

-- 
mST

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
  2015-10-12 16:23     ` Alex Williamson
@ 2015-10-12 16:31       ` Avi Kivity
  0 siblings, 0 replies; 21+ messages in thread
From: Avi Kivity @ 2015-10-12 16:31 UTC (permalink / raw)
  To: Alex Williamson, Stephen Hemminger
  Cc: avi, gleb, corbet, bruce.richardson, mst, linux-kernel,
	alexander.duyck, gleb, vladz, iommu, hjk, gregkh



On 10/12/2015 07:23 PM, Alex Williamson wrote:
>> Also, although you think the long option will set the bar high
>> enough it probably will not satisfy anyone. It is annoying enough, that
>> I would just carry a patch to remove it the silly requirement.
>> And the the people who believe
>> all user mode DMA is evil won't be satisfied either.
> I find that many users blindly follow howtos and only sometimes do they
> question the options if they sound scary enough.  So yeah, I would
> intend to make the option upstream sound scary enough for people to
> think twice about using it and maybe even read the description.  That
> still doesn't prevent pasting it into modprobe.d and forgetting about
> it.

I think we need to allow for packages to work with this.  I.e. drop 
files into config directories rather than require editing of config 
files.  I think this is mostly workable via modprobe.d.

(For our own use case, we don't require the extreme performance of many
L2/L3 DPDK apps, so we'll work with regular IOMMU-protected VFIO for
bare-metal installations and ship prebuilt virtual machine images for
clouds, so it doesn't matter that much to me.)


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
  2015-10-12 16:27     ` Michael S. Tsirkin
@ 2015-10-12 17:46       ` Alex Williamson
  2015-10-12 18:08         ` Alex Williamson
  0 siblings, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2015-10-12 17:46 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, avi, avi, gleb, corbet, bruce.richardson,
	linux-kernel, alexander.duyck, gleb, vladz, iommu, hjk, gregkh

On Mon, 2015-10-12 at 19:27 +0300, Michael S. Tsirkin wrote:
> On Mon, Oct 12, 2015 at 08:56:07AM -0700, Stephen Hemminger wrote:
> > On Fri, 09 Oct 2015 12:41:10 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> > 
> > > There is really no way to safely give a user full access to a PCI
> > > device without an IOMMU to protect the host from errant DMA.  There
> > > is also no way to provide DMA translation, for use cases such as
> > > device assignment to virtual machines.  However, there are still those users
> > > that want userspace drivers under those conditions.  The UIO driver
> > > exists for this use case, but does not provide the degree of device
> > > access and programming that VFIO has.  In an effort to avoid code
> > > duplication, this introduces a No-IOMMU mode for VFIO.
> > > 
> > > This mode requires enabling CONFIG_VFIO_NOIOMMU and loading the vfio
> > > module with the option "enable_unsafe_pci_noiommu_mode".  This should
> > > make it very clear that this mode is not safe.  In this mode, there is
> > > no support for unprivileged users, CAP_SYS_ADMIN is required for
> > > access to the necessary dev files.  Mixing no-iommu and secure VFIO is
> > > also unsupported, as are any VFIO IOMMU backends other than the
> > > vfio-noiommu backend.  Furthermore, unsafe group files are relocated
> > > to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
> > > kernel is tainted due to the dummy IOMMU put in place.  Unloading of
> > > the module in this mode is also unsupported and will BUG due to the
> > > lack of support for unregistering an IOMMU for a bus type.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > 
> > Will this work for distros where changing kernel command line options
> > is really not that practical?  We need to boot with one command line
> > and then decide whether to use the IOMMU later on, during the service
> > startup of the dataplane application.
> 
> On open? That's too late in my opinion. But maybe the flag can be
> tweaked so that it will probe for iommu, if there - do the
> right thing, but if that fails, enable the dummy one.
> And maybe defer tainting until device open.

The vfio mechanics are that a vfio bus driver, such as vfio-pci, binds to
a device.  In the probe function, we check for an iommu group, which
vfio-core then uses to create the vfio group.  So there's nothing to
open(), the iommu association needs to be made prior to even binding the
device to vfio-pci.  Probing for an iommu can also only be done on a per
bus_type basis, which will likely eventually become a per bus instance
to support heterogeneous iommus, so vfio can't simply determine that an
iommu is not present globally.  This is why the new module option
includes the word "pci", so that it can probe for and attach the dummy
iommu specifically on the pci_bus_type.

We can still consider whether there are better points at which to initiate
the fake iommu group.  I've tried to think through vfio-pci doing it on
probe(), but it seems pretty ugly.

In this RFC, I specifically avoided making the vfio no-iommu iommu
driver just another modular iommu backend, I wanted it to be tied to a
vfio module option such that vfio behaves differently with open()s and
certain ioctls.  I think it would be really confusing to users if safe
and unsafe modes could be used concurrently for different devices.

> Won't address the "old IOMMUs add performance overhead"
> usecase but I'm not very impressed by that in any case.

Yep, me neither, certainly not for static mappings.  There's a lot of
FUD left over from latencies in the streaming DMA mapping paths where
mappings are created and destroyed at a high rate.  That has more to do
with flushing mappings out of the hardware than with iotlb miss latency
or actual translation, which is all that should be in play for most uses
here.

> > Recent experience is that IOMMUs
> > are broken on many platforms, so the only way to make a DPDK application
> > work is to write a test program that can be used to check whether
> > VFIO+IOMMU works first.
> 
> In userspace? Well that's just piling up work-arounds.  And assuming
> hardware is broken, who knows what's going on security-wise.  These
> broken systems need to be identified and black-listed in kernel.
> 
> > Also, although you think the long option will set the bar high
> > enough, it probably will not satisfy anyone.  It is annoying enough
> > that I would just carry a patch to remove the silly requirement.
> 
> That sounds reasonable. Anyone who can carry a kernel patch
> does not need the warning.
> 
> > And the people who believe
> > all user mode DMA is evil won't be satisfied either.
> > But I really like having the same consistent API for handling device
> > access with IOMMU and when IOMMU will/won't work.
> 
> I agree that's good. Makes it easier to migrate applications to
> the safe configuration down the road. Thanks Alex!
> 




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH 2/2] vfio: Include no-iommu mode
  2015-10-12 17:46       ` Alex Williamson
@ 2015-10-12 18:08         ` Alex Williamson
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Williamson @ 2015-10-12 18:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, avi, avi, gleb, corbet, bruce.richardson,
	linux-kernel, alexander.duyck, gleb, vladz, iommu, hjk, gregkh

On Mon, 2015-10-12 at 11:46 -0600, Alex Williamson wrote:
> On Mon, 2015-10-12 at 19:27 +0300, Michael S. Tsirkin wrote:
> > On Mon, Oct 12, 2015 at 08:56:07AM -0700, Stephen Hemminger wrote:
> > > On Fri, 09 Oct 2015 12:41:10 -0600
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > 
> > > > There is really no way to safely give a user full access to a PCI
> > > > device without an IOMMU to protect the host from errant DMA.  There
> > > > is also no way to provide DMA translation, for use cases such as
> > > > device assignment to virtual machines.  However, there are still those users
> > > > that want userspace drivers under those conditions.  The UIO driver
> > > > exists for this use case, but does not provide the degree of device
> > > > access and programming that VFIO has.  In an effort to avoid code
> > > > duplication, this introduces a No-IOMMU mode for VFIO.
> > > > 
> > > > This mode requires enabling CONFIG_VFIO_NOIOMMU and loading the vfio
> > > > module with the option "enable_unsafe_pci_noiommu_mode".  This should
> > > > make it very clear that this mode is not safe.  In this mode, there is
> > > > no support for unprivileged users, CAP_SYS_ADMIN is required for
> > > > access to the necessary dev files.  Mixing no-iommu and secure VFIO is
> > > > also unsupported, as are any VFIO IOMMU backends other than the
> > > > vfio-noiommu backend.  Furthermore, unsafe group files are relocated
> > > > to /dev/vfio-noiommu/.  Upon successful loading in this mode, the
> > > > kernel is tainted due to the dummy IOMMU put in place.  Unloading of
> > > > the module in this mode is also unsupported and will BUG due to the
> > > > lack of support for unregistering an IOMMU for a bus type.
> > > > 
> > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > 
> > > Will this work for distros where changing kernel command line options
> > > is really not that practical?  We need to boot with one command line
> > > and then decide whether to use the IOMMU later on, during the service
> > > startup of the dataplane application.
> > 
> > On open? That's too late in my opinion. But maybe the flag can be
> > tweaked so that it will probe for iommu, if there - do the
> > right thing, but if that fails, enable the dummy one.
> > And maybe defer tainting until device open.

I forgot to address the tainting point; I think we were previously
talking about tainting at the point where bus master is enabled, but I
chose to do it much earlier here because the act of registering a dummy
iommu_ops for a bus type is pretty much the point at which we have a
good chance of breaking the system.  I also considered that some devices
can manipulate their config space registers using device specific
registers (such as the example of GPUs that mirror config space in
mmio).  It's therefore not always possible to taint at the point where
we think the user has done something bad.  The best case would be
tainting at the point where the device file descriptor is opened as you
suggested, but we can't do that while we're exposing dummy iommu_ops to
the whole bus type.  Maybe another option would be to create vfio
wrappers for the iommu callbacks to have the iommu facade more local to
vfio.

My main requirements are that I do not want to be disruptive to the
existing vfio code or add a significant amount of code that needs to be
maintained for the purpose of supporting a use mode that we don't really
think is supportable.  Thanks,

Alex

> The vfio mechanics are that a vfio bus driver, such as vfio-pci binds to
> a device.  In the probe function, we check for an iommu group, which
> vfio-core then uses to create the vfio group.  So there's nothing to
> open(), the iommu association needs to be made prior to even binding the
> device to vfio-pci.  Probing for an iommu can also only be done on a per
> bus_type basis, which will likely eventually become a per bus instance
> to support heterogeneous iommus, so vfio can't simply determine that an
> iommu is not present globally.  This is why the new module option
> includes the word "pci", so that it can probe for and attach the dummy
> iommu specifically on the pci_bus_type.
> 
> We can still consider if there are better points at which to initiate
> the fake iommu group.  Trying to think through vfio-pci doing it on
> probe(), but it seems pretty ugly.
> 
> In this RFC, I specifically avoided making the vfio no-iommu iommu
> driver just another modular iommu backend, I wanted it to be tied to a
> vfio module option such that vfio behaves differently with open()s and
> certain ioctls.  I think it would be really confusing to users if safe
> and unsafe modes could be used concurrently for different devices.
> 
> > Won't address the "old IOMMUs add performance overhead"
> > usecase but I'm not very impressed by that in any case.
> 
> Yep, me neither, certainly not for static mappings.  There's a lot of
> FUD left over from latencies in the streaming DMA mapping paths where
> mappings are created and destroyed at a high rate.  That has more to do
> with flushing mappings out of the hardware than with iotlb miss latency
> or actual translation, which is all that should be in play for most uses
> here.
> 
> > > Recent experience is that IOMMUs
> > > are broken on many platforms, so the only way to make a DPDK
> > > application work is to write a test program that can be used to
> > > check if VFIO+IOMMU works first.
> > 
> > In userspace? Well that's just piling up work-arounds.  And assuming
> > hardware is broken, who knows what's going on security-wise.  These
> > broken systems need to be identified and black-listed in kernel.
> > 
> > > Also, although you think the long option will set the bar high
> > > enough, it probably will not satisfy anyone. It is annoying enough
> > > that I would just carry a patch to remove the silly requirement.
> > 
> > That sounds reasonable. Anyone who can carry a kernel patch
> > does not need the warning.
> > 
> > > And the people who believe
> > > all user mode DMA is evil won't be satisfied either.
> > > But I really like having the same consistent API for handling device
> > > access with IOMMU and when IOMMU will/won't work.
> > 
> > I agree that's good. Makes it easier to migrate applications to
> > the safe configuration down the road. Thanks Alex!
> > 
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/





Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-09 18:40 [RFC PATCH 0/2] VFIO no-iommu Alex Williamson
2015-10-09 18:41 ` [RFC PATCH 1/2] vfio: Move vfio.c vfio_core.c Alex Williamson
2015-10-09 19:21   ` Greg KH
2015-10-09 18:41 ` [RFC PATCH 2/2] vfio: Include no-iommu mode Alex Williamson
2015-10-11  8:12   ` Avi Kivity
2015-10-11  8:57     ` Michael S. Tsirkin
2015-10-11  9:03       ` Avi Kivity
2015-10-11  9:19         ` Michael S. Tsirkin
2015-10-11  9:23           ` Gleb Natapov
2015-10-11 21:16     ` Alex Williamson
2015-10-12 15:56   ` Stephen Hemminger
2015-10-12 16:23     ` Alex Williamson
2015-10-12 16:31       ` Avi Kivity
2015-10-12 16:27     ` Michael S. Tsirkin
2015-10-12 17:46       ` Alex Williamson
2015-10-12 18:08         ` Alex Williamson
2015-10-11 17:29 ` [RFC PATCH 0/2] VFIO no-iommu Varun Sethi
2015-10-11 18:23   ` Alex Williamson
2015-10-11 18:28 ` Michael S. Tsirkin
2015-10-11 18:29   ` Michael S. Tsirkin
2015-10-11 19:25     ` Alex Williamson
