* [PATCH 0/5] VFIO core framework
@ 2011-12-21 21:42 Alex Williamson
  2011-12-21 21:42 ` [PATCH 1/5] vfio: Introduce documentation for VFIO driver Alex Williamson
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Alex Williamson @ 2011-12-21 21:42 UTC (permalink / raw)
  To: chrisw, aik, david, joerg.roedel, agraf, benve, aafabbri, B08248,
	B07421, avi, kvm, qemu-devel, iommu, linux-pci, linux-kernel

This series includes the core framework for the VFIO driver.
VFIO is a userspace driver interface meant to replace both the
KVM device assignment code as well as interfaces like UIO.  Please
see patch 1/5 for a complete description of VFIO, what it can do,
and how it's designed.

This version and the VFIO PCI bus driver, for exposing PCI devices
through VFIO, can be found here:

git://github.com/awilliam/linux-vfio.git vfio-next-20111221

A development version of qemu which includes a full working
vfio-pci driver, independent of KVM support, can be found here:

git://github.com/awilliam/qemu-vfio.git vfio-ng

Thanks,

Alex

PS - I'll be mostly unavailable over the holidays, but wanted to get
this out for review and comparison to the isolation APIs being proposed.

---

Alex Williamson (5):
      vfio: VFIO core Kconfig and Makefile
      vfio: VFIO core IOMMU mapping support
      vfio: VFIO core group interface
      vfio: VFIO core header
      vfio: Introduce documentation for VFIO driver


 Documentation/ioctl/ioctl-number.txt |    1 
 Documentation/vfio.txt               |  352 ++++++++++
 MAINTAINERS                          |    8 
 drivers/Kconfig                      |    2 
 drivers/Makefile                     |    1 
 drivers/vfio/Kconfig                 |    8 
 drivers/vfio/Makefile                |    3 
 drivers/vfio/vfio_iommu.c            |  593 +++++++++++++++++
 drivers/vfio/vfio_main.c             | 1201 ++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio_private.h          |   36 +
 include/linux/vfio.h                 |  353 ++++++++++
 11 files changed, 2558 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/vfio.txt
 create mode 100644 drivers/vfio/Kconfig
 create mode 100644 drivers/vfio/Makefile
 create mode 100644 drivers/vfio/vfio_iommu.c
 create mode 100644 drivers/vfio/vfio_main.c
 create mode 100644 drivers/vfio/vfio_private.h
 create mode 100644 include/linux/vfio.h


* [PATCH 1/5] vfio: Introduce documentation for VFIO driver
  2011-12-21 21:42 [PATCH 0/5] VFIO core framework Alex Williamson
@ 2011-12-21 21:42 ` Alex Williamson
  2011-12-28 17:16   ` [Qemu-devel] " Ronen Hod
  2011-12-21 21:42 ` [PATCH 2/5] vfio: VFIO core header Alex Williamson
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Alex Williamson @ 2011-12-21 21:42 UTC (permalink / raw)
  To: chrisw, aik, david, joerg.roedel, agraf, benve, aafabbri, B08248,
	B07421, avi, kvm, qemu-devel, iommu, linux-pci, linux-kernel

Including rationale for design, example usage and API description.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/vfio.txt |  352 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 352 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/vfio.txt

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
new file mode 100644
index 0000000..09a5a5b
--- /dev/null
+++ b/Documentation/vfio.txt
@@ -0,0 +1,352 @@
+VFIO - "Virtual Function I/O"[1]
+-------------------------------------------------------------------------------
+Many modern systems now provide DMA and interrupt remapping facilities
+to help ensure I/O devices behave within the boundaries they've been
+allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
+POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
+systems with the Freescale PAMU.  The VFIO driver is an IOMMU/device
+agnostic framework for exposing direct device access to userspace, in
+a secure, IOMMU protected environment.  In other words, this allows
+safe[2], non-privileged, userspace drivers.
+
+Why do we want that?  Virtual machines often make use of direct device
+access ("device assignment") when configured for the highest possible
+I/O performance.  From a device and host perspective, this simply
+turns the VM into a userspace driver, with the benefits of
+significantly reduced latency, higher bandwidth, and direct use of
+bare-metal device drivers[3].
+
+Some applications, particularly in the high performance computing
+field, also benefit from low-overhead, direct device access from
+userspace.  Examples include network adapters (often non-TCP/IP based)
+and compute accelerators.  Prior to VFIO, these drivers had to either
+go through the full development cycle to become a proper upstream
+driver, be maintained out of tree, or make use of the UIO framework,
+which has no notion of IOMMU protection, limited interrupt support,
+and requires root privileges to access things like PCI configuration
+space.
+
+The VFIO driver framework intends to unify these, replacing the KVM
+PCI-specific device assignment code and providing a more secure, more
+featureful userspace driver environment than UIO.
+
+Groups, Devices, and IOMMUs
+-------------------------------------------------------------------------------
+
+Userspace drivers are primarily concerned with manipulating individual
+devices and setting up mappings in the IOMMU for those devices.
+Unfortunately, the IOMMU doesn't always have the granularity to track
+mappings for an individual device.  Sometimes this is a topology
+barrier, such as a PCIe-to-PCI bridge interposed between the device
+and the IOMMU; other times it is an IOMMU limitation.  In any case, the
+reality is that devices are not always independent with respect to the
+IOMMU.  Translations set up for one device can be used by another
+device in these scenarios.
+
+The IOMMU API exposes these relationships by identifying an "IOMMU
+group" for these dependent devices.  Devices on the same bus with the
+same IOMMU group (or just "group" for this document) are not isolated
+from each other with respect to DMA mappings.  For userspace usage,
+this logically means that instead of being able to grant ownership of
+an individual device, we must grant ownership of a group, which may
+contain one or more devices.
+
+These groups therefore become a fundamental component of VFIO and the
+working unit we use for exposing devices and granting permissions to
+userspace.  In addition, VFIO makes efforts to ensure the integrity of
+the group for user access.  This includes ensuring that all devices
+within the group are controlled by VFIO (vs native host drivers)
+before allowing a user to access any member of the group or the IOMMU
+mappings, as well as maintaining the group viability as devices are
+dynamically added or removed from the system.
+
+To access a device through VFIO, a user must open a character device
+for the group that the device belongs to and then issue an ioctl to
+retrieve a file descriptor for the individual device.  This ensures
+that the user has permissions to the group (file based access to the
+/dev entry) and allows a checkpoint at which VFIO can deny access to
+the device if the group is not viable (all devices within the group
+controlled by VFIO).  A file descriptor for the IOMMU is obtained in
+the same fashion.
+
+VFIO defines a standard set of APIs for access to devices and a
+modular interface for adding new, bus-specific VFIO device drivers.
+We call these "VFIO bus drivers".  The vfio-pci module is an example
+of a bus driver for exposing PCI devices.  When the bus driver module
+is loaded it enumerates all of the devices for its bus, registering
+each device with the vfio core along with a set of callbacks.  For
+buses that support hotplug, the bus driver also adds itself to the
+notification chain for such events.  The callbacks registered with
+each device implement the VFIO device access API for that bus.
+
+The VFIO device API includes ioctls for describing the device, the I/O
+regions and their read/write/mmap offsets on the device descriptor, as
+well as mechanisms for describing and registering interrupt
+notifications.
+
+The VFIO IOMMU object is accessed in a similar way; an ioctl on the
+group provides a file descriptor for programming the IOMMU.  Like
+devices, the IOMMU file descriptor is only accessible when a group is
+viable.  The API for the IOMMU is effectively a userspace extension of
+the kernel IOMMU API.  The IOMMU provides an ioctl to describe the
+IOMMU domain as well as to set up and tear down DMA mappings.  As the
+IOMMU API is extended to support more esoteric IOMMU implementations,
+it's expected that the VFIO interface will also evolve.
+
+To facilitate this evolution, all of the VFIO interfaces are designed
+for extension.  In particular, for all structures passed via ioctl, we
+include a structure size and flags field.  We also define the ioctl
+request to be independent of passed structure size.  This allows us to
+later add structure fields and define flags as necessary.  It's
+expected that each additional field will have an associated flag to
+indicate whether the data is valid.  Additionally, we provide an
+"info" ioctl for each file descriptor, which allows us to flag new
+features as they're added (ex. an IOMMU domain configuration ioctl).
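+
+As a minimal sketch of the kernel side (inside an ioctl handler, where
+arg is the userspace pointer; struct vfio_foo_info and its FOO_VALID
+flag are hypothetical):
+
+	struct vfio_foo_info info;
+	unsigned long minsz = offsetofend(struct vfio_foo_info, flags);
+
+	if (copy_from_user(&info, (void __user *)arg, minsz))
+		return -EFAULT;
+
+	if (info.argsz < minsz)
+		return -EINVAL;
+
+	/* Only act on the newer "foo" field if the caller's argsz
+	 * shows their structure is recent enough to include it. */
+	if (info.argsz >= offsetofend(struct vfio_foo_info, foo))
+		info.flags |= VFIO_FOO_INFO_FLAGS_FOO_VALID;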
+
+The final aspect of VFIO is the notion of merging groups.  In both the
+assignment of devices to virtual machines and the pure userspace
+driver model, it's expected that a single user instance is likely to
+have multiple groups in use simultaneously.  If these groups are all
+using the same set of IOMMU mappings, the overhead of userspace
+setting up and tearing down the mappings, as well as the internal
+IOMMU driver overhead of managing those mappings can be non-trivial.
+Some IOMMU implementations are able to easily reduce this overhead by
+simply using the same set of page tables across multiple groups.
+VFIO allows users to take advantage of this option by merging groups
+together, effectively creating a super group (IOMMU groups only define
+the minimum granularity).
+
+A user can attempt to merge groups together by calling the merge ioctl
+on one group (the "merger") and passing a file descriptor for the
+group to be merged in (the "mergee").  Note that existing DMA mappings
+cannot be atomically merged between groups; it is therefore a
+requirement that the mergee group is not in use.  This is enforced by
+not allowing open device or iommu file descriptors on the mergee group
+at the time of merging.  The merger group can be actively in use at
+the time of merging.  Likewise, to unmerge a group, none of the device
+file descriptors for the group being removed can be in use.  The
+remaining merged group can be actively in use.
+
+If the groups cannot be merged, the ioctl will fail and the user will
+need to manage the groups independently.  Users should have no
+expectation for group merging to be successful.  Some platforms may
+not support it at all, others may only enable merging of sufficiently
+similar groups.  If the ioctl succeeds, then the group file
+descriptors are effectively fungible between the groups.  That is,
+instead of their actions being isolated to the individual group, each
+of them is a gateway into the combined, merged group.  For instance,
+retrieving an IOMMU file descriptor from any group returns a reference
+to the same object, mappings to that IOMMU descriptor are visible to
+all devices in the merged group, and device descriptors can be
+retrieved for any device in the merged group from any one of the group
+file descriptors.  In effect, a user can manage devices and the IOMMU
+of a merged group using a single file descriptor (saving the merged
+groups' file descriptors away only for unmerging) without the
+permission complications of creating a separate "super group" character
+device.
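+
+For example, merging and later unmerging a second group (a sketch;
+"pci:241" is a hypothetical second group on the same bus):
+
+	int group_a = open("/dev/vfio/pci:240", O_RDWR);
+	int group_b = open("/dev/vfio/pci:241", O_RDWR);
+
+	/* group_b must have no open device or iommu descriptors */
+	if (ioctl(group_a, VFIO_GROUP_MERGE, &group_b) != 0)
+		return;	/* manage the groups independently */
+
+	/* ... use either fd to reach the merged group ... */
+
+	/* remove group_b from the merged set when finished */
+	ioctl(group_b, VFIO_GROUP_UNMERGE);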
+
+VFIO Usage Example
+-------------------------------------------------------------------------------
+
+Assume the user wants to access PCI device 0000:06:0d.0
+
+$ cat /sys/bus/pci/devices/0000:06:0d.0/iommu_group
+240
+
+Since this device is on the "pci" bus, the user can then find the
+character device for interacting with the VFIO group as:
+
+$ ls -l /dev/vfio/pci:240
+crw-rw---- 1 root root 252, 27 Dec 15 15:13 /dev/vfio/pci:240
+
+We can also examine other members of the group through sysfs:
+
+$ ls -l /sys/devices/virtual/vfio/pci:240/devices/
+total 0
+lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.0 -> \
+		../../../../pci0000:00/0000:00:1e.0/0000:06:0d.0
+lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.1 -> \
+		../../../../pci0000:00/0000:00:1e.0/0000:06:0d.1
+
+This group therefore contains two devices[4].  VFIO will prevent
+device or iommu manipulation unless all group members are attached to
+the vfio bus driver, so we simply unbind the devices from their
+current driver and rebind them to vfio:
+
+# for i in /sys/devices/virtual/vfio/pci:240/devices/*; do
+	dir=$(readlink -f $i)
+	if [ -L $dir/driver ]; then
+		echo $(basename $i) > $dir/driver/unbind
+	fi
+	vendor=$(cat $dir/vendor)
+	device=$(cat $dir/device)
+	echo $vendor $device > /sys/bus/pci/drivers/vfio/new_id
+	echo $(basename $i) > /sys/bus/pci/drivers/vfio/bind
+done
+
+# chown user:user /dev/vfio/pci:240
+
+The user now has full access to all the devices and the iommu for this
+group and can access them as follows:
+
+	int group, iommu, device, i;
+	struct vfio_group_info group_info = { .argsz = sizeof(group_info) };
+	struct vfio_iommu_info iommu_info = { .argsz = sizeof(iommu_info) };
+	struct vfio_dma_map dma_map = { .argsz = sizeof(dma_map) };
+	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+
+	/* Open the group */
+	group = open("/dev/vfio/pci:240", O_RDWR);
+
+	/* Test the group is viable and available */
+	ioctl(group, VFIO_GROUP_GET_INFO, &group_info);
+
+	if (!(group_info.flags & VFIO_GROUP_FLAGS_VIABLE))
+		/* Group is not viable */
+
+	if ((group_info.flags & VFIO_GROUP_FLAGS_MM_LOCKED))
+		/* Already in use by someone else */
+
+	/* Get a file descriptor for the IOMMU */
+	iommu = ioctl(group, VFIO_GROUP_GET_IOMMU_FD);
+
+	/* Test the IOMMU is what we expect */
+	ioctl(iommu, VFIO_IOMMU_GET_INFO, &iommu_info);
+
+	/* Allocate some space and setup a DMA mapping */
+	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
+			     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+	dma_map.size = 1024 * 1024;
+	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+
+	ioctl(iommu, VFIO_IOMMU_MAP_DMA, &dma_map);
+
+	/* Get a file descriptor for the device */
+	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
+
+	/* Test and setup the device */
+	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
+
+	for (i = 0; i < device_info.num_regions; i++) {
+		struct vfio_region_info reg = { .argsz = sizeof(reg) };
+
+		reg.index = i;
+
+		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
+
+		/* Setup mappings... read/write offsets, mmaps
+		 * For PCI devices, config space is a region */
+	}
+
+	for (i = 0; i < device_info.num_irqs; i++) {
+		struct vfio_irq_info irq = { .argsz = sizeof(irq) };
+
+		irq.index = i;
+
+		ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);
+
+		/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQ_EVENTFDS */
+	}
+
+	/* Gratuitous device reset and go... */
+	ioctl(device, VFIO_DEVICE_RESET);
+
+VFIO User API
+-------------------------------------------------------------------------------
+
+Please see include/linux/vfio.h for complete API documentation.
+
+VFIO bus driver API
+-------------------------------------------------------------------------------
+
+Bus drivers, such as PCI, have three jobs:
+ 1) Add/remove devices from vfio
+ 2) Provide vfio_device_ops for device access
+ 3) Device binding and unbinding
+
+When initialized, the bus driver should enumerate the devices on its
+bus and call vfio_group_add_dev() for each device.  If the bus
+supports hotplug, notifiers should be enabled to track devices being
+added and removed.  vfio_group_del_dev() removes a previously added
+device from vfio.
+
+extern int vfio_group_add_dev(struct device *dev,
+                              const struct vfio_device_ops *ops);
+extern void vfio_group_del_dev(struct device *dev);
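+
+A bus driver's initialization might therefore look something like the
+following sketch, where vfio_pci_ops and vfio_pci_add are hypothetical
+driver symbols:
+
+	static int vfio_pci_add(struct device *dev, void *unused)
+	{
+		return vfio_group_add_dev(dev, &vfio_pci_ops);
+	}
+
+	static int __init vfio_pci_init(void)
+	{
+		/* register a bus notifier for hotplug here, then
+		 * register the devices currently on the bus */
+		return bus_for_each_dev(&pci_bus_type, NULL, NULL,
+					vfio_pci_add);
+	}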
+
+Adding a device registers a vfio_device_ops function pointer structure
+for the device:
+
+struct vfio_device_ops {
+	bool	(*match)(struct device *dev, const char *buf);
+	int	(*claim)(struct device *dev);
+	int	(*open)(void *device_data);
+	void	(*release)(void *device_data);
+	ssize_t	(*read)(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+	ssize_t	(*write)(void *device_data, const char __user *buf,
+			 size_t size, loff_t *ppos);
+	long	(*ioctl)(void *device_data, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
+};
+
+For buses supporting hotplug, all functions are required to be
+implemented.  Non-hotplug buses do not need to implement claim().
+
+match() provides a device specific method for associating a struct
+device to a user provided string.  Many drivers may simply strcmp the
+buffer to dev_name().
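+
+For example (a sketch):
+
+	static bool vfio_pci_match(struct device *dev, const char *buf)
+	{
+		return !strcmp(dev_name(dev), buf);
+	}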
+
+claim() is used when a device is hot-added to a group that is already
+in use.  This is how VFIO requests that a bus driver manually take
+ownership of a device.  The expected call path for this is triggered
+from the bus add notifier.  The bus driver calls vfio_group_add_dev for
+the newly added device; vfio-core determines this group is already in
+use and calls claim on the bus driver.  This triggers the bus driver
+to call its own probe function, including calling vfio_bind_dev to
+mark the device as controlled by vfio.  The device is then available
+for use by the group.
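+
+A claim() callback might simply funnel into the driver's normal probe
+path, as in this sketch (vfio_pci_probe and vfio_pci_claim are
+hypothetical):
+
+	static int vfio_pci_claim(struct device *dev)
+	{
+		/* probe calls vfio_bind_dev() on success */
+		return vfio_pci_probe(to_pci_dev(dev), NULL);
+	}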
+
+The remaining vfio_device_ops are similar to a simplified struct
+file_operations except a device_data pointer is provided rather than a
+file pointer.  The device_data is an opaque structure registered by
+the bus driver when a device is bound to the vfio bus driver:
+
+extern int vfio_bind_dev(struct device *dev, void *device_data);
+extern void *vfio_unbind_dev(struct device *dev);
+
+When the device is unbound from the driver, the bus driver will call
+vfio_unbind_dev() which will return the device_data for any bus driver
+specific cleanup and freeing of the structure.  The vfio_unbind_dev
+call may block if the group is currently in use.
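+
+A sketch of the matching probe/remove pair for a hypothetical PCI bus
+driver (struct vfio_pci_device is a made-up private structure):
+
+	static int vfio_pci_probe(struct pci_dev *pdev,
+				  const struct pci_device_id *id)
+	{
+		struct vfio_pci_device *vdev;
+
+		vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
+		if (!vdev)
+			return -ENOMEM;
+
+		vdev->pdev = pdev;
+		return vfio_bind_dev(&pdev->dev, vdev);
+	}
+
+	static void vfio_pci_remove(struct pci_dev *pdev)
+	{
+		/* may block until the group is no longer in use */
+		kfree(vfio_unbind_dev(&pdev->dev));
+	}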
+
+-------------------------------------------------------------------------------
+
+[1] VFIO was originally an acronym for "Virtual Function I/O" in its
+initial implementation by Tom Lyon while at Cisco.  We've since
+outgrown the acronym, but it's catchy.
+
+[2] "safe" also depends upon a device being "well behaved".  It's
+possible for multi-function devices to have backdoors between
+functions and even for single function devices to have alternative
+access to things like PCI config space through MMIO registers.  To
+guard against the former we can include additional precautions in the
+IOMMU driver to group multi-function PCI devices together
+(iommu=group_mf).  The latter we can't prevent, but the IOMMU should
+still provide isolation.  For PCI, Virtual Functions are the best
+indicator of "well behaved", as these are designed for virtualization
+usage models.
+
+[3] As always there are trade-offs to virtual machine device
+assignment that are beyond the scope of VFIO.  It's expected that
+future IOMMU technologies will reduce some, but maybe not all, of
+these trade-offs.
+
+[4] In this case the device is below a PCI bridge:
+
+-[0000:00]-+-1e.0-[06]--+-0d.0
+                        \-0d.1
+
+00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)



* [PATCH 2/5] vfio: VFIO core header
  2011-12-21 21:42 [PATCH 0/5] VFIO core framework Alex Williamson
  2011-12-21 21:42 ` [PATCH 1/5] vfio: Introduce documentation for VFIO driver Alex Williamson
@ 2011-12-21 21:42 ` Alex Williamson
  2011-12-21 21:42 ` [PATCH 3/5] vfio: VFIO core group interface Alex Williamson
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2011-12-21 21:42 UTC (permalink / raw)
  To: chrisw, aik, david, joerg.roedel, agraf, benve, aafabbri, B08248,
	B07421, avi, kvm, qemu-devel, iommu, linux-pci, linux-kernel

This defines both the user and bus driver APIs.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/ioctl/ioctl-number.txt |    1 
 include/linux/vfio.h                 |  353 ++++++++++++++++++++++++++++++++++
 2 files changed, 354 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/vfio.h

diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index af76fde..69825b0 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -88,6 +88,7 @@ Code  Seq#(hex)	Include File		Comments
 		and kernel/power/user.c
 '8'	all				SNP8023 advanced NIC card
 					<mailto:mcr@solidum.com>
+';'	64-83	linux/vfio.h
 '@'	00-0F	linux/radeonfb.h	conflict!
 '@'	00-0F	drivers/video/aty/aty128fb.c	conflict!
 'A'	00-1F	linux/apm_bios.h	conflict!
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
new file mode 100644
index 0000000..2769dfb
--- /dev/null
+++ b/include/linux/vfio.h
@@ -0,0 +1,353 @@
+/*
+ * VFIO API definition
+ *
+ * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef VFIO_H
+#define VFIO_H
+
+#include <linux/types.h>
+
+#ifdef __KERNEL__	/* Internal VFIO-core/bus driver API */
+
+/**
+ * struct vfio_device_ops - VFIO bus driver device callbacks
+ *
+ * @match: Return true if buf describes the device
+ * @claim: Force driver to attach to device
+ * @open: Called when userspace receives file descriptor for device
+ * @release: Called when userspace releases file descriptor for device
+ * @read: Perform read(2) on device file descriptor
+ * @write: Perform write(2) on device file descriptor
+ * @ioctl: Perform ioctl(2) on device file descriptor, supporting VFIO_DEVICE_*
+ *         operations documented below
+ * @mmap: Perform mmap(2) on a region of the device file descriptor
+ */
+struct vfio_device_ops {
+	bool	(*match)(struct device *dev, const char *buf);
+	int	(*claim)(struct device *dev);
+	int	(*open)(void *device_data);
+	void	(*release)(void *device_data);
+	ssize_t	(*read)(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+	ssize_t	(*write)(void *device_data, const char __user *buf,
+			 size_t count, loff_t *ppos);
+	long	(*ioctl)(void *device_data, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
+};
+
+/**
+ * vfio_group_add_dev() - Add a device to the vfio-core
+ *
+ * @dev: Device to add
+ * @ops: VFIO bus driver callbacks for device
+ *
+ * This registration makes the VFIO core aware of the device, creates
+ * group objects as required and exposes chardevs under /dev/vfio.
+ *
+ * Return 0 on success, errno on failure.
+ */
+extern int vfio_group_add_dev(struct device *dev,
+			      const struct vfio_device_ops *ops);
+
+/**
+ * vfio_group_del_dev() - Remove a device from the vfio-core
+ *
+ * @dev: Device to remove
+ *
+ * Remove a device previously added to the VFIO core, removing groups
+ * and chardevs as necessary.
+ */
+extern void vfio_group_del_dev(struct device *dev);
+
+/**
+ * vfio_bind_dev() - Indicate device is bound to the VFIO bus driver and
+ *                   register private data structure for ops callbacks.
+ *
+ * @dev: Device being bound
+ * @device_data: VFIO bus driver private data
+ *
+ * This registration indicates that a device previously registered with
+ * vfio_group_add_dev() is now available for use by the VFIO core.  When
+ * all devices within a group are available, the group is viable and may
+ * be used by userspace drivers.  Typically called from VFIO bus driver
+ * probe function.
+ *
+ * Return 0 on success, errno on failure
+ */
+extern int vfio_bind_dev(struct device *dev, void *device_data);
+
+/**
+ * vfio_unbind_dev() - Indicate device is unbinding from VFIO bus driver
+ *
+ * @dev: Device being unbound
+ *
+ * De-registration of the device previously registered with vfio_bind_dev()
+ * from VFIO.  Upon completion, the device is no longer available for use by
+ * the VFIO core.  Typically called from the VFIO bus driver remove function.
+ * The VFIO core will attempt to release the device from users and may take
+ * measures to free the device and/or block as necessary.
+ *
+ * Returns pointer to private device_data structure registered with
+ * vfio_bind_dev().
+ */
+extern void *vfio_unbind_dev(struct device *dev);
+
+
+/**
+ * offsetofend(TYPE, MEMBER)
+ *
+ * @TYPE: The type of the structure
+ * @MEMBER: The member within the structure to get the end offset of
+ *
+ * Simple helper macro for dealing with variable sized structures passed
+ * from user space.  This allows us to easily determine if the provided
+ * structure is sized to include various fields.
+ */
+#define offsetofend(TYPE, MEMBER) ({				\
+	TYPE tmp;						\
+	offsetof(TYPE, MEMBER) + sizeof(tmp.MEMBER); })
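+
+/* For example, offsetofend(struct vfio_dma_map, size) yields the
+ * minimum argsz a caller may supply for VFIO_IOMMU_MAP_DMA. */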
+
+#endif /* __KERNEL__ */
+
+/* Kernel & User level defines for VFIO IOCTLs. */
+
+/*
+ * The IOCTL interface is designed for extensibility by embedding the
+ * structure length (argsz) and flags into structures passed between
+ * kernel and userspace.  We therefore use the _IO() macro for these
+ * defines to avoid implicitly embedding a size into the ioctl request.
+ * As structure fields are added, argsz will increase to match and flag
+ * bits will be defined to indicate additional fields with valid data.
+ * It's *always* the caller's responsibility to indicate the size of
+ * the structure passed by setting argsz appropriately.
+ */
+
+#define VFIO_TYPE	(';')
+#define VFIO_BASE	100
+
+/* --------------- IOCTLs for GROUP file descriptors --------------- */
+
+/**
+ * VFIO_GROUP_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 0, struct vfio_group_info)
+ *
+ * Retrieve information about the group.  Fills in provided
+ * struct vfio_group_info.  Caller sets argsz.
+ */
+struct vfio_group_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_GROUP_FLAGS_VIABLE		(1 << 0)
+#define VFIO_GROUP_FLAGS_MM_LOCKED	(1 << 1)
+};
+
+#define VFIO_GROUP_GET_INFO		_IO(VFIO_TYPE, VFIO_BASE + 0)
+
+/**
+ * VFIO_GROUP_MERGE - _IOW(VFIO_TYPE, VFIO_BASE + 1, __s32)
+ *
+ * Merge group indicated by passed file descriptor into current group.
+ * Current group may be in use, group indicated by file descriptor
+ * cannot be in use (no open iommu or devices).
+ */
+#define VFIO_GROUP_MERGE		_IOW(VFIO_TYPE, VFIO_BASE + 1, __s32)
+
+/**
+ * VFIO_GROUP_UNMERGE - _IO(VFIO_TYPE, VFIO_BASE + 2)
+ *
+ * Remove the current group from a merged set.  The current group cannot
+ * have any open devices.
+ */
+#define VFIO_GROUP_UNMERGE		_IO(VFIO_TYPE, VFIO_BASE + 2)
+
+/**
+ * VFIO_GROUP_GET_IOMMU_FD - _IO(VFIO_TYPE, VFIO_BASE + 3)
+ *
+ * Return a new file descriptor for the IOMMU object.  The IOMMU object
+ * is shared among members of a merged group.
+ */
+#define VFIO_GROUP_GET_IOMMU_FD		_IO(VFIO_TYPE, VFIO_BASE + 3)
+
+/**
+ * VFIO_GROUP_GET_DEVICE_FD - _IOW(VFIO_TYPE, VFIO_BASE + 4, char)
+ *
+ * Return a new file descriptor for the device object described by
+ * the provided char array.
+ */
+#define VFIO_GROUP_GET_DEVICE_FD	_IOW(VFIO_TYPE, VFIO_BASE + 4, char)
+
+
+/* --------------- IOCTLs for IOMMU file descriptors --------------- */
+
+/**
+ * VFIO_IOMMU_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 5, struct vfio_iommu_info)
+ *
+ * Retrieve information about the IOMMU object.  Fills in provided
+ * struct vfio_iommu_info.  Caller sets argsz.
+ */
+struct vfio_iommu_info {
+	__u32	argsz;
+	__u32	flags;
+	__u64	iova_max;	/* Maximum IOVA address */
+	__u64	iova_min;	/* Minimum IOVA address */
+	__u64	pgsize_bitmap;	/* Bitmap of supported page sizes */
+};
+
+#define	VFIO_IOMMU_GET_INFO		_IO(VFIO_TYPE, VFIO_BASE + 5)
+
+/**
+ * VFIO_IOMMU_MAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 6, struct vfio_dma_map)
+ *
+ * Map process virtual addresses to IO virtual addresses using the
+ * provided struct vfio_dma_map.  Caller sets argsz.  READ and/or
+ * WRITE required.
+ */
+struct vfio_dma_map {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DMA_MAP_FLAG_READ	(1 << 0)	/* readable from device */
+#define VFIO_DMA_MAP_FLAG_WRITE	(1 << 1)	/* writable from device */
+	__u64	vaddr;		/* Process virtual address */
+	__u64	iova;		/* IO virtual address */
+	__u64	size;		/* Size of mapping (bytes) */
+};
+
+#define	VFIO_IOMMU_MAP_DMA		_IO(VFIO_TYPE, VFIO_BASE + 6)
+
+/**
+ * VFIO_IOMMU_UNMAP_DMA - _IOW(VFIO_TYPE, VFIO_BASE + 7, struct vfio_dma_unmap)
+ *
+ * Unmap IO virtual addresses using the provided struct vfio_dma_unmap.
+ * Caller sets argsz.
+ */
+struct vfio_dma_unmap {
+	__u32	argsz;
+	__u32	flags;
+	__u64	iova;		/* IO virtual address */
+	__u64	size;		/* Size of mapping (bytes) */
+};
+
+#define	VFIO_IOMMU_UNMAP_DMA		_IO(VFIO_TYPE, VFIO_BASE + 7)
+
+
+/* --------------- IOCTLs for DEVICE file descriptors --------------- */
+
+/**
+ * VFIO_DEVICE_GET_INFO - _IOR(VFIO_TYPE, VFIO_BASE + 8,
+ *			       struct vfio_device_info)
+ *
+ * Retrieve information about the device.  Fills in provided
+ * struct vfio_device_info.  Caller sets argsz.
+ */
+struct vfio_device_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
+	__u32	num_regions;	/* Max region index + 1 */
+	__u32	num_irqs;	/* Max IRQ index + 1 */
+};
+
+#define VFIO_DEVICE_GET_INFO		_IO(VFIO_TYPE, VFIO_BASE + 8)
+
+/**
+ * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
+ *				       struct vfio_region_info)
+ *
+ * Retrieve information about a device region.  Caller provides
+ * struct vfio_region_info with index value set.  Caller sets argsz.
+ * Implementation of region mapping is bus driver specific.  This is
+ * intended to describe MMIO, I/O port, as well as bus specific
+ * regions (ex. PCI config space).  Zero sized regions may be used
+ * to describe unimplemented regions (ex. unimplemented PCI BARs).
+ */
+struct vfio_region_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_REGION_INFO_FLAG_MMAP	(1 << 0) /* Region supports mmap */
+#define VFIO_REGION_INFO_FLAG_RO	(1 << 1) /* Region is read-only */
+	__u32	index;		/* Region index */
+	__u32	resv;		/* Reserved for alignment */
+	__u64	size;		/* Region size (bytes) */
+	__u64	offset;		/* Region offset from start of device fd */
+};
+
+#define VFIO_DEVICE_GET_REGION_INFO	_IO(VFIO_TYPE, VFIO_BASE + 9)
+
+/**
+ * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 10,
+ *				    struct vfio_irq_info)
+ *
+ * Retrieve information about a device IRQ.  Caller provides
+ * struct vfio_irq_info with index value set.  Caller sets argsz.
+ * Implementation of IRQ mapping is bus driver specific.  Indexes
+ * supporting multiple IRQs are primarily intended to support
+ * MSI-like interrupt blocks.  Zero-count IRQ blocks may be used
+ * to describe unimplemented interrupt types (ex. PCI MSI-X).
+ */
+struct vfio_irq_info {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_IRQ_INFO_FLAG_LEVEL	(1 << 0) /* Level (1) vs Edge (0) */
+	__u32	index;		/* IRQ index */
+	__u32	count;		/* Number of IRQs within this index */
+};
+
+#define VFIO_DEVICE_GET_IRQ_INFO	_IO(VFIO_TYPE, VFIO_BASE + 10)
+
+/**
+ * VFIO_DEVICE_SET_IRQ_EVENTFDS - _IOW(VFIO_TYPE, VFIO_BASE + 11,
+ *				       struct vfio_irq_eventfds)
+ *
+ * Set eventfds for IRQs using the struct vfio_irq_eventfds provided.
+ * Setting the eventfds also enables the interrupt.  Caller sets all fields.
+ */
+struct vfio_irq_eventfds {
+	__u32	argsz;
+	__u32	flags;
+	__u32	index;		/* IRQ index */
+	__u32	count;		/* Number of eventfds */
+	__s32	eventfds[];	/* eventfd for sub-index, -1 to unset */
+};
+
+#define VFIO_DEVICE_SET_IRQ_EVENTFDS	_IO(VFIO_TYPE, VFIO_BASE + 11)
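+
+/*
+ * Userspace sketch: enable two vectors on a hypothetical MSI-like
+ * block at index 1:
+ *
+ *	struct vfio_irq_eventfds *fds;
+ *	size_t sz = sizeof(*fds) + 2 * sizeof(__s32);
+ *
+ *	fds = malloc(sz);
+ *	fds->argsz = sz;
+ *	fds->flags = 0;
+ *	fds->index = 1;
+ *	fds->count = 2;
+ *	fds->eventfds[0] = eventfd(0, 0);
+ *	fds->eventfds[1] = eventfd(0, 0);
+ *
+ *	ioctl(device_fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds);
+ */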
+
+/**
+ * VFIO_DEVICE_UNMASK_IRQ - _IOW(VFIO_TYPE, VFIO_BASE + 12,
+ *				 struct vfio_unmask_irq)
+ *
+ * Unmask the IRQ described by the provided struct vfio_unmask_irq.
+ * Level triggered IRQs are masked when posted to userspace and must
+ * be unmasked to re-trigger.  Caller sets all fields.
+ */
+struct vfio_unmask_irq {
+	__u32	argsz;
+	__u32	flags;
+	__u32	index;		/* IRQ index */
+	__u32	subindex;	/* Sub-index to unmask */
+};
+
+#define VFIO_DEVICE_UNMASK_IRQ		_IO(VFIO_TYPE, VFIO_BASE + 12)
+
+/**
+ * VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFDS - _IOW(VFIO_TYPE, VFIO_BASE + 13,
+ *					      struct vfio_irq_eventfds)
+ *
+ * Set eventfds to be used for unmasking IRQs using the provided
+ * struct vfio_irq_eventfds.  Same semantics as VFIO_DEVICE_SET_IRQ_EVENTFDS.
+ * Caller sets all fields.
+ */
+#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFDS	_IO(VFIO_TYPE, VFIO_BASE + 13)
+
+/**
+ * VFIO_DEVICE_RESET - _IO(VFIO_TYPE, VFIO_BASE + 14)
+ *
+ * Reset a device.
+ */
+#define VFIO_DEVICE_RESET		_IO(VFIO_TYPE, VFIO_BASE + 14)
+
+#endif /* VFIO_H */



* [PATCH 3/5] vfio: VFIO core group interface
  2011-12-21 21:42 [PATCH 0/5] VFIO core framework Alex Williamson
  2011-12-21 21:42 ` [PATCH 1/5] vfio: Introduce documentation for VFIO driver Alex Williamson
  2011-12-21 21:42 ` [PATCH 2/5] vfio: VFIO core header Alex Williamson
@ 2011-12-21 21:42 ` Alex Williamson
  2011-12-21 21:42 ` [PATCH 4/5] vfio: VFIO core IOMMU mapping support Alex Williamson
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2011-12-21 21:42 UTC (permalink / raw)
  To: chrisw, aik, david, joerg.roedel, agraf, benve, aafabbri, B08248,
	B07421, avi, kvm, qemu-devel, iommu, linux-pci, linux-kernel

This provides the base group management with conduits to the
IOMMU driver and VFIO bus drivers.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/vfio/vfio_main.c    | 1201 +++++++++++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio_private.h |   36 +
 2 files changed, 1237 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vfio/vfio_main.c
 create mode 100644 drivers/vfio/vfio_private.h

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
new file mode 100644
index 0000000..e4f0f92
--- /dev/null
+++ b/drivers/vfio/vfio_main.c
@@ -0,0 +1,1201 @@
+/*
+ * VFIO framework
+ *
+ * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/cdev.h>
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/file.h>
+#include <linux/anon_inodes.h>
+#include <linux/fs.h>
+#include <linux/idr.h>
+#include <linux/iommu.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/wait.h>
+
+#include "vfio_private.h"
+
+#define DRIVER_VERSION	"0.2"
+#define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
+#define DRIVER_DESC	"VFIO - User Level meta-driver"
+
+static struct vfio {
+	dev_t			devt;
+	struct cdev		cdev;
+	struct list_head	group_list;
+	struct mutex		lock;
+	struct kref		kref;
+	struct class		*class;
+	struct idr		idr;
+	wait_queue_head_t	release_q;
+} vfio;
+
+static const struct file_operations vfio_group_fops;
+
+struct vfio_group {
+	dev_t			devt;
+	unsigned int		groupid;
+	struct bus_type		*bus;
+	struct vfio_iommu	*iommu;
+	struct list_head	device_list;
+	struct list_head	iommu_next;
+	struct list_head	group_next;
+	struct device		*dev;
+	struct kobject		*devices_kobj;
+	int			refcnt;
+	bool			tainted;
+};
+
+struct vfio_device {
+	struct device			*dev;
+	const struct vfio_device_ops	*ops;
+	struct vfio_group		*group;
+	struct list_head		device_next;
+	bool				attached;
+	bool				deleteme;
+	int				refcnt;
+	void				*device_data;
+};
+
+/*
+ * Helper functions called under vfio.lock
+ */
+
+/* Return true if any devices within a group are opened */
+static bool __vfio_group_devs_inuse(struct vfio_group *group)
+{
+	struct list_head *pos;
+
+	list_for_each(pos, &group->device_list) {
+		struct vfio_device *device;
+
+		device = list_entry(pos, struct vfio_device, device_next);
+		if (device->refcnt)
+			return true;
+	}
+	return false;
+}
+
+/* Return true if any of the groups attached to an iommu are opened.
+ * We can only tear apart merged groups when nothing is left open. */
+static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
+{
+	struct list_head *pos;
+
+	list_for_each(pos, &iommu->group_list) {
+		struct vfio_group *group;
+
+		group = list_entry(pos, struct vfio_group, iommu_next);
+		if (group->refcnt)
+			return true;
+	}
+	return false;
+}
+
+/* An iommu is "in use" if it has a file descriptor open or if any of
+ * the groups assigned to the iommu have devices open. */
+static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
+{
+	struct list_head *pos;
+
+	if (iommu->refcnt)
+		return true;
+
+	list_for_each(pos, &iommu->group_list) {
+		struct vfio_group *group;
+
+		group = list_entry(pos, struct vfio_group, iommu_next);
+
+		if (__vfio_group_devs_inuse(group))
+			return true;
+	}
+	return false;
+}
+
+static void __vfio_group_set_iommu(struct vfio_group *group,
+				   struct vfio_iommu *iommu)
+{
+	if (group->iommu)
+		list_del(&group->iommu_next);
+	if (iommu)
+		list_add(&group->iommu_next, &iommu->group_list);
+
+	group->iommu = iommu;
+}
+
+static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
+				    struct vfio_device *device)
+{
+	if (WARN_ON(!iommu->domain && device->attached))
+		return;
+
+	if (!device->attached)
+		return;
+
+	iommu_detach_device(iommu->domain, device->dev);
+	device->attached = false;
+}
+
+static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
+				      struct vfio_group *group)
+{
+	struct list_head *pos;
+
+	list_for_each(pos, &group->device_list) {
+		struct vfio_device *device;
+
+		device = list_entry(pos, struct vfio_device, device_next);
+		__vfio_iommu_detach_dev(iommu, device);
+	}
+}
+
+static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
+				   struct vfio_device *device)
+{
+	int ret;
+
+	if (WARN_ON(device->attached || !iommu || !iommu->domain))
+		return -EINVAL;
+
+	ret = iommu_attach_device(iommu->domain, device->dev);
+	if (!ret)
+		device->attached = true;
+
+	return ret;
+}
+
+static int __vfio_iommu_attach_group(struct vfio_iommu *iommu,
+				     struct vfio_group *group)
+{
+	struct list_head *pos;
+
+	list_for_each(pos, &group->device_list) {
+		struct vfio_device *device;
+		int ret;
+
+		device = list_entry(pos, struct vfio_device, device_next);
+		ret = __vfio_iommu_attach_dev(iommu, device);
+		if (ret) {
+			__vfio_iommu_detach_group(iommu, group);
+			return ret;
+		}
+	}
+	return 0;
+}
+
+/* The iommu is viable, ie. ready to be configured, when all the devices
+ * for all the groups attached to the iommu are bound to their vfio device
+ * drivers (ex. vfio-pci).  This sets the device_data private data pointer. */
+static bool __vfio_iommu_viable(struct vfio_iommu *iommu)
+{
+	struct list_head *gpos, *dpos;
+
+	list_for_each(gpos, &iommu->group_list) {
+		struct vfio_group *group;
+		group = list_entry(gpos, struct vfio_group, iommu_next);
+
+		if (group->tainted)
+			return false;
+
+		list_for_each(dpos, &group->device_list) {
+			struct vfio_device *device;
+			device = list_entry(dpos,
+					    struct vfio_device, device_next);
+
+			if (!device->device_data)
+				return false;
+		}
+	}
+	return true;
+}
+
+static void __vfio_iommu_close(struct vfio_iommu *iommu)
+{
+	struct list_head *pos;
+
+	if (!iommu->domain)
+		return;
+
+	list_for_each(pos, &iommu->group_list) {
+		struct vfio_group *group;
+		group = list_entry(pos, struct vfio_group, iommu_next);
+
+		__vfio_iommu_detach_group(iommu, group);
+	}
+
+	vfio_iommu_unmapall(iommu);
+
+	iommu_domain_free(iommu->domain);
+	iommu->domain = NULL;
+	iommu->mm = NULL;
+}
+
+/* Open the IOMMU.  This gates all access to the iommu or device file
+ * descriptors and sets current->mm as the exclusive user. */
+static int __vfio_iommu_open(struct vfio_iommu *iommu)
+{
+	struct list_head *pos;
+	int ret;
+
+	if (!__vfio_iommu_viable(iommu))
+		return -EBUSY;
+
+	if (iommu->domain)
+		return -EINVAL;
+
+	iommu->domain = iommu_domain_alloc(iommu->bus);
+	if (!iommu->domain)
+		return -ENOMEM;
+
+	list_for_each(pos, &iommu->group_list) {
+		struct vfio_group *group;
+		group = list_entry(pos, struct vfio_group, iommu_next);
+
+		ret = __vfio_iommu_attach_group(iommu, group);
+		if (ret) {
+			__vfio_iommu_close(iommu);
+			return ret;
+		}
+	}
+
+	iommu->cache = (iommu_domain_has_cap(iommu->domain,
+					     IOMMU_CAP_CACHE_COHERENCY) != 0);
+	iommu->mm = current->mm;
+
+	return 0;
+}
+
+/* Actively try to tear down the iommu and merged groups.  If there are no
+ * open iommu or device fds, we close the iommu.  If we close the iommu and
+ * there are also no open group fds, we can further dissolve the group to
+ * iommu association and free the iommu data structure. */
+static int __vfio_try_dissolve_iommu(struct vfio_iommu *iommu)
+{
+
+	if (__vfio_iommu_inuse(iommu))
+		return -EBUSY;
+
+	__vfio_iommu_close(iommu);
+
+	if (!__vfio_iommu_groups_inuse(iommu)) {
+		struct list_head *pos, *ppos;
+
+		list_for_each_safe(pos, ppos, &iommu->group_list) {
+			struct vfio_group *group;
+
+			group = list_entry(pos, struct vfio_group, iommu_next);
+			__vfio_group_set_iommu(group, NULL);
+		}
+
+		kfree(iommu);
+	}
+
+	return 0;
+}
+
+static struct vfio_device *__vfio_lookup_dev(struct device *dev)
+{
+	struct list_head *gpos;
+	unsigned int groupid;
+
+	if (iommu_device_group(dev, &groupid))
+		return NULL;
+
+	list_for_each(gpos, &vfio.group_list) {
+		struct vfio_group *group;
+		struct list_head *dpos;
+
+		group = list_entry(gpos, struct vfio_group, group_next);
+
+		if (group->groupid != groupid || group->bus != dev->bus)
+			continue;
+
+		list_for_each(dpos, &group->device_list) {
+			struct vfio_device *device;
+
+			device = list_entry(dpos,
+					    struct vfio_device, device_next);
+
+			if (device->dev == dev)
+				return device;
+		}
+	}
+	return NULL;
+}
+
+static struct vfio_group *__vfio_dev_to_group(struct device *dev,
+					      unsigned int groupid)
+{
+	struct list_head *pos;
+	struct vfio_group *group;
+
+	list_for_each(pos, &vfio.group_list) {
+		group = list_entry(pos, struct vfio_group, group_next);
+		if (group->groupid == groupid && group->bus == dev->bus)
+			return group;
+	}
+
+	return NULL;
+}
+
+static struct vfio_device *__vfio_group_find_device(struct vfio_group *group,
+						    struct device *dev)
+{
+	struct list_head *pos;
+	struct vfio_device *device;
+
+	list_for_each(pos, &group->device_list) {
+		device = list_entry(pos, struct vfio_device, device_next);
+		if (device->dev == dev)
+			return device;
+	}
+
+	return NULL;
+}
+
+static struct vfio_group *__vfio_create_group(struct device *dev,
+					      unsigned int groupid)
+{
+	struct vfio_group *group;
+	int ret, minor;
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+
+	/* We can't recover from this.  If we can't even get memory for
+	 * the group, we can't track the device and we don't have a place
+	 * to mark the groupid tainted.  Failures below should at least
+	 * return a tainted group. */
+	BUG_ON(!group);
+
+	group->groupid = groupid;
+	group->bus = dev->bus;
+	INIT_LIST_HEAD(&group->device_list);
+
+	group->tainted = true;
+	list_add(&group->group_next, &vfio.group_list);
+
+again:
+	if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0))
+		goto out;
+
+	ret = idr_get_new(&vfio.idr, group, &minor);
+	if (ret == -EAGAIN)
+		goto again;
+	if (ret || minor > MINORMASK) {
+		if (minor > MINORMASK)
+			idr_remove(&vfio.idr, minor);
+		goto out;
+	}
+
+	group->devt = MKDEV(MAJOR(vfio.devt), minor);
+	group->dev = device_create(vfio.class, NULL, group->devt, group,
+				   "%s:%u", dev->bus->name, groupid);
+	if (IS_ERR(group->dev))
+		goto out_device;
+
+	/* Create a place to link individual devices in sysfs */
+	group->devices_kobj = kobject_create_and_add("devices",
+						     &group->dev->kobj);
+	if (!group->devices_kobj)
+		goto out_kobj;
+
+	group->tainted = false;
+
+	return group;
+
+out_kobj:
+	device_destroy(vfio.class, group->devt);
+out_device:
+	group->dev = NULL;
+	group->devt = 0;
+	idr_remove(&vfio.idr, minor);
+out:
+	printk(KERN_WARNING "vfio: Failed to complete setup on group %u, "
+	       "marking as unusable\n", groupid);
+
+	return group;
+}
+
+static struct vfio_iommu *vfio_create_iommu(struct vfio_group *group)
+{
+	struct vfio_iommu *iommu;
+
+	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+	if (!iommu)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&iommu->group_list);
+	INIT_LIST_HEAD(&iommu->dma_list);
+	mutex_init(&iommu->lock);
+	iommu->bus = group->bus;
+
+	return iommu;
+}
+
+/* All release paths simply decrement the refcnt, attempt to teardown
+ * the iommu and merged groups, and wakeup anything that might be
+ * waiting if we successfully dissolve anything. */
+static int vfio_do_release(int *refcnt, struct vfio_iommu *iommu)
+{
+	bool wake;
+
+	mutex_lock(&vfio.lock);
+
+	(*refcnt)--;
+	wake = (__vfio_try_dissolve_iommu(iommu) == 0);
+
+	mutex_unlock(&vfio.lock);
+
+	if (wake)
+		wake_up(&vfio.release_q);
+
+	return 0;
+}
+
+/*
+ * Device fops - passthrough to vfio device driver w/ device_data
+ */
+static int vfio_device_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_device *device = filep->private_data;
+
+	vfio_do_release(&device->refcnt, device->group->iommu);
+
+	device->ops->release(device->device_data);
+
+	return 0;
+}
+
+static long vfio_device_unl_ioctl(struct file *filep,
+				  unsigned int cmd, unsigned long arg)
+{
+	struct vfio_device *device = filep->private_data;
+
+	return device->ops->ioctl(device->device_data, cmd, arg);
+}
+
+static ssize_t vfio_device_read(struct file *filep, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct vfio_device *device = filep->private_data;
+
+	return device->ops->read(device->device_data, buf, count, ppos);
+}
+
+static ssize_t vfio_device_write(struct file *filep, const char __user *buf,
+				 size_t count, loff_t *ppos)
+{
+	struct vfio_device *device = filep->private_data;
+
+	return device->ops->write(device->device_data, buf, count, ppos);
+}
+
+static int vfio_device_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct vfio_device *device = filep->private_data;
+
+	return device->ops->mmap(device->device_data, vma);
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_device_compat_ioctl(struct file *filep,
+				     unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_device_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+const struct file_operations vfio_device_fops = {
+	.owner		= THIS_MODULE,
+	.release	= vfio_device_release,
+	.read		= vfio_device_read,
+	.write		= vfio_device_write,
+	.unlocked_ioctl	= vfio_device_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_device_compat_ioctl,
+#endif
+	.mmap		= vfio_device_mmap,
+};
+
+/*
+ * Group fops
+ */
+static int vfio_group_open(struct inode *inode, struct file *filep)
+{
+	struct vfio_group *group;
+	int ret = 0;
+
+	mutex_lock(&vfio.lock);
+
+	group = idr_find(&vfio.idr, iminor(inode));
+
+	if (!group) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	filep->private_data = group;
+
+	if (!group->iommu) {
+		struct vfio_iommu *iommu;
+
+		iommu = vfio_create_iommu(group);
+		if (IS_ERR(iommu)) {
+			ret = PTR_ERR(iommu);
+			goto out;
+		}
+		__vfio_group_set_iommu(group, iommu);
+	}
+	group->refcnt++;
+
+out:
+	mutex_unlock(&vfio.lock);
+
+	return ret;
+}
+
+static int vfio_group_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_group *group = filep->private_data;
+
+	return vfio_do_release(&group->refcnt, group->iommu);
+}
+
+/* Attempt to merge the group pointed to by fd into group.  The merge-ee
+ * group must not have an iommu or any devices open because we cannot
+ * maintain that context across the merge.  The merge-er group can be
+ * in use. */
+static int vfio_group_merge(struct vfio_group *group, int fd)
+{
+	struct vfio_group *new;
+	struct vfio_iommu *old_iommu;
+	struct file *file;
+	int ret = 0;
+	bool opened = false;
+
+	mutex_lock(&vfio.lock);
+
+	file = fget(fd);
+	if (!file) {
+		ret = -EBADF;
+		goto out_noput;
+	}
+
+	/* Sanity check, is this really our fd? */
+	if (file->f_op != &vfio_group_fops) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	new = file->private_data;
+
+	if (!new || new == group || !new->iommu ||
+	    new->iommu->domain || new->bus != group->bus) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* We need to attach all the devices to each domain separately
+	 * in order to validate that the capabilities match for both.  */
+	ret = __vfio_iommu_open(new->iommu);
+	if (ret)
+		goto out;
+
+	if (!group->iommu->domain) {
+		ret = __vfio_iommu_open(group->iommu);
+		if (ret)
+			goto out;
+		opened = true;
+	}
+
+	/* If cache coherency doesn't match we'd potentially need to
+	 * remap existing iommu mappings in the merge-er domain.
+	 * Not worth the effort to allow this currently. */
+	if (iommu_domain_has_cap(group->iommu->domain,
+				 IOMMU_CAP_CACHE_COHERENCY) !=
+	    iommu_domain_has_cap(new->iommu->domain,
+				 IOMMU_CAP_CACHE_COHERENCY)) {
+		__vfio_iommu_close(new->iommu);
+		if (opened)
+			__vfio_iommu_close(group->iommu);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	/* Close the iommu for the merge-ee and attach all its devices
+	 * to the merge-er iommu. */
+	__vfio_iommu_close(new->iommu);
+
+	ret = __vfio_iommu_attach_group(group->iommu, new);
+	if (ret)
+		goto out;
+
+	/* set_iommu unlinks new from the iommu, so save a pointer to it */
+	old_iommu = new->iommu;
+	__vfio_group_set_iommu(new, group->iommu);
+	kfree(old_iommu);
+
+out:
+	fput(file);
+out_noput:
+	mutex_unlock(&vfio.lock);
+	return ret;
+}
+
+/* Unmerge a group */
+static int vfio_group_unmerge(struct vfio_group *group)
+{
+	struct vfio_iommu *iommu;
+	int ret = 0;
+
+	/* Since the merge-out group is already opened, it needs to
+	 * have an iommu struct associated with it. */
+	iommu = vfio_create_iommu(group);
+	if (IS_ERR(iommu))
+		return PTR_ERR(iommu);
+
+	mutex_lock(&vfio.lock);
+
+	if (list_is_singular(&group->iommu->group_list)) {
+		ret = -EINVAL; /* Not merged group */
+		goto out;
+	}
+
+	/* We can't merge-out a group with devices still in use. */
+	if (__vfio_group_devs_inuse(group)) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	__vfio_iommu_detach_group(group->iommu, group);
+	__vfio_group_set_iommu(group, iommu);
+
+out:
+	if (ret)
+		kfree(iommu);
+	mutex_unlock(&vfio.lock);
+	return ret;
+}
+
+/* Get a new iommu file descriptor.  This will open the iommu, setting
+ * the current->mm ownership if it's not already set. */
+static int vfio_group_get_iommu_fd(struct vfio_group *group)
+{
+	int ret = 0;
+
+	mutex_lock(&vfio.lock);
+
+	if (!group->iommu->domain) {
+		ret = __vfio_iommu_open(group->iommu);
+		if (ret)
+			goto out;
+	}
+
+	ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,
+			       group->iommu, O_RDWR);
+	if (ret < 0)
+		goto out;
+
+	group->iommu->refcnt++;
+out:
+	mutex_unlock(&vfio.lock);
+	return ret;
+}
+
+/* Get a new device file descriptor.  This will open the iommu, setting
+ * the current->mm ownership if it's not already set.  It's difficult to
+ * specify the requirements for matching a user supplied buffer to a
+ * device, so we use a vfio driver callback to test for a match.  For
+ * PCI, dev_name(dev) is unique, but other drivers may require including
+ * a parent device string. */
+static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
+{
+	struct vfio_iommu *iommu = group->iommu;
+	struct list_head *gpos;
+	int ret = -ENODEV;
+
+	mutex_lock(&vfio.lock);
+
+	if (!iommu->domain) {
+		ret = __vfio_iommu_open(iommu);
+		if (ret)
+			goto out;
+	}
+
+	list_for_each(gpos, &iommu->group_list) {
+		struct list_head *dpos;
+
+		group = list_entry(gpos, struct vfio_group, iommu_next);
+
+		list_for_each(dpos, &group->device_list) {
+			struct vfio_device *device;
+			struct file *file;
+
+			device = list_entry(dpos,
+					    struct vfio_device, device_next);
+
+			if (!device->ops->match(device->dev, buf))
+				continue;
+
+			ret = device->ops->open(device->device_data);
+			if (ret)
+				goto out;
+
+			/* We can't use anon_inode_getfd() like above,
+			 * because we need to modify the f_mode flags
+			 * directly to allow more than just ioctls */
+			ret = get_unused_fd();
+			if (ret < 0) {
+				device->ops->release(device->device_data);
+				goto out;
+			}
+
+			file = anon_inode_getfile("[vfio-device]",
+						  &vfio_device_fops,
+						  device, O_RDWR);
+			if (IS_ERR(file)) {
+				put_unused_fd(ret);
+				ret = PTR_ERR(file);
+				device->ops->release(device->device_data);
+				goto out;
+			}
+
+			/* Todo: add an anon_inode interface to do
+			 * this.  Appears to be missing by lack of
+			 * need rather than explicitly prevented.
+			 * Now there's need. */
+			file->f_mode |= (FMODE_LSEEK |
+					 FMODE_PREAD |
+					 FMODE_PWRITE);
+
+			fd_install(ret, file);
+
+			device->refcnt++;
+			goto out;
+		}
+	}
+out:
+	mutex_unlock(&vfio.lock);
+	return ret;
+}
+
+static long vfio_group_unl_ioctl(struct file *filep,
+				 unsigned int cmd, unsigned long arg)
+{
+	struct vfio_group *group = filep->private_data;
+
+	if (cmd == VFIO_GROUP_GET_INFO) {
+		struct vfio_group_info info;
+		unsigned long minsz;
+
+		minsz = offsetofend(struct vfio_group_info, flags);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		mutex_lock(&vfio.lock);
+		if (__vfio_iommu_viable(group->iommu))
+			info.flags |= VFIO_GROUP_FLAGS_VIABLE;
+		mutex_unlock(&vfio.lock);
+
+		if (group->iommu->mm)
+			info.flags |= VFIO_GROUP_FLAGS_MM_LOCKED;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+	}
+
+	/* Below commands are restricted once the mm is set */
+	if (group->iommu->mm && group->iommu->mm != current->mm)
+		return -EPERM;
+
+	if (cmd == VFIO_GROUP_MERGE) {
+		int fd;
+
+		if (get_user(fd, (int __user *)arg))
+			return -EFAULT;
+		if (fd < 0)
+			return -EINVAL;
+
+		return vfio_group_merge(group, fd);
+
+	} else if (cmd == VFIO_GROUP_UNMERGE) {
+
+		return vfio_group_unmerge(group);
+
+	} else if (cmd == VFIO_GROUP_GET_IOMMU_FD) {
+
+		return vfio_group_get_iommu_fd(group);
+
+	} else if (cmd == VFIO_GROUP_GET_DEVICE_FD) {
+		char *buf;
+		int ret;
+
+		buf = strndup_user((const char __user *)arg, PAGE_SIZE);
+		if (IS_ERR(buf))
+			return PTR_ERR(buf);
+
+		ret = vfio_group_get_device_fd(group, buf);
+		kfree(buf);
+		return ret;
+	}
+
+	return -ENOTTY;
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_group_compat_ioctl(struct file *filep,
+				    unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_group_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+static const struct file_operations vfio_group_fops = {
+	.owner		= THIS_MODULE,
+	.open		= vfio_group_open,
+	.release	= vfio_group_release,
+	.unlocked_ioctl	= vfio_group_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_group_compat_ioctl,
+#endif
+};
+
+/* iommu fd release hook */
+int vfio_release_iommu(struct vfio_iommu *iommu)
+{
+	return vfio_do_release(&iommu->refcnt, iommu);
+}
+
+/*
+ * VFIO driver API
+ */
+
+/* Add a new device to the vfio framework with associated vfio driver
+ * callbacks.  This is the entry point for vfio drivers to register devices. */
+int vfio_group_add_dev(struct device *dev, const struct vfio_device_ops *ops)
+{
+	struct vfio_group *group;
+	struct vfio_device *device;
+	unsigned int groupid;
+	int ret = 0;
+
+	if (iommu_device_group(dev, &groupid))
+		return -ENODEV;
+
+	if (WARN_ON(!ops))
+		return -EINVAL;
+
+	mutex_lock(&vfio.lock);
+
+	group = __vfio_dev_to_group(dev, groupid);
+	if (!group)
+		group = __vfio_create_group(dev, groupid); /* No fail */
+
+	device = __vfio_group_find_device(group, dev);
+	if (!device) {
+		device = kzalloc(sizeof(*device), GFP_KERNEL);
+		if (WARN_ON(!device)) {
+			/* We created the group, but can't add this device;
+			 * taint the group to prevent it from being used.
+			 * If it's already in use, we have to BUG_ON.
+			 * XXX - Kill the user process? */
+			group->tainted = true;
+			BUG_ON(group->iommu && group->iommu->domain);
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		list_add(&device->device_next, &group->device_list);
+		device->dev = dev;
+		device->ops = ops;
+		device->group = group;
+
+		if (!group->devices_kobj ||
+		    sysfs_create_link(group->devices_kobj,
+				      &dev->kobj, dev_name(dev)))
+			printk(KERN_WARNING
+			       "vfio: Unable to create sysfs link to %s\n",
+			       dev_name(dev));
+
+		if (group->iommu && group->iommu->domain) {
+			printk(KERN_WARNING "Adding device %s to group %s:%u "
+			       "while group is already in use!!\n",
+			       dev_name(dev), group->bus->name, group->groupid);
+
+			mutex_unlock(&vfio.lock);
+
+			ret = ops->claim(dev);
+
+			BUG_ON(ret);
+
+			goto out_unlocked;
+		}
+
+	}
+out:
+	mutex_unlock(&vfio.lock);
+out_unlocked:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_group_add_dev);
+
+/* Remove a device from the vfio framework */
+void vfio_group_del_dev(struct device *dev)
+{
+	struct vfio_group *group;
+	struct vfio_device *device;
+	unsigned int groupid;
+
+	if (iommu_device_group(dev, &groupid))
+		return;
+
+	mutex_lock(&vfio.lock);
+
+	group = __vfio_dev_to_group(dev, groupid);
+
+	if (WARN_ON(!group))
+		goto out;
+
+	device = __vfio_group_find_device(group, dev);
+
+	if (WARN_ON(!device))
+		goto out;
+
+	/* If the device is bound to a bus driver, we'll get a chance to
+	 * unbind it first.  Just mark it to be removed after unbind. */
+	if (device->device_data) {
+		device->deleteme = true;
+		goto out;
+	}
+
+	if (device->attached)
+		__vfio_iommu_detach_dev(group->iommu, device);
+
+	list_del(&device->device_next);
+
+	if (group->devices_kobj)
+		sysfs_remove_link(group->devices_kobj, dev_name(dev));
+
+	kfree(device);
+
+	/* If this was the only device in the group, remove the group.
+	 * Note that we intentionally unmerge empty groups here if the
+	 * group fd isn't opened. */
+	if (list_empty(&group->device_list) && group->refcnt == 0) {
+		struct vfio_iommu *iommu = group->iommu;
+
+		if (iommu) {
+			__vfio_group_set_iommu(group, NULL);
+			__vfio_try_dissolve_iommu(iommu);
+		}
+
+		/* Groups can be mostly placeholders if setup wasn't
+		 * completed; remove them carefully. */
+		if (group->devices_kobj)
+			kobject_put(group->devices_kobj);
+		if (group->dev) {
+			device_destroy(vfio.class, group->devt);
+			idr_remove(&vfio.idr, MINOR(group->devt));
+		}
+		list_del(&group->group_next);
+		kfree(group);
+	}
+
+out:
+	mutex_unlock(&vfio.lock);
+}
+EXPORT_SYMBOL_GPL(vfio_group_del_dev);
+
+/* When a device is bound to a vfio device driver (ex. vfio-pci), this
+ * entry point is used to mark the device usable (viable).  The vfio
+ * device driver associates a private device_data struct with the device
+ * here, which will later be returned to the vfio_device_fops callbacks. */
+int vfio_bind_dev(struct device *dev, void *device_data)
+{
+	struct vfio_device *device;
+	int ret;
+
+	if (WARN_ON(!device_data))
+		return -EINVAL;
+
+	mutex_lock(&vfio.lock);
+
+	device = __vfio_lookup_dev(dev);
+
+	if (WARN_ON(!device)) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = dev_set_drvdata(dev, device);
+	if (!ret)
+		device->device_data = device_data;
+
+out:
+	mutex_unlock(&vfio.lock);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_bind_dev);
+
+/* A device is only removable if the iommu for the group is not in use. */
+static bool vfio_device_removeable(struct vfio_device *device)
+{
+	bool ret = true;
+
+	mutex_lock(&vfio.lock);
+
+	if (device->group->iommu && __vfio_iommu_inuse(device->group->iommu))
+		ret = false;
+
+	mutex_unlock(&vfio.lock);
+	return ret;
+}
+
+/* Notify vfio that a device is being unbound from the vfio device driver
+ * and return the device's private device_data pointer.  If the group is
+ * in use, we need to block or take other measures to make it safe for
+ * the device to be removed from the iommu. */
+void *vfio_unbind_dev(struct device *dev)
+{
+	struct vfio_device *device = dev_get_drvdata(dev);
+	void *device_data;
+
+	if (WARN_ON(!device))
+		return NULL;
+again:
+	if (!vfio_device_removeable(device)) {
+		/* XXX signal for all devices in group to be removed or
+		 * resort to killing the process holding the device fds.
+		 * For now just block waiting for releases to wake us. */
+		wait_event(vfio.release_q, vfio_device_removeable(device));
+	}
+
+	mutex_lock(&vfio.lock);
+
+	/* Need to re-check that the device is still removable under lock. */
+	if (device->group->iommu && __vfio_iommu_inuse(device->group->iommu)) {
+		mutex_unlock(&vfio.lock);
+		goto again;
+	}
+
+	device_data = device->device_data;
+
+	device->device_data = NULL;
+	dev_set_drvdata(dev, NULL);
+
+	mutex_unlock(&vfio.lock);
+
+	if (device->deleteme)
+		vfio_group_del_dev(dev);
+
+	return device_data;
+}
+EXPORT_SYMBOL_GPL(vfio_unbind_dev);
+
+/*
+ * Module/class support
+ */
+static void vfio_class_release(struct kref *kref)
+{
+	class_destroy(vfio.class);
+	vfio.class = NULL;
+}
+
+static char *vfio_devnode(struct device *dev, mode_t *mode)
+{
+	return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
+}
+
+static int __init vfio_init(void)
+{
+	int ret;
+
+	idr_init(&vfio.idr);
+	mutex_init(&vfio.lock);
+	INIT_LIST_HEAD(&vfio.group_list);
+	init_waitqueue_head(&vfio.release_q);
+
+	kref_init(&vfio.kref);
+	vfio.class = class_create(THIS_MODULE, "vfio");
+	if (IS_ERR(vfio.class)) {
+		ret = PTR_ERR(vfio.class);
+		goto err_class;
+	}
+
+	vfio.class->devnode = vfio_devnode;
+
+	/* FIXME - how many minors to allocate... all of them! */
+	ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio");
+	if (ret)
+		goto err_chrdev;
+
+	cdev_init(&vfio.cdev, &vfio_group_fops);
+	ret = cdev_add(&vfio.cdev, vfio.devt, MINORMASK);
+	if (ret)
+		goto err_cdev;
+
+	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
+
+	return 0;
+
+err_cdev:
+	unregister_chrdev_region(vfio.devt, MINORMASK);
+err_chrdev:
+	kref_put(&vfio.kref, vfio_class_release);
+err_class:
+	return ret;
+}
+
+static void __exit vfio_cleanup(void)
+{
+	struct list_head *gpos, *gppos;
+
+	list_for_each_safe(gpos, gppos, &vfio.group_list) {
+		struct vfio_group *group;
+		struct list_head *dpos, *dppos;
+
+		group = list_entry(gpos, struct vfio_group, group_next);
+
+		list_for_each_safe(dpos, dppos, &group->device_list) {
+			struct vfio_device *device;
+
+			device = list_entry(dpos,
+					    struct vfio_device, device_next);
+			vfio_group_del_dev(device->dev);
+		}
+	}
+
+	idr_destroy(&vfio.idr);
+	cdev_del(&vfio.cdev);
+	unregister_chrdev_region(vfio.devt, MINORMASK);
+	kref_put(&vfio.kref, vfio_class_release);
+}
+
+module_init(vfio_init);
+module_exit(vfio_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/drivers/vfio/vfio_private.h b/drivers/vfio/vfio_private.h
new file mode 100644
index 0000000..db99e75
--- /dev/null
+++ b/drivers/vfio/vfio_private.h
@@ -0,0 +1,36 @@
+/*
+ * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+
+#ifndef VFIO_PRIVATE_H
+#define VFIO_PRIVATE_H
+
+struct vfio_iommu {
+	struct iommu_domain		*domain;
+	struct bus_type			*bus;
+	struct mutex			lock;
+	struct list_head		dma_list;
+	struct mm_struct		*mm;
+	struct list_head		group_list;
+	int				refcnt;
+	bool				cache;
+};
+
+extern const struct file_operations vfio_iommu_fops;
+
+extern int vfio_release_iommu(struct vfio_iommu *iommu);
+extern void vfio_iommu_unmapall(struct vfio_iommu *iommu);
+
+#endif /* VFIO_PRIVATE_H */


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 4/5] vfio: VFIO core IOMMU mapping support
  2011-12-21 21:42 [PATCH 0/5] VFIO core framework Alex Williamson
                   ` (2 preceding siblings ...)
  2011-12-21 21:42 ` [PATCH 3/5] vfio: VFIO core group interface Alex Williamson
@ 2011-12-21 21:42 ` Alex Williamson
  2011-12-21 21:42 ` [PATCH 5/5] vfio: VFIO core Kconfig and Makefile Alex Williamson
       [not found] ` <20120110162631.GB22499@phenom.dumpdata.com>
  5 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2011-12-21 21:42 UTC (permalink / raw)
  To: chrisw, aik, david, joerg.roedel, agraf, benve, aafabbri, B08248,
	B07421, avi, kvm, qemu-devel, iommu, linux-pci, linux-kernel

Backing for operations on the IOMMU object, including DMA
mapping and unmapping.
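
As a reference, a rough userspace sketch of the map/unmap flow this
patch backs (assuming an iommu fd already obtained via
VFIO_GROUP_GET_IOMMU_FD as in the documentation patch; error handling
omitted):

	struct vfio_dma_map map = { .argsz = sizeof(map) };
	struct vfio_dma_unmap unmap = { .argsz = sizeof(unmap) };
	void *buf;

	/* iova, vaddr, and size must be aligned to the IOMMU page size */
	buf = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	map.vaddr = (unsigned long)buf;
	map.size = 1024 * 1024;
	map.iova = 0;
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	ioctl(iommu, VFIO_IOMMU_MAP_DMA, &map);

	/* Unmapping a sub-range splits the tracked mapping entry */
	unmap.iova = 0;
	unmap.size = 4096;
	unmap.flags = 0;
	ioctl(iommu, VFIO_IOMMU_UNMAP_DMA, &unmap);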

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/vfio/vfio_iommu.c |  593 +++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 593 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vfio/vfio_iommu.c

diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
new file mode 100644
index 0000000..b6644ec
--- /dev/null
+++ b/drivers/vfio/vfio_iommu.c
@@ -0,0 +1,593 @@
+/*
+ * VFIO: IOMMU DMA mapping support
+ *
+ * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
+ *     Author: Alex Williamson <alex.williamson@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ */
+
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/workqueue.h>
+
+#include "vfio_private.h"
+
+struct vfio_dma_map_entry {
+	struct list_head	list;
+	dma_addr_t		iova;		/* Device address */
+	unsigned long		vaddr;		/* Process virtual addr */
+	long			npage;		/* Number of pages */
+	int			prot;		/* IOMMU_READ/WRITE */
+};
+
+/* This code handles mapping and unmapping of user data buffers
+ * into DMA'ble space using the IOMMU */
+
+#define NPAGE_TO_SIZE(npage)	((size_t)(npage) << PAGE_SHIFT)
+
+struct vwork {
+	struct mm_struct	*mm;
+	long			npage;
+	struct work_struct	work;
+};
+
+/* delayed decrement/increment for locked_vm */
+static void vfio_lock_acct_bg(struct work_struct *work)
+{
+	struct vwork *vwork = container_of(work, struct vwork, work);
+	struct mm_struct *mm;
+
+	mm = vwork->mm;
+	down_write(&mm->mmap_sem);
+	mm->locked_vm += vwork->npage;
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+	kfree(vwork);
+}
+
+static void vfio_lock_acct(long npage)
+{
+	struct vwork *vwork;
+	struct mm_struct *mm;
+
+	if (!current->mm)
+		return; /* process exited */
+
+	if (down_write_trylock(&current->mm->mmap_sem)) {
+		current->mm->locked_vm += npage;
+		up_write(&current->mm->mmap_sem);
+		return;
+	}
+
+	/* Couldn't get mmap_sem lock, so we must set up to update
+	 * mm->locked_vm later. If locked_vm were atomic, we
+	 * wouldn't need this silliness */
+	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
+	if (!vwork)
+		return;
+	mm = get_task_mm(current);
+	if (!mm) {
+		kfree(vwork);
+		return;
+	}
+	INIT_WORK(&vwork->work, vfio_lock_acct_bg);
+	vwork->mm = mm;
+	vwork->npage = npage;
+	schedule_work(&vwork->work);
+}
+
+/* Some mappings aren't backed by a struct page, for example an mmap'd
+ * MMIO range for our own or another device.  These use a different
+ * pfn conversion and shouldn't be tracked as locked pages. */
+static bool is_invalid_reserved_pfn(unsigned long pfn)
+{
+	if (pfn_valid(pfn)) {
+		bool reserved;
+		struct page *tail = pfn_to_page(pfn);
+		struct page *head = compound_trans_head(tail);
+		reserved = !!(PageReserved(head));
+		if (head != tail) {
+			/* "head" is not a dangling pointer
+			 * (compound_trans_head takes care of that)
+			 * but the hugepage may have been split
+			 * from under us (and we may not hold a
+			 * reference count on the head page so it can
+			 * be reused before we run PageReferenced), so
+			 * we have to check PageTail before returning
+			 * what we just read.  */
+			smp_rmb();
+			if (PageTail(tail))
+				return reserved;
+		}
+		return PageReserved(tail);
+	}
+
+	return true;
+}
+
+static int put_pfn(unsigned long pfn, int prot)
+{
+	if (!is_invalid_reserved_pfn(pfn)) {
+		struct page *page = pfn_to_page(pfn);
+		if (prot & IOMMU_WRITE)
+			SetPageDirty(page);
+		put_page(page);
+		return 1;
+	}
+	return 0;
+}
+
+/* Unmap DMA region */
+static long __vfio_dma_do_unmap(struct vfio_iommu *iommu, dma_addr_t iova,
+			     long npage, int prot)
+{
+	long i, unlocked = 0;
+
+	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
+		unsigned long pfn;
+
+		pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
+		if (pfn) {
+			iommu_unmap(iommu->domain, iova, PAGE_SIZE);
+			unlocked += put_pfn(pfn, prot);
+		}
+	}
+	return unlocked;
+}
+
+static void vfio_dma_unmap(struct vfio_iommu *iommu, dma_addr_t iova,
+			   long npage, int prot)
+{
+	long unlocked;
+
+	unlocked = __vfio_dma_do_unmap(iommu, iova, npage, prot);
+	vfio_lock_acct(-unlocked);
+}
+
+/* Unmap ALL DMA regions */
+void vfio_iommu_unmapall(struct vfio_iommu *iommu)
+{
+	struct list_head *pos, *tmp;
+
+	mutex_lock(&iommu->lock);
+	list_for_each_safe(pos, tmp, &iommu->dma_list) {
+		struct vfio_dma_map_entry *dma;
+
+		dma = list_entry(pos, struct vfio_dma_map_entry, list);
+		vfio_dma_unmap(iommu, dma->iova, dma->npage, dma->prot);
+		list_del(&dma->list);
+		kfree(dma);
+	}
+	mutex_unlock(&iommu->lock);
+}
+
+static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+{
+	struct page *page[1];
+	struct vm_area_struct *vma;
+	int ret = -EFAULT;
+
+	if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+		*pfn = page_to_pfn(page[0]);
+		return 0;
+	}
+
+	down_read(&current->mm->mmap_sem);
+
+	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+
+	if (vma && vma->vm_flags & VM_PFNMAP) {
+		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+		if (is_invalid_reserved_pfn(*pfn))
+			ret = 0;
+	}
+
+	up_read(&current->mm->mmap_sem);
+
+	return ret;
+}
+
+/* Map DMA region */
+static int __vfio_dma_map(struct vfio_iommu *iommu, dma_addr_t iova,
+			  unsigned long vaddr, long npage, int prot)
+{
+	dma_addr_t start = iova;
+	long i, locked = 0;
+	int ret;
+
+	/* Verify that pages are not already mapped */
+	for (i = 0; i < npage; i++, iova += PAGE_SIZE)
+		if (iommu_iova_to_phys(iommu->domain, iova))
+			return -EBUSY;
+
+	iova = start;
+
+	if (iommu->cache)
+		prot |= IOMMU_CACHE;
+
+	/* XXX We break mappings into pages and use get_user_pages_fast to
+	 * pin the pages in memory.  It's been suggested that mlock might
+	 * provide a more efficient mechanism, but nothing prevents the
+	 * user from munlocking the pages, which could then allow the user
+	 * access to random host memory.  We also have no guarantee from the
+	 * IOMMU API that the iommu driver can unmap sub-pages of previous
+	 * mappings.  This means we might lose an entire range if a single
+	 * page within it is unmapped.  Single page mappings are inefficient,
+	 * but provide the most flexibility for now. */
+
+	for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
+		unsigned long pfn = 0;
+
+		ret = vaddr_get_pfn(vaddr, prot, &pfn);
+		if (ret) {
+			__vfio_dma_do_unmap(iommu, start, i, prot);
+			return ret;
+		}
+
+		/* Only add actual locked pages to accounting */
+		/* XXX We're effectively marking a page locked for every
+		 * IOVA page even though it's possible the user could be
+		 * backing multiple IOVAs with the same vaddr.  This over-
+		 * penalizes the user process, but we currently have no
+		 * easy way to do this properly. */
+		if (!is_invalid_reserved_pfn(pfn))
+			locked++;
+
+		ret = iommu_map(iommu->domain, iova,
+				(phys_addr_t)pfn << PAGE_SHIFT,
+				PAGE_SIZE, prot);
+		if (ret) {
+			/* Back out mappings on error */
+			put_pfn(pfn, prot);
+			__vfio_dma_do_unmap(iommu, start, i, prot);
+			return ret;
+		}
+	}
+	vfio_lock_acct(locked);
+	return 0;
+}
+
+static inline bool ranges_overlap(dma_addr_t start1, size_t size1,
+				  dma_addr_t start2, size_t size2)
+{
+	if (start1 < start2)
+		return (start2 - start1 < size1);
+	else if (start2 < start1)
+		return (start1 - start2 < size2);
+	return (size1 > 0 && size2 > 0);
+}
+
+static struct vfio_dma_map_entry *vfio_find_dma(struct vfio_iommu *iommu,
+						dma_addr_t start, size_t size)
+{
+	struct list_head *pos;
+
+	list_for_each(pos, &iommu->dma_list) {
+		struct vfio_dma_map_entry *dma;
+
+		dma = list_entry(pos, struct vfio_dma_map_entry, list);
+		if (ranges_overlap(dma->iova, NPAGE_TO_SIZE(dma->npage),
+				   start, size))
+			return dma;
+	}
+	return NULL;
+}
+
+static long vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
+				    size_t size, struct vfio_dma_map_entry *dma)
+{
+	struct vfio_dma_map_entry *split;
+	long npage_lo, npage_hi;
+
+	/* Existing dma region is completely covered, unmap all */
+	if (start <= dma->iova &&
+	    start + size >= dma->iova + NPAGE_TO_SIZE(dma->npage)) {
+		vfio_dma_unmap(iommu, dma->iova, dma->npage, dma->prot);
+		list_del(&dma->list);
+		npage_lo = dma->npage;
+		kfree(dma);
+		return npage_lo;
+	}
+
+	/* Overlap low address of existing range */
+	if (start <= dma->iova) {
+		size_t overlap;
+
+		overlap = start + size - dma->iova;
+		npage_lo = overlap >> PAGE_SHIFT;
+
+		vfio_dma_unmap(iommu, dma->iova, npage_lo, dma->prot);
+		dma->iova += overlap;
+		dma->vaddr += overlap;
+		dma->npage -= npage_lo;
+		return npage_lo;
+	}
+
+	/* Overlap high address of existing range */
+	if (start + size >= dma->iova + NPAGE_TO_SIZE(dma->npage)) {
+		size_t overlap;
+
+		overlap = dma->iova + NPAGE_TO_SIZE(dma->npage) - start;
+		npage_hi = overlap >> PAGE_SHIFT;
+
+		vfio_dma_unmap(iommu, start, npage_hi, dma->prot);
+		dma->npage -= npage_hi;
+		return npage_hi;
+	}
+
+	/* Split existing */
+	npage_lo = (start - dma->iova) >> PAGE_SHIFT;
+	npage_hi = dma->npage - (size >> PAGE_SHIFT) - npage_lo;
+
+	split = kzalloc(sizeof *split, GFP_KERNEL);
+	if (!split)
+		return -ENOMEM;
+
+	vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, dma->prot);
+
+	dma->npage = npage_lo;
+
+	split->npage = npage_hi;
+	split->iova = start + size;
+	split->vaddr = dma->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
+	split->prot = dma->prot;
+	list_add(&split->list, &iommu->dma_list);
+	return size >> PAGE_SHIFT;
+}
+
+static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
+			     struct vfio_dma_unmap *unmap)
+{
+	long ret = 0, npage = unmap->size >> PAGE_SHIFT;
+	struct list_head *pos, *tmp;
+	uint64_t mask;
+
+	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+
+	if (unmap->iova & mask)
+		return -EINVAL;
+	if (unmap->size & mask)
+		return -EINVAL;
+
+	/* XXX We still break these down into PAGE_SIZE */
+	WARN_ON(mask & PAGE_MASK);
+
+	mutex_lock(&iommu->lock);
+
+	list_for_each_safe(pos, tmp, &iommu->dma_list) {
+		struct vfio_dma_map_entry *dma;
+
+		dma = list_entry(pos, struct vfio_dma_map_entry, list);
+		if (ranges_overlap(dma->iova, NPAGE_TO_SIZE(dma->npage),
+				   unmap->iova, unmap->size)) {
+			ret = vfio_remove_dma_overlap(iommu, unmap->iova,
+						      unmap->size, dma);
+			if (ret > 0)
+				npage -= ret;
+			if (ret < 0 || npage == 0)
+				break;
+		}
+	}
+	mutex_unlock(&iommu->lock);
+	return ret > 0 ? 0 : (int)ret;
+}
+
+static int vfio_dma_do_map(struct vfio_iommu *iommu, struct vfio_dma_map *map)
+{
+	struct vfio_dma_map_entry *dma, *pdma = NULL;
+	dma_addr_t iova = map->iova;
+	unsigned long locked, lock_limit, vaddr = map->vaddr;
+	size_t size = map->size;
+	int ret = 0, prot = 0;
+	uint64_t mask;
+	long npage;
+
+	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+
+	/* READ/WRITE from device perspective */
+	if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
+		prot |= IOMMU_WRITE;
+	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
+		prot |= IOMMU_READ;
+
+	if (!prot)
+		return -EINVAL; /* No READ/WRITE? */
+
+	if (vaddr & mask)
+		return -EINVAL;
+	if (iova & mask)
+		return -EINVAL;
+	if (size & mask)
+		return -EINVAL;
+
+	/* XXX We still break these down into PAGE_SIZE */
+	WARN_ON(mask & PAGE_MASK);
+
+	/* Don't allow IOVA wrap */
+	if (iova + size && iova + size < iova)
+		return -EINVAL;
+
+	/* Don't allow virtual address wrap */
+	if (vaddr + size && vaddr + size < vaddr)
+		return -EINVAL;
+
+	npage = size >> PAGE_SHIFT;
+	if (!npage)
+		return -EINVAL;
+
+	mutex_lock(&iommu->lock);
+
+	if (vfio_find_dma(iommu, iova, size)) {
+		ret = -EBUSY;
+		goto out_lock;
+	}
+
+	/* account for locked pages */
+	locked = current->mm->locked_vm + npage;
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+		printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%lu) exceeded\n",
+			__func__, rlimit(RLIMIT_MEMLOCK));
+		ret = -ENOMEM;
+		goto out_lock;
+	}
+
+	ret = __vfio_dma_map(iommu, iova, vaddr, npage, prot);
+	if (ret)
+		goto out_lock;
+
+	/* Check if we abut a region below - nothing below 0 */
+	if (iova) {
+		dma = vfio_find_dma(iommu, iova - 1, 1);
+		if (dma && dma->prot == prot &&
+		    dma->vaddr + NPAGE_TO_SIZE(dma->npage) == vaddr) {
+
+			dma->npage += npage;
+			iova = dma->iova;
+			vaddr = dma->vaddr;
+			npage = dma->npage;
+			size = NPAGE_TO_SIZE(npage);
+
+			pdma = dma;
+		}
+	}
+
+	/* Check if we abut a region above - nothing above ~0 + 1 */
+	if (iova + size) {
+		dma = vfio_find_dma(iommu, iova + size, 1);
+		if (dma && dma->prot == prot &&
+		    dma->vaddr == vaddr + size) {
+
+			dma->npage += npage;
+			dma->iova = iova;
+			dma->vaddr = vaddr;
+
+			/* If merged above and below, remove previously
+			 * merged entry.  New entry covers it.  */
+			if (pdma) {
+				list_del(&pdma->list);
+				kfree(pdma);
+			}
+			pdma = dma;
+		}
+	}
+
+	/* Isolated, new region */
+	if (!pdma) {
+		dma = kzalloc(sizeof *dma, GFP_KERNEL);
+		if (!dma) {
+			ret = -ENOMEM;
+			vfio_dma_unmap(iommu, iova, npage, prot);
+			goto out_lock;
+		}
+
+		dma->npage = npage;
+		dma->iova = iova;
+		dma->vaddr = vaddr;
+		dma->prot = prot;
+		list_add(&dma->list, &iommu->dma_list);
+	}
+
+out_lock:
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static int vfio_iommu_release(struct inode *inode, struct file *filep)
+{
+	struct vfio_iommu *iommu = filep->private_data;
+
+	vfio_release_iommu(iommu);
+	return 0;
+}
+
+static long vfio_iommu_unl_ioctl(struct file *filep,
+				 unsigned int cmd, unsigned long arg)
+{
+	struct vfio_iommu *iommu = filep->private_data;
+	unsigned long minsz;
+
+	if (cmd == VFIO_IOMMU_GET_INFO) {
+		struct vfio_iommu_info info;
+
+		minsz = offsetofend(struct vfio_iommu_info, pgsize_bitmap);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (info.argsz < minsz)
+			return -EINVAL;
+
+		info.flags = 0;
+
+		/* XXX Need to define an interface in IOMMU API for this */
+		info.iova_min = 0;
+		info.iova_max = ~info.iova_min;
+		info.pgsize_bitmap = iommu->domain->ops->pgsize_bitmap;
+
+		return copy_to_user((void __user *)arg, &info, minsz) ?
+			-EFAULT : 0;
+
+	} else if (cmd == VFIO_IOMMU_MAP_DMA) {
+		struct vfio_dma_map map;
+		uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
+				VFIO_DMA_MAP_FLAG_WRITE;
+
+		minsz = offsetofend(struct vfio_dma_map, size);
+
+		if (copy_from_user(&map, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (map.argsz < minsz || map.flags & ~mask)
+			return -EINVAL;
+
+		return vfio_dma_do_map(iommu, &map);
+
+	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
+		struct vfio_dma_unmap unmap;
+
+		minsz = offsetofend(struct vfio_dma_unmap, size);
+
+		if (copy_from_user(&unmap, (void __user *)arg, minsz))
+			return -EFAULT;
+
+		if (unmap.argsz < minsz || unmap.flags)
+			return -EINVAL;
+
+		return vfio_dma_do_unmap(iommu, &unmap);
+	}
+
+	return -ENOTTY;
+}
+
+#ifdef CONFIG_COMPAT
+static long vfio_iommu_compat_ioctl(struct file *filep,
+				    unsigned int cmd, unsigned long arg)
+{
+	arg = (unsigned long)compat_ptr(arg);
+	return vfio_iommu_unl_ioctl(filep, cmd, arg);
+}
+#endif	/* CONFIG_COMPAT */
+
+const struct file_operations vfio_iommu_fops = {
+	.owner		= THIS_MODULE,
+	.release	= vfio_iommu_release,
+	.unlocked_ioctl	= vfio_iommu_unl_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= vfio_iommu_compat_ioctl,
+#endif
+};


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 5/5] vfio: VFIO core Kconfig and Makefile
  2011-12-21 21:42 [PATCH 0/5] VFIO core framework Alex Williamson
                   ` (3 preceding siblings ...)
  2011-12-21 21:42 ` [PATCH 4/5] vfio: VFIO core IOMMU mapping support Alex Williamson
@ 2011-12-21 21:42 ` Alex Williamson
       [not found] ` <20120110162631.GB22499@phenom.dumpdata.com>
  5 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2011-12-21 21:42 UTC (permalink / raw)
  To: chrisw, aik, david, joerg.roedel, agraf, benve, aafabbri, B08248,
	B07421, avi, kvm, qemu-devel, iommu, linux-pci, linux-kernel

Enable the base code.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 MAINTAINERS           |    8 ++++++++
 drivers/Kconfig       |    2 ++
 drivers/Makefile      |    1 +
 drivers/vfio/Kconfig  |    8 ++++++++
 drivers/vfio/Makefile |    3 +++
 5 files changed, 22 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vfio/Kconfig
 create mode 100644 drivers/vfio/Makefile

diff --git a/MAINTAINERS b/MAINTAINERS
index 9f7b469..b1f7230 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7142,6 +7142,14 @@ S:	Maintained
 F:	Documentation/filesystems/vfat.txt
 F:	fs/fat/
 
+VFIO DRIVER
+M:	Alex Williamson <alex.williamson@redhat.com>
+L:	kvm@vger.kernel.org
+S:	Maintained
+F:	Documentation/vfio.txt
+F:	drivers/vfio/
+F:	include/linux/vfio.h
+
 VIDEOBUF2 FRAMEWORK
 M:	Pawel Osciak <pawel@osciak.com>
 M:	Marek Szyprowski <m.szyprowski@samsung.com>
diff --git a/drivers/Kconfig b/drivers/Kconfig
index d5138e6..f168bf3 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -114,6 +114,8 @@ source "drivers/auxdisplay/Kconfig"
 
 source "drivers/uio/Kconfig"
 
+source "drivers/vfio/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virtio/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 4ef810e..f715919 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -59,6 +59,7 @@ obj-$(CONFIG_ATM)		+= atm/
 obj-$(CONFIG_FUSION)		+= message/
 obj-y				+= firewire/
 obj-$(CONFIG_UIO)		+= uio/
+obj-$(CONFIG_VFIO)		+= vfio/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
new file mode 100644
index 0000000..9acb1e7
--- /dev/null
+++ b/drivers/vfio/Kconfig
@@ -0,0 +1,8 @@
+menuconfig VFIO
+	tristate "VFIO Non-Privileged userspace driver framework"
+	depends on IOMMU_API
+	help
+	  VFIO provides a framework for secure userspace device drivers.
+	  See Documentation/vfio.txt for more details.
+
+	  If you don't know what to do here, say N.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
new file mode 100644
index 0000000..088faf1
--- /dev/null
+++ b/drivers/vfio/Makefile
@@ -0,0 +1,3 @@
+vfio-y := vfio_main.o vfio_iommu.o
+
+obj-$(CONFIG_VFIO) := vfio.o


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] vfio: Introduce documentation for VFIO driver
  2011-12-21 21:42 ` [PATCH 1/5] vfio: Introduce documentation for VFIO driver Alex Williamson
@ 2011-12-28 17:16   ` Ronen Hod
  2012-01-03 15:21     ` Alex Williamson
  0 siblings, 1 reply; 11+ messages in thread
From: Ronen Hod @ 2011-12-28 17:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: chrisw, aik, david, joerg.roedel, agraf, benve, aafabbri, B08248,
	B07421, avi, kvm, qemu-devel, iommu, linux-pci, linux-kernel

On 12/21/2011 11:42 PM, Alex Williamson wrote:
> Including rationale for design, example usage and API description.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
>   Documentation/vfio.txt |  352 ++++++++++++++++++++++++++++++++++++++++++++++++
>   1 files changed, 352 insertions(+), 0 deletions(-)
>   create mode 100644 Documentation/vfio.txt
>
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> new file mode 100644
> index 0000000..09a5a5b
> --- /dev/null
> +++ b/Documentation/vfio.txt
> @@ -0,0 +1,352 @@
> +VFIO - "Virtual Function I/O"[1]
> +-------------------------------------------------------------------------------
> +Many modern systems now provide DMA and interrupt remapping facilities
> +to help ensure I/O devices behave within the boundaries they've been
> +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
> +POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
> +systems such as Freescale PAMU.  The VFIO driver is an IOMMU/device
> +agnostic framework for exposing direct device access to userspace, in
> +a secure, IOMMU protected environment.  In other words, this allows
> +safe[2], non-privileged, userspace drivers.
> +
> +Why do we want that?  Virtual machines often make use of direct device
> +access ("device assignment") when configured for the highest possible
> +I/O performance.  From a device and host perspective, this simply
> +turns the VM into a userspace driver, with the benefits of
> +significantly reduced latency, higher bandwidth, and direct use of
> +bare-metal device drivers[3].
> +
> +Some applications, particularly in the high performance computing
> +field, also benefit from low-overhead, direct device access from
> +userspace.  Examples include network adapters (often non-TCP/IP based)
> +and compute accelerators.  Prior to VFIO, these drivers had to either
> +go through the full development cycle to become a proper upstream
> +driver, be maintained out of tree, or make use of the UIO framework,
> +which has no notion of IOMMU protection, limited interrupt support,
> +and requires root privileges to access things like PCI configuration
> +space.
> +
> +The VFIO driver framework intends to unify these, replacing the
> +KVM PCI-specific device assignment code and providing a more
> +secure, more featureful userspace driver environment than UIO.
> +
> +Groups, Devices, and IOMMUs
> +-------------------------------------------------------------------------------
> +
> +Userspace drivers are primarily concerned with manipulating individual
> +devices and setting up mappings in the IOMMU for those devices.
> +Unfortunately, the IOMMU doesn't always have the granularity to track
> +mappings for an individual device.  Sometimes this is a topology
> +barrier, such as a PCIe-to-PCI bridge interposing the device and
> +IOMMU, other times this is an IOMMU limitation.  In any case, the
> +reality is that devices are not always independent with respect to the
> +IOMMU.  Translations set up for one device can be used by another
> +device in these scenarios.
> +
> +The IOMMU API exposes these relationships by identifying an "IOMMU
> +group" for these dependent devices.  Devices on the same bus with the
> +same IOMMU group (or just "group" for this document) are not isolated
> +from each other with respect to DMA mappings.  For userspace usage,
> +this logically means that instead of being able to grant ownership of
> +an individual device, we must grant ownership of a group, which may
> +contain one or more devices.
> +
> +These groups therefore become a fundamental component of VFIO and the
> +working unit we use for exposing devices and granting permissions to
> +userspace.  In addition, VFIO makes efforts to ensure the integrity of
> +the group for user access.  This includes ensuring that all devices
> +within the group are controlled by VFIO (vs native host drivers)
> +before allowing a user to access any member of the group or the IOMMU
> +mappings, as well as maintaining the group viability as devices are
> +dynamically added or removed from the system.
> +
> +To access a device through VFIO, a user must open a character device
> +for the group that the device belongs to and then issue an ioctl to
> +retrieve a file descriptor for the individual device.  This ensures
> +that the user has permissions to the group (file based access to the
> +/dev entry) and allows a check point at which VFIO can deny access to
> +the device if the group is not viable (all devices within the group
> +controlled by VFIO).  A file descriptor for the IOMMU is obtained in the
> +same fashion.
> +
> +VFIO defines a standard set of APIs for access to devices and a
> +modular interface for adding new, bus-specific VFIO device drivers.
> +We call these "VFIO bus drivers".  The vfio-pci module is an example
> +of a bus driver for exposing PCI devices.  When the bus driver module
> +is loaded it enumerates all of the devices for its bus, registering
> +each device with the vfio core along with a set of callbacks.  For
> +buses that support hotplug, the bus driver also adds itself to the
> +notification chain for such events.  The callbacks registered with
> +each device implement the VFIO device access API for that bus.
> +
> +The VFIO device API includes ioctls for describing the device, the I/O
> +regions and their read/write/mmap offsets on the device descriptor, as
> +well as mechanisms for describing and registering interrupt
> +notifications.
> +
> +The VFIO IOMMU object is accessed in a similar way; an ioctl on the
> +group provides a file descriptor for programming the IOMMU.  Like
> +devices, the IOMMU file descriptor is only accessible when a group is
> +viable.  The API for the IOMMU is effectively a userspace extension of
> +the kernel IOMMU API.  The IOMMU provides an ioctl to describe the
> +IOMMU domain as well as to set up and tear down DMA mappings.  As the
> +IOMMU API is extended to support more esoteric IOMMU implementations,
> +it's expected that the VFIO interface will also evolve.
> +
> +To facilitate this evolution, all of the VFIO interfaces are designed
> +for extension.  In particular, for all structures passed via ioctl, we
> +include a structure size and flags field.  We also define the ioctl
> +request to be independent of passed structure size.  This allows us to
> +later add structure fields and define flags as necessary.  It's
> +expected that each additional field will have an associated flag to
> +indicate whether the data is valid.  Additionally, we provide an
> +"info" ioctl for each file descriptor, which allows us to flag new
> +features as they're added (ex. an IOMMU domain configuration ioctl).
> +
> +The final aspect of VFIO is the notion of merging groups.  In both the
> +assignment of devices to virtual machines and the pure userspace
> +driver model, it's expected that a single user instance is likely to
> +have multiple groups in use simultaneously.  If these groups are all
> +using the same set of IOMMU mappings, the overhead of userspace
> +setting up and tearing down the mappings, as well as the internal
> +IOMMU driver overhead of managing those mappings can be non-trivial.
> +Some IOMMU implementations are able to easily reduce this overhead by
> +simply using the same set of page tables across multiple groups.
> +VFIO allows users to take advantage of this option by merging groups
> +together, effectively creating a super group (IOMMU groups only define
> +the minimum granularity).
> +
> +A user can attempt to merge groups together by calling the merge ioctl
> +on one group (the "merger") and passing a file descriptor for the
> +group to be merged in (the "mergee").  Note that existing DMA mappings
> +cannot be atomically merged between groups; it's therefore a
> +requirement that the mergee group is not in use.  This is enforced by
> +not allowing open device or iommu file descriptors on the mergee group
> +at the time of merging.  The merger group can be actively in use at
> +the time of merging.  Likewise, to unmerge a group, none of the device
> +file descriptors for the group being removed can be in use.  The
> +remaining merged group can be actively in use.
> +

Can you elaborate on the scenario that would lead a user to merge groups?
Does it make sense to try to "automatically" merge a (new) group with 
all the existing groups sometime prior to its first device open?

As always, it is a pleasure to read your documentation.
Ronen.

> +If the groups cannot be merged, the ioctl will fail and the user will
> +need to manage the groups independently.  Users should have no
> +expectation for group merging to be successful.  Some platforms may
> +not support it at all, others may only enable merging of sufficiently
> +similar groups.  If the ioctl succeeds, then the group file
> +descriptors are effectively fungible between the groups.  That is,
> +instead of their actions being isolated to the individual group, each
> +of them is a gateway into the combined, merged group.  For instance,
> +retrieving an IOMMU file descriptor from any group returns a reference
> +to the same object, mappings to that IOMMU descriptor are visible to
> +all devices in the merged group, and device descriptors can be
> +retrieved for any device in the merged group from any one of the group
> +file descriptors.  In effect, a user can manage devices and the IOMMU
> +of a merged group using a single file descriptor (saving the merged
> +groups' file descriptors away only for unmerging) without the
> +permission complications of creating a separate "super group" character
> +device.
> +
> +VFIO Usage Example
> +-------------------------------------------------------------------------------
> +
> +Assume a user wants to access PCI device 0000:06:0d.0
> +
> +$ cat /sys/bus/pci/devices/0000:06:0d.0/iommu_group
> +240
> +
> +Since this device is on the "pci" bus, the user can then find the
> +character device for interacting with the VFIO group as:
> +
> +$ ls -l /dev/vfio/pci:240
> +crw-rw---- 1 root root 252, 27 Dec 15 15:13 /dev/vfio/pci:240
> +
> +We can also examine other members of the group through sysfs:
> +
> +$ ls -l /sys/devices/virtual/vfio/pci:240/devices/
> +total 0
> +lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.0 -> \
> +		../../../../pci0000:00/0000:00:1e.0/0000:06:0d.0
> +lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.1 -> \
> +		../../../../pci0000:00/0000:00:1e.0/0000:06:0d.1
> +
> +This group therefore contains two devices[4].  VFIO will prevent
> +device or iommu manipulation unless all group members are attached to
> +the vfio bus driver, so we simply unbind the devices from their
> +current driver and rebind them to vfio:
> +
> +# for i in /sys/devices/virtual/vfio/pci:240/devices/*; do
> +	dir=$(readlink -f $i)
> +	if [ -L $dir/driver ]; then
> +		echo $(basename $i) > $dir/driver/unbind
> +	fi
> +	vendor=$(cat $dir/vendor)
> +	device=$(cat $dir/device)
> +	echo $vendor $device > /sys/bus/pci/drivers/vfio/new_id
> +	echo $(basename $i) > /sys/bus/pci/drivers/vfio/bind
> +done
> +
> +# chown user:user /dev/vfio/pci:240
> +
> +The user now has full access to all the devices and the iommu for this
> +group and can access them as follows:
> +
> +	int group, iommu, device, i;
> +	struct vfio_group_info group_info = { .argsz = sizeof(group_info) };
> +	struct vfio_iommu_info iommu_info = { .argsz = sizeof(iommu_info) };
> +	struct vfio_dma_map dma_map = { .argsz = sizeof(dma_map) };
> +	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
> +
> +	/* Open the group */
> +	group = open("/dev/vfio/pci:240", O_RDWR);
> +
> +	/* Test the group is viable and available */
> +	ioctl(group, VFIO_GROUP_GET_INFO, &group_info);
> +
> +	if (!(group_info.flags & VFIO_GROUP_FLAGS_VIABLE))
> +		/* Group is not viable */
> +
> +	if ((group_info.flags & VFIO_GROUP_FLAGS_MM_LOCKED))
> +		/* Already in use by someone else */
> +
> +	/* Get a file descriptor for the IOMMU */
> +	iommu = ioctl(group, VFIO_GROUP_GET_IOMMU_FD);
> +
> +	/* Test the IOMMU is what we expect */
> +	ioctl(iommu, VFIO_IOMMU_GET_INFO, &iommu_info);
> +
> +	/* Allocate some space and setup a DMA mapping */
> +	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
> +			     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
> +	dma_map.size = 1024 * 1024;
> +	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
> +	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> +
> +	ioctl(iommu, VFIO_IOMMU_MAP_DMA, &dma_map);
> +
> +	/* Get a file descriptor for the device */
> +	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
> +
> +	/* Test and setup the device */
> +	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
> +
> +	for (i = 0; i < device_info.num_regions; i++) {
> +		struct vfio_region_info reg = { .argsz = sizeof(reg) };
> +
> +		reg.index = i;
> +
> +		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
> +
> +		/* Setup mappings... read/write offsets, mmaps
> +		 * For PCI devices, config space is a region */
> +	}
> +
> +	for (i = 0; i < device_info.num_irqs; i++) {
> +		struct vfio_irq_info irq = { .argsz = sizeof(irq) };
> +
> +		irq.index = i;
> +
> +		ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq);
> +
> +		/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQ_EVENTFDS */
> +	}
> +
> +	/* Gratuitous device reset and go... */
> +	ioctl(device, VFIO_DEVICE_RESET);
> +
> +VFIO User API
> +-------------------------------------------------------------------------------
> +
> +Please see include/linux/vfio.h for complete API documentation.
> +
> +VFIO bus driver API
> +-------------------------------------------------------------------------------
> +
> +Bus drivers, such as PCI, have three jobs:
> + 1) Add/remove devices from vfio
> + 2) Provide vfio_device_ops for device access
> + 3) Device binding and unbinding
> +
> +When initialized, the bus driver should enumerate the devices on its
> +bus and call vfio_group_add_dev() for each device.  If the bus
> +supports hotplug, notifiers should be enabled to track devices being
> +added and removed.  vfio_group_del_dev() removes a previously added
> +device from vfio.
> +
> +extern int vfio_group_add_dev(struct device *dev,
> +                              const struct vfio_device_ops *ops);
> +extern void vfio_group_del_dev(struct device *dev);
> +
> +Adding a device registers a vfio_device_ops function pointer structure
> +for the device:
> +
> +struct vfio_device_ops {
> +	bool	(*match)(struct device *dev, char *buf);
> +	int	(*claim)(struct device *dev);
> +	int	(*open)(void *device_data);
> +	void	(*release)(void *device_data);
> +	ssize_t	(*read)(void *device_data, char __user *buf,
> +			size_t count, loff_t *ppos);
> +	ssize_t	(*write)(void *device_data, const char __user *buf,
> +			 size_t size, loff_t *ppos);
> +	long	(*ioctl)(void *device_data, unsigned int cmd,
> +			 unsigned long arg);
> +	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
> +};
> +
> +For buses supporting hotplug, all functions are required to be
> +implemented.  Non-hotplug buses do not need to implement claim().
> +
> +match() provides a device specific method for associating a struct
> +device to a user provided string.  Many drivers may simply strcmp the
> +buffer to dev_name().
> +
> +claim() is used when a device is hot-added to a group that is already
> +in use.  This is how VFIO requests that a bus driver manually take
> +ownership of a device.  The expected call path for this is triggered
> +from the bus add notifier.  The bus driver calls vfio_group_add_dev for
> +the newly added device; vfio-core determines the group is already in
> +use and calls claim on the bus driver.  This triggers the bus driver
> +to call its own probe function, including calling vfio_bind_dev to
> +mark the device as controlled by vfio.  The device is then available
> +for use by the group.
> +
> +The remaining vfio_device_ops are similar to a simplified struct
> +file_operations except a device_data pointer is provided rather than a
> +file pointer.  The device_data is an opaque structure registered by
> +the bus driver when a device is bound to the vfio bus driver:
> +
> +extern int vfio_bind_dev(struct device *dev, void *device_data);
> +extern void *vfio_unbind_dev(struct device *dev);
> +
> +When the device is unbound from the driver, the bus driver will call
> +vfio_unbind_dev() which will return the device_data for any bus driver
> +specific cleanup and freeing of the structure.  The vfio_unbind_dev
> +call may block if the group is currently in use.
> +
> +-------------------------------------------------------------------------------
> +
> +[1] VFIO was originally an acronym for "Virtual Function I/O" in its
> +initial implementation by Tom Lyon while at Cisco.  We've since
> +outgrown the acronym, but it's catchy.
> +
> +[2] "safe" also depends upon a device being "well behaved".  It's
> +possible for multi-function devices to have backdoors between
> +functions and even for single function devices to have alternative
> +access to things like PCI config space through MMIO registers.  To
> +guard against the former we can include additional precautions in the
> +IOMMU driver to group multi-function PCI devices together
> +(iommu=group_mf).  The latter we can't prevent, but the IOMMU should
> +still provide isolation.  For PCI, Virtual Functions are the best
> +indicator of "well behaved", as these are designed for virtualization
> +usage models.
> +
> +[3] As always there are trade-offs to virtual machine device
> +assignment that are beyond the scope of VFIO.  It's expected that
> +future IOMMU technologies will reduce some, but maybe not all, of
> +these trade-offs.
> +
> +[4] In this case the device is below a PCI bridge:
> +
> +-[0000:00]-+-1e.0-[06]--+-0d.0
> +                        \-0d.1
> +
> +00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
>
>


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Qemu-devel] [PATCH 1/5] vfio: Introduce documentation for VFIO driver
  2011-12-28 17:16   ` [Qemu-devel] " Ronen Hod
@ 2012-01-03 15:21     ` Alex Williamson
  0 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2012-01-03 15:21 UTC (permalink / raw)
  To: Ronen Hod
  Cc: chrisw, aik, david, joerg.roedel, agraf, benve, aafabbri, B08248,
	B07421, avi, kvm, qemu-devel, iommu, linux-pci, linux-kernel

On Wed, 2011-12-28 at 19:16 +0200, Ronen Hod wrote:
> On 12/21/2011 11:42 PM, Alex Williamson wrote:
> > +The final aspect of VFIO is the notion of merging groups.  In both the
> > +assignment of devices to virtual machines and the pure userspace
> > +driver model, it's expected that a single user instance is likely to
> > +have multiple groups in use simultaneously.  If these groups are all
> > +using the same set of IOMMU mappings, the overhead of userspace
> > +setting up and tearing down the mappings, as well as the internal
> > +IOMMU driver overhead of managing those mappings can be non-trivial.
> > +Some IOMMU implementations are able to easily reduce this overhead by
> > +simply using the same set of page tables across multiple groups.
> > +VFIO allows users to take advantage of this option by merging groups
> > +together, effectively creating a super group (IOMMU groups only define
> > +the minimum granularity).
> > +
> > +A user can attempt to merge groups together by calling the merge ioctl
> > +on one group (the "merger") and passing a file descriptor for the
> > +group to be merged in (the "mergee").  Note that existing DMA mappings
> > +cannot be atomically merged between groups, it's therefore a
> > +requirement that the mergee group is not in use.  This is enforced by
> > +not allowing open device or iommu file descriptors on the mergee group
> > +at the time of merging.  The merger group can be actively in use at
> > +the time of merging.  Likewise, to unmerge a group, none of the device
> > +file descriptors for the group being removed can be in use.  The
> > +remaining merged group can be actively in use.
> > +
> 
> Can you elaborate on the scenario that would lead a user to merge groups?

Anytime a user is managing multiple groups with the same set of iommu
mappings, they probably want to try to merge the groups so they only
have to program the iommu once, instead of once per group.  This works
well in the qemu/x86 device assignment model where we map full guest
memory into the iommu to allow transparent device assignment, ie. the
guest doesn't need to be enlightened or forced to use a specific DMA
mechanism.  If instead we exposed an emulated or paravirtualized iommu
and expected the guest to make use of it to limit the number of pinned
guest pages, it might make sense to expose that per group, especially if
the I/O virtual address window is small or lives at different offsets,
so merging wouldn't make sense then.  I'll add something to this effect
in the documentation.
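
For concreteness, the merge itself is just an ioctl on the merger group
passing the mergee's file descriptor, roughly like this (group paths as
in the documentation example; the second group id is hypothetical):

	int group_a, group_b, iommu;

	group_a = open("/dev/vfio/pci:240", O_RDWR);
	group_b = open("/dev/vfio/pci:241", O_RDWR);	/* hypothetical */

	/* The mergee (group_b) must have no open device/iommu fds */
	if (ioctl(group_a, VFIO_GROUP_MERGE, &group_b) < 0)
		/* fall back to managing the groups independently */;

	/* Either group fd now returns the same iommu object */
	iommu = ioctl(group_a, VFIO_GROUP_GET_IOMMU_FD);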

> Does it make sense to try to "automatically" merge a (new) group with 
> all the existing groups sometime prior to its first device open?

I don't think so.  As above, it would assume a usage model.

> As always, it is a pleasure to read your documentation.

Thanks!

Alex



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/5] VFIO core framework
       [not found] ` <20120110162631.GB22499@phenom.dumpdata.com>
@ 2012-01-10 18:35   ` Alex Williamson
  2012-01-12 20:56     ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 11+ messages in thread
From: Alex Williamson @ 2012-01-10 18:35 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: anthony.perard, chrisw, aik, david, joerg.roedel, agraf, benve,
	aafabbri, B08248, B07421, avi, kvm, qemu-devel, iommu, linux-pci,
	linux-kernel

On Tue, 2012-01-10 at 11:26 -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Dec 21, 2011 at 02:42:02PM -0700, Alex Williamson wrote:
> > This series includes the core framework for the VFIO driver.
> > VFIO is a userspace driver interface meant to replace both the
> > KVM device assignment code as well as interfaces like UIO.  Please
> > see patch 1/5 for a complete description of VFIO, what it can do,
> > and how it's designed.
> > 
> > This version and the VFIO PCI bus driver, for exposing PCI devices
> > through VFIO, can be found here:
> > 
> > git://github.com/awilliam/linux-vfio.git vfio-next-20111221
> > 
> > A development version of qemu which includes a full working
> > vfio-pci driver, independent of KVM support, can be found here:
> > 
> > git://github.com/awilliam/qemu-vfio.git vfio-ng
> > 
> > Thanks,
> 
> Alex,
> 
> So I took a look at the patchset with two different things in mind this time:
>  - What if you do not need to do any IRQ ack/de-ack etc in the host all of that
>    is done in the guest (say you have an actual IOAPIC in the guest that is
>    _not_ managed by QEMU).
>  - What would be required to make this work with a different hypervisor - say Xen.
> 
> And the conclusion I came to is that it would require some surgery - especially
> as some of the IRQ, irqfd, etc. code support is not required per se.
> 
> To me it seems to get this working with Xen (or perhaps with the Power machines
> as well, as their hypervisor is similar to Xen in architecture?) we would need at
> least two extra pieces of Linux kernel code: 
> - Xen IOMMU, which really just does a whole bunch of xc_domain_memory_mapping
>   calls for the user-space iova requests. For normal PCI device operations it
>   would just offload them to the existing DMA API.
> - Xen VFIO PCI. Or at least make the VFIO PCI (in your vfio-next-20111221 branch)
>   driver allow some abstraction. There are certain things we might do via
>   alternate operations, such as the interrupt handling - where we "bind" the
>   IRQ to an event channel or make a hypercall to program the guest's MSI
>   vectors. Perhaps there can be a "platform-specific" part of it.

Sure, I've envisioned that we'll have multiple iommu interfaces.  We'll
need build-time and run-time selection.  I haven't implemented that yet
since the iommu requirements are still developing.  Likewise, a
vfio-xen-pci module is possible or we can look at whether we make the
vfio-pci code too ugly by incorporating a dual-mode into that.

> In the userland:
>  - In QEMU VFIO, make the interrupt handling optional for certain setups (like when we don't
>    expect an IRQ to happen in the host).

Or can it be handled by vfio-xen-pci, which enables event channels
through to xen?  It's possible the GET_IRQ_INFO ioctls could report a
flag indicating the type of notification available (eventfds being the
initial option) and SET_IRQ_EVENTFDS could be generalized to take an
array of structs other than eventfds.  For the non-Xen case, eventfds
seem to provide us with the most flexibility since we can either connect
them to userspace or just have userspace be the agent that connects the
eventfd to an irqfd in another module.  See the (outdated) version of
qemu-kvm vfio in this tree for an example (look for QEMU_KVM_BUILD):
https://github.com/awilliam/qemu-kvm-vfio/blob/vfio/hw/vfio.c
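
To sketch what that generalization might look like (the ioctl name is
taken from the documentation example; the device fd comes from
VFIO_GROUP_GET_DEVICE_FD, and the argument layout here is only an
assumption for illustration, not the final API):

	int efd = eventfd(0, 0);

	struct {
		__u32 argsz;
		__u32 flags;
		__u32 index;	/* IRQ index from GET_IRQ_INFO */
		__s32 eventfds[1];
	} set = { .argsz = sizeof(set), .index = 0, .eventfds = { efd } };

	ioctl(device, VFIO_DEVICE_SET_IRQ_EVENTFDS, &set);

	/* Userspace, or an irqfd consumer handed this eventfd, then
	 * reads efd to observe device interrupts */
	uint64_t count;
	read(efd, &count, sizeof(count));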

> I am curious to see how the Power folks have to deal with this. Perhaps the
> requirement to write a PV IOMMU is not something they need?
> 
> In terms of this patchset, the "big" thing for me is that it moves the usual mechanism
> of "unbind"/"bind" from using sysfs to being done via ioctls. I get the reasoning for it
> - cannot guarantee any locking, but doing it all in ioctls instead of configfs or sysfs
> seems odd. But perhaps that is just me having gotten used to doing it in sysfs/configfs.
> Certainly it makes it easier to program in QEMU/libvirt. And ultimately that is going
> to be the user for 99% of this.

Can you be more specific about which ioctl part you're referring to?  We
bind/unbind each device to vfio-pci via the normal sysfs driver
interfaces.  Userspace binds itself to a group via ioctls, but that's
> because neither configfs nor sysfs allows ioctls and I don't think it's
possible to implement an ioctl-free vfio.  Trying to implement vfio
across both configfs and chardev presents issues with ownership.

> The requirement of the VFIO PCI driver to deal with all of the nasty work-arounds for
> devices is nice. I do like the separation - where this driver (VFIO core) deals
> with _just_ the user facing portion. And the backends (just one right now - VFIO PCI)
> gets to play with all the real hardware details.

Yep, and the iommu layer is intended to be the same, but is maybe not
quite as evolved yet.

> So curious if your perception of this is similar to mine or if I have missed
> something?

It seems like we have options for dealing with it via separate or
modified iommu/device vfio modules and some tweaks to some of the
ioctls.  Maybe I'm oversimplifying the xen requirements?  Thanks for the
review and comments,

Alex


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/5] VFIO core framework
  2012-01-10 18:35   ` [PATCH 0/5] VFIO core framework Alex Williamson
@ 2012-01-12 20:56     ` Konrad Rzeszutek Wilk
  2012-01-13 22:21       ` Alex Williamson
  0 siblings, 1 reply; 11+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-01-12 20:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: anthony.perard, chrisw, aik, david, joerg.roedel, agraf, benve,
	aafabbri, B08248, B07421, avi, kvm, qemu-devel, iommu, linux-pci,
	linux-kernel

On Tue, Jan 10, 2012 at 11:35:54AM -0700, Alex Williamson wrote:
> On Tue, 2012-01-10 at 11:26 -0500, Konrad Rzeszutek Wilk wrote:
> > On Wed, Dec 21, 2011 at 02:42:02PM -0700, Alex Williamson wrote:
> > > This series includes the core framework for the VFIO driver.
> > > VFIO is a userspace driver interface meant to replace both the
> > > KVM device assignment code as well as interfaces like UIO.  Please
> > > see patch 1/5 for a complete description of VFIO, what it can do,
> > > and how it's designed.
> > > 
> > > This version and the VFIO PCI bus driver, for exposing PCI devices
> > > through VFIO, can be found here:
> > > 
> > > git://github.com/awilliam/linux-vfio.git vfio-next-20111221
> > > 
> > > A development version of qemu which includes a full working
> > > vfio-pci driver, independent of KVM support, can be found here:
> > > 
> > > git://github.com/awilliam/qemu-vfio.git vfio-ng
> > > 
> > > Thanks,
> > 
> > Alex,
> > 
> > So I took a look at the patchset with two different things in mind this time:
> >  - What if you do not need to do any IRQ ack/de-ack etc. in the host; all of that
> >    is done in the guest (say you have an actual IOAPIC in the guest that is
> >    _not_ managed by QEMU).
> >  - What would be required to make this work with a different hypervisor - say Xen.
> > 
> > And the conclusion I came to is that it would require some surgery - especially
> > as some of the IRQ, irqfds, etc. code support is not required per se.
> > 
> > To me it seems to get this working with Xen (or perhaps with the Power machines
> > as well, as their hypervisor is similar to Xen in architecture?) we would need at
> > least two extra pieces of Linux kernel code: 
> > - Xen IOMMU, which really is just doing a whole bunch of xc_domain_memory_mapping
> >   calls for the user-space iovas. For normal PCI device operations it would just
> >   offload them to the existing DMA API.
> > - Xen VFIO PCI. Or at least make the VFIO PCI (in your vfio-next-20111221 branch)
> >   driver allow some abstraction. There are certain things we might do via alternate
> >   operations. Such as the interrupt handling - where we "bind" the IRQ to an event
> >   channel or make a hypercall to program the guest's MSI vectors. Perhaps there can
> >   be a "platform-specific" part of it.
> 
> Sure, I've envisioned that we'll have multiple iommu interfaces.  We'll
> need build-time and run-time selection.  I haven't implemented that yet
> since the iommu requirements are still developing.  Likewise, a
> vfio-xen-pci module is possible or we can look at whether we make the
> vfio-pci code too ugly by incorporating a dual-mode into that.

Yuck. Well, I am all up for making it pretty.

> 
> > In the userland:
> >  - In QEMU VFIO, make the interrupt part optional for certain parts (like we don't
> >    expect an IRQ to happen in the host).
> 
> Or can it be handled by vfio-xen-pci, which enables event channels
> through to xen?  It's possible the GET_IRQ_INFO ioctls could report a

Sure.
> flag indicating the type of notification available (eventfds being the
> initial option) and SET_IRQ_EVENTFDS could be generalized to take an
> array of structs other than eventfds.  For the non-Xen case, eventfds
> seem to provide us with the most flexibility since we can either connect
> them to userspace or just have userspace be the agent that connects the
> eventfd to an irqfd in another module.  See the (outdated) version of
> qemu-kvm vfio in this tree for an example (look for QEMU_KVM_BUILD):
> https://github.com/awilliam/qemu-kvm-vfio/blob/vfio/hw/vfio.c

Ah I see.
> 
> > I am curious to see how the Power folks have to deal with this? Perhaps the requirement
> > to write a PV IOMMU is not something they need to write?
> > 
> > In terms of this patchset, the "big" thing for me is that it moves the usual mechanism
> > of "unbind"/"bind" using sysfs to be done via ioctls. I get the reasoning for it
> > - cannot guarantee any locking, but doing it all in ioctls instead of configfs or sysfs
> > seems odd. But perhaps that is just me having gotten used to doing it in sysfs/configfs.
> > Certainly it makes it easier to program in QEMU/libvirt. And ultimately that is going
> > to be the user for 99% of this.
> 
> Can you be more specific about which ioctl part you're referring to?  We
> bind/unbind each device to vfio-pci via the normal sysfs driver

Let me look again at the QEMU changes. I was thinking you did a bunch
of ioctls to assign a device, but I am probably getting it confused
with the vfio-group ioctls.

> interfaces.  Userspace binds itself to a group via ioctls, but that's
> because neither configfs nor sysfs allows ioctls and I don't think it's
> possible to implement an ioctl-free vfio.  Trying to implement vfio
> across both configfs and chardev presents issues with ownership.

Right, one of them works. No need to do it across different subsystems.
> 
> > The requirement of the VFIO PCI driver to deal with all of the nasty work-arounds for
> > devices is nice. I do like the separation - where this driver (VFIO core) deals
> > with _just_ the user-facing portion. And the backends (just one right now - VFIO PCI)
> > get to play with all the real hardware details.
> 
> Yep, and the iommu layer is intended to be the same, but is maybe not
> quite as evolved yet.
> 
> > So curious if your perception of this is similar to mine or if I had missed
> > something?
> 
> It seems like we have options for dealing with it via separate or
> modified iommu/device vfio modules and some tweaks to some of the
> ioctls.  Maybe I'm oversimplifying the xen requirements?  Thanks for the

Those are the broad changes. Though I am sure that once coding starts
we will find some new things. Hopefully they will all fit within these APIs.

> review and comments,
> 
> Alex

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/5] VFIO core framework
  2012-01-12 20:56     ` Konrad Rzeszutek Wilk
@ 2012-01-13 22:21       ` Alex Williamson
  0 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2012-01-13 22:21 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: anthony.perard, chrisw, aik, david, joerg.roedel, agraf, benve,
	aafabbri, B08248, B07421, avi, kvm, qemu-devel, iommu, linux-pci,
	linux-kernel

On Thu, 2012-01-12 at 15:56 -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 10, 2012 at 11:35:54AM -0700, Alex Williamson wrote:
> > On Tue, 2012-01-10 at 11:26 -0500, Konrad Rzeszutek Wilk wrote:
> > > On Wed, Dec 21, 2011 at 02:42:02PM -0700, Alex Williamson wrote:
> > > > This series includes the core framework for the VFIO driver.
> > > > VFIO is a userspace driver interface meant to replace both the
> > > > KVM device assignment code as well as interfaces like UIO.  Please
> > > > see patch 1/5 for a complete description of VFIO, what it can do,
> > > > and how it's designed.
> > > > 
> > > > This version and the VFIO PCI bus driver, for exposing PCI devices
> > > > through VFIO, can be found here:
> > > > 
> > > > git://github.com/awilliam/linux-vfio.git vfio-next-20111221
> > > > 
> > > > A development version of qemu which includes a full working
> > > > vfio-pci driver, independent of KVM support, can be found here:
> > > > 
> > > > git://github.com/awilliam/qemu-vfio.git vfio-ng
> > > > 
> > > > Thanks,
> > > 
> > > Alex,
> > > 
> > > So I took a look at the patchset with two different things in mind this time:
> > >  - What if you do not need to do any IRQ ack/de-ack etc. in the host; all of that
> > >    is done in the guest (say you have an actual IOAPIC in the guest that is
> > >    _not_ managed by QEMU).
> > >  - What would be required to make this work with a different hypervisor - say Xen.
> > > 
> > > And the conclusion I came to is that it would require some surgery - especially
> > > as some of the IRQ, irqfds, etc. code support is not required per se.
> > > 
> > > To me it seems to get this working with Xen (or perhaps with the Power machines
> > > as well, as their hypervisor is similar to Xen in architecture?) we would need at
> > > least two extra pieces of Linux kernel code: 
> > > - Xen IOMMU, which really is just doing a whole bunch of xc_domain_memory_mapping
> > >   calls for the user-space iovas. For normal PCI device operations it would just
> > >   offload them to the existing DMA API.
> > > - Xen VFIO PCI. Or at least make the VFIO PCI (in your vfio-next-20111221 branch)
> > >   driver allow some abstraction. There are certain things we might do via alternate
> > >   operations. Such as the interrupt handling - where we "bind" the IRQ to an event
> > >   channel or make a hypercall to program the guest's MSI vectors. Perhaps there can
> > >   be a "platform-specific" part of it.
> > 
> > Sure, I've envisioned that we'll have multiple iommu interfaces.  We'll
> > need build-time and run-time selection.  I haven't implemented that yet
> > since the iommu requirements are still developing.  Likewise, a
> > vfio-xen-pci module is possible or we can look at whether we make the
> > vfio-pci code too ugly by incorporating a dual-mode into that.
> 
> Yuck. Well, I am all up for making it pretty.
> 
> > 
> > > In the userland:
> > >  - In QEMU VFIO, make the interrupt part optional for certain parts (like we don't
> > >    expect an IRQ to happen in the host).
> > 
> > Or can it be handled by vfio-xen-pci, which enables event channels
> > through to xen?  It's possible the GET_IRQ_INFO ioctls could report a
> 
> Sure.
> > flag indicating the type of notification available (eventfds being the
> > initial option) and SET_IRQ_EVENTFDS could be generalized to take an
> > array of structs other than eventfds.  For the non-Xen case, eventfds
> > seem to provide us with the most flexibility since we can either connect
> > them to userspace or just have userspace be the agent that connects the
> > eventfd to an irqfd in another module.  See the (outdated) version of
> > qemu-kvm vfio in this tree for an example (look for QEMU_KVM_BUILD):
> > https://github.com/awilliam/qemu-kvm-vfio/blob/vfio/hw/vfio.c
> 
> Ah I see.

Here's how I'm thinking of reworking the IRQ ioctls; this should allow
more generic future extensions and consolidate what we have now:

VFIO_DEVICE_GET_IRQ_INFO
	_IOWR(VFIO_TYPE, VFIO_BASE + 10, struct vfio_irq_info)

struct vfio_irq_info {
        __u32   argsz;
        __u32   flags;
#define VFIO_IRQ_INFO_EVENTFD           (1 << 0) /* Supports eventfd signals */
#define VFIO_IRQ_INFO_MASKABLE          (1 << 1) /* Supports un/masking */
#define VFIO_IRQ_INFO_AUTOMASKED        (1 << 2) /* Auto masked after trigger */
        __u32   index;          /* IRQ index */
        __u32   count;          /* Number of IRQs within this index */
};

I previously had a LEVEL flag here, but that just means that it's
automatically masked and needs to be unmasked to re-trigger, so we might
as well just name it AUTOMASKED.  We can also add a bit for MASKABLE,
which enables us to do things like optionally supporting MSI/MSI-X
masking (level triggered would be expected to set both MASKABLE &
AUTOMASKED).  The EVENTFD flag indicates that the index supports
eventfd-based triggering, so we can clear that and add another flag if
we need to use a different mechanism.
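
To sanity check the semantics, here's a rough userspace sketch of
probing an index under this proposal.  It assumes the defines above
eventually land in a <linux/vfio.h>; index 0 and the error values are
illustrative only:

#include <errno.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>         /* assumed home of the defines above */

/* Probe IRQ index 0 of an open vfio device fd; returns the number of
 * IRQs within the index and notes level-like (automasked) behavior. */
static int probe_irq_index0(int device, bool *automasked)
{
        struct vfio_irq_info info = {
                .argsz  = sizeof(info),
                .index  = 0,            /* e.g. INTx on vfio-pci */
        };

        if (ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &info))
                return -errno;

        if (!(info.flags & VFIO_IRQ_INFO_EVENTFD))
                return -EINVAL;         /* eventfds are the only mechanism so far */

        *automasked = !!(info.flags & VFIO_IRQ_INFO_AUTOMASKED);
        return info.count;
}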

VFIO_DEVICE_SET_IRQS
	_IOW(VFIO_TYPE, VFIO_BASE + 11, struct vfio_irq_set)

struct vfio_irq_set {
        __u32   argsz;
        __u32   flags;
#define VFIO_IRQ_SET_SINGLE             (1 << 0) /* count = subindex */
#define VFIO_IRQ_SET_MASK               (1 << 1) /* mask/unmask irq(s) */
#define VFIO_IRQ_SET_TRIGGER_EVENTFD    (1 << 2) /* Set eventfds for trigger */
#define VFIO_IRQ_SET_MASK_EVENTFD       (1 << 3) /* Set eventfds for mask */
#define VFIO_IRQ_SET_UNMASK_EVENTFD     (1 << 4) /* Set eventfds for unmask */
        __u32   index;          /* IRQ index */
        __s32   count;          /* Number of data array elements */
        __u8    data[];         /* Flag specific data array */
};

Here the caller now needs to set a flag indicating what they're setting.
SINGLE is a modifier flag; when set, we use the count field as the
subindex of the interrupt index block and only apply the data to that
entry.  When clear, count indicates the size of the data array and each
entry is applied to the corresponding subindex.

MASK is an action flag; when set, data is a u8 indicating 0/1 mask state
(this replaces VFIO_DEVICE_UNMASK_IRQ).

TRIGGER_EVENTFD, MASK_EVENTFD, and UNMASK_EVENTFD are also action flags.
When these are used, data is an s32 indicating the eventfd to use for the
indicated action.  I imagine that we'll eventually be able to tie all of
these directly to kvm with eventfd<->irqfd binding to avoid kernel
exits.

Only one "action" flag is allowed, "modifier" flags can be used in
conjunction with action flags.  

I'm wondering if EVENTFD should be a modifier here so that we don't have
to duplicate trigger, mask, and unmask for every signaling type that we
want to use.  I'm also considering whether we need an ALL in addition to
SINGLE to facilitate anything.  I currently use count = 0 to tear down
the setup.

Would something like this be useful so that we can more easily
incorporate new signaling mechanisms?
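
For concreteness, a caller attaching a single trigger eventfd might do
something like the following (same assumed headers as the sketch above,
plus <stdlib.h> and <sys/eventfd.h>; the layout tracks the proposal and
is in no way final):

/* Sketch: attach one eventfd as the trigger for index 0, subindex 0.
 * SINGLE is the modifier, TRIGGER_EVENTFD the action, and the single
 * s32 data[] element carries the eventfd. */
static int set_trigger_eventfd(int device)
{
        size_t sz = sizeof(struct vfio_irq_set) + sizeof(__s32);
        struct vfio_irq_set *set = malloc(sz);
        int ret;

        set->argsz = sz;
        set->flags = VFIO_IRQ_SET_SINGLE | VFIO_IRQ_SET_TRIGGER_EVENTFD;
        set->index = 0;
        set->count = 0;                 /* subindex, since SINGLE is set */
        *(__s32 *)set->data = eventfd(0, 0);

        ret = ioctl(device, VFIO_DEVICE_SET_IRQS, set);
        free(set);
        return ret;
}

A MASK action would have the same shape, with a u8 data element in
place of the s32.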

> > 
> > > I am curious to see how the Power folks have to deal with this? Perhaps the requirement
> > > to write a PV IOMMU is not something they need to write?
> > > 
> > > In terms of this patchset, the "big" thing for me is that it moves the usual mechanism
> > > of "unbind"/"bind" using sysfs to be done via ioctls. I get the reasoning for it
> > > - cannot guarantee any locking, but doing it all in ioctls instead of configfs or sysfs
> > > seems odd. But perhaps that is just me having gotten used to doing it in sysfs/configfs.
> > > Certainly it makes it easier to program in QEMU/libvirt. And ultimately that is going
> > > to be the user for 99% of this.
> > 
> > Can you be more specific about which ioctl part you're referring to?  We
> > bind/unbind each device to vfio-pci via the normal sysfs driver
> 
> Let me look again at the QEMU changes. I was thinking you did a bunch
> of ioctls to assign a device, but I am probably getting it confused
> with the vfio-group ioctls.

I try to outline this in 1/5 with the very basic example of setting up a
device.  It does take a fair number of ioctl calls, but that's hard to
avoid.  There are a few things here that we used to do via sysfs, like
reading through the resource file and using resource# for device access.
Those are effectively duplicated in vfio, but they're generalized to not
be device type specific.  To summarize the steps (a rough C sketch
follows the list):

 - $GROUP = open($GROUP_FILE)
 - ioctl($GROUP, VFIO_GROUP_GET_INFO,)
 - $IOMMU = ioctl($GROUP, VFIO_GROUP_GET_IOMMU_FD)
 - ioctl($IOMMU, VFIO_IOMMU_GET_INFO,)
 - ioctl($IOMMU, VFIO_IOMMU_MAP_DMA,)
 - $DEVICE = ioctl($GROUP, VFIO_GROUP_GET_DEVICE_FD)
 - ioctl($DEVICE, VFIO_DEVICE_GET_INFO)
 - ioctl($DEVICE, VFIO_DEVICE_GET_REGION_INFO)
 - mmap/read/write($DEVICE)
 - ioctl($DEVICE, VFIO_DEVICE_GET_IRQ_INFO)
 - ioctl($DEVICE, VFIO_DEVICE_SET_IRQS)
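
Rendered as very rough pseudo-C (the group path, device string, and the
&foo arguments standing in for per-ioctl structs are all placeholders;
patch 1/5 has the real definitions, and error handling is omitted):

int group, iommu, device;
void *bar;

group  = open("/dev/vfio/242", O_RDWR);           /* hypothetical group file */
ioctl(group, VFIO_GROUP_GET_INFO, &group_info);   /* e.g. check viability */

iommu  = ioctl(group, VFIO_GROUP_GET_IOMMU_FD);
ioctl(iommu, VFIO_IOMMU_GET_INFO, &iommu_info);
ioctl(iommu, VFIO_IOMMU_MAP_DMA, &dma_map);       /* iova/vaddr/size */

device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &region_info);

bar = mmap(NULL, region_size, PROT_READ | PROT_WRITE, MAP_SHARED,
           device, region_offset);

ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
ioctl(device, VFIO_DEVICE_SET_IRQS, &irq_set);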

Thanks,

Alex


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread

Thread overview: 11+ messages
2011-12-21 21:42 [PATCH 0/5] VFIO core framework Alex Williamson
2011-12-21 21:42 ` [PATCH 1/5] vfio: Introduce documentation for VFIO driver Alex Williamson
2011-12-28 17:16   ` [Qemu-devel] " Ronen Hod
2012-01-03 15:21     ` Alex Williamson
2011-12-21 21:42 ` [PATCH 2/5] vfio: VFIO core header Alex Williamson
2011-12-21 21:42 ` [PATCH 3/5] vfio: VFIO core group interface Alex Williamson
2011-12-21 21:42 ` [PATCH 4/5] vfio: VFIO core IOMMU mapping support Alex Williamson
2011-12-21 21:42 ` [PATCH 5/5] vfio: VFIO core Kconfig and Makefile Alex Williamson
     [not found] ` <20120110162631.GB22499@phenom.dumpdata.com>
2012-01-10 18:35   ` [PATCH 0/5] VFIO core framework Alex Williamson
2012-01-12 20:56     ` Konrad Rzeszutek Wilk
2012-01-13 22:21       ` Alex Williamson
