* [PATCH] VFIO driver: Non-privileged user level PCI drivers
@ 2010-05-28 23:07 Tom Lyon
  2010-05-28 23:36 ` Randy Dunlap
                   ` (5 more replies)
  0 siblings, 6 replies; 66+ messages in thread
From: Tom Lyon @ 2010-05-28 23:07 UTC (permalink / raw)
  To: linux-kernel, kvm, chrisw, joro, hjk, mst, avi, gregkh; +Cc: aafabbri, scofeldm

The VFIO "driver" is used to allow privileged AND non-privileged processes to 
implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
devices.
	Signed-off-by: Tom Lyon <pugs@cisco.com>
---
This patch is the evolution of code which was first proposed as a patch to
uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
out of the uio framework, and things seem much cleaner. Of course, there is
a lot of functional overlap with uio, but the previous version just seemed
like a giant mode switch in the uio code that did not lead to clarity for
either the new or old code.

[a pony for avi...]
The major new functionality in this version is the ability to handle
PCI config space accesses (through read & write calls), including table
driven code to determine what's safe to write and what is not. Some of
the config space is also virtualized, so that drivers think they're writing
certain registers when they're not. I/O space accesses are allowed as well.
Drivers for devices which use MSI-X are now prevented from directly writing
the MSI-X vector area.

All interrupts are now handled using eventfds, which makes things very simple.

The name VFIO refers to the Virtual Function capabilities of SR-IOV devices
but the driver does support many more types of devices.  I was none too sure
what driver directory this should live in, so for now I made up my own under
drivers/vfio. As a new driver/new directory, who makes the commit decision?

I currently have user level drivers working for 3 different network adapters
- the Cisco "Palo" enic, the Intel 82599 VF, and the Intel 82576 VF (but the
whole user level framework is a long way from release).  This driver could
also clearly replace a number of other drivers written just to give user
access to certain devices - but that will take time.

diff -uprN linux-2.6.34/Documentation/vfio.txt vfio-linux-2.6.34/Documentation/vfio.txt
--- linux-2.6.34/Documentation/vfio.txt	1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/Documentation/vfio.txt	2010-05-28 14:03:05.000000000 -0700
@@ -0,0 +1,176 @@
+-------------------------------------------------------------------------------
+The VFIO "driver" is used to allow privileged AND non-privileged processes to
+implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
+devices.
+
+Why is this interesting?  Some applications, especially in the high performance
+computing field, need access to hardware functions with as little overhead as
+possible. Examples are network adapters (typically non-TCP/IP based) and
+compute accelerators - i.e., array processors, FPGA processors, etc.
+Prior to the VFIO driver, these apps would need either a kernel-level
+driver (with corresponding overheads), or else root permissions to directly
+access the hardware. The VFIO driver allows generic access to the hardware
+from non-privileged apps IF the hardware is "well-behaved" enough for this
+to be safe.
+
+While there have long been ways to implement user-level drivers using specific
+corresponding drivers in the kernel, it was not until the introduction of the
+UIO driver framework and the uio_pci_generic driver that one could have a
+generic kernel component supporting many types of user level drivers. However,
+even with the uio_pci_generic driver, processes implementing the user level
+drivers had to be trusted - they could do dangerous manipulation of DMA
+addresses and were required to be root to write PCI configuration space
+registers.
+
+Recent hardware technologies - I/O MMUs and PCI I/O Virtualization - provide
+new hardware capabilities which the VFIO solution exploits to allow non-root
+user level drivers. The main role of the IOMMU is to ensure that DMA accesses
+from devices go only to the appropriate memory locations; this allows VFIO to
+ensure that user level drivers do not corrupt inappropriate memory.  PCI I/O
+virtualization (SR-IOV) was defined to allow "pass-through" of virtual devices
+to guest virtual machines. VFIO in essence implements pass-through of devices
+to user processes, not virtual machines.  SR-IOV devices implement a
+traditional PCI device (the physical function) and a dynamic number of special
+PCI devices (virtual functions) whose feature set is somewhat restricted - in
+order to allow the operating system or virtual machine monitor to ensure the
+safe operation of the system.
+
+Any SR-IOV virtual function meets the VFIO definition of "well-behaved", but
+there are many other non-IOV PCI devices which also meet the definition.
+Elements of this definition are:
+- The size of any memory BARs to be mmap'ed into the user process space must be
+  a multiple of the system page size.
+- If MSI-X interrupts are used, the device driver must not attempt to mmap or
+  write the MSI-X vector area.
+- If the device is a PCI device (not PCI-X or PCIe), it must conform to PCI
+  revision 2.3 to allow its interrupts to be masked in a generic way.
+- The device must not use the PCI configuration space in any non-standard way,
+  i.e., the user level driver will be permitted only to read and write standard
+  fields of the PCI config space, and only if those fields cannot cause harm to
+  the system. In addition, some fields are "virtualized", so that the user
+  driver can read/write them like a kernel driver, but they do not affect the
+  real device.
+- For now, there is no support for user access to the PCIe and PCI-X extended
+  capabilities configuration space.
+
+Even with these restrictions, there are bound to be devices which are unsafe
+for user level use - it is still up to the system admin to decide whether to
+grant access to the device.  When the vfio module is loaded, it will have
+access to no devices until the desired PCI devices are "bound" to the driver.
+First, make sure the devices are not bound to another kernel driver. You can
+unload that driver if you wish to unbind all its devices, or else enter the
+driver's sysfs directory, and unbind a specific device:
+	cd /sys/bus/pci/drivers/<drivername>
+	echo 0000:06:02.0 > unbind
+(0000:06:02.0 is a fully qualified PCI device name - different for each
+device).  Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
+write the vendor and device IDs of the target device to the new_id file:
+	echo 8086 10ca > new_id
+(8086 10ca are the vendor and device IDs for the Intel 82576 virtual function
+devices). A /dev/vfio<N> entry will be created for each device bound. The final
+step is to grant users permission by changing the mode and/or owner of the /dev
+entry - "chmod 666 /dev/vfio0".
+
+Reads & Writes:
+
+The user driver will typically use mmap to access the memory BAR(s) of a
+device; the I/O BARs and the PCI config space may be accessed through normal
+read and write system calls. Only one file descriptor is needed for all driver
+functions -- the desired BAR for I/O, memory, or config space is indicated via
+high-order bits of the file offset.  For instance, the following implements a
+write to the PCI config space:
+
+	#include <linux/vfio.h>
+	void pci_write_config_word(int pci_fd, u16 off, u16 wd)
+	{
+		off_t cfg_off = VFIO_PCI_CONFIG_OFF + off;
+
+		if (pwrite(pci_fd, &wd, 2, cfg_off) != 2)
+			perror("pwrite config_word");
+	}
+
+The routines vfio_pci_space_to_offset and vfio_offset_to_pci_space are provided
+in vfio.h to convert BAR numbers to file offsets and vice versa.
+
+Interrupts:
+
+Device interrupts are translated by the vfio driver into input events on event
+notification file descriptors created by the eventfd system call. The user
+program must create one or more event file descriptors and pass them to the
+vfio driver via ioctls to arrange for the interrupt mapping:
+1.
+	efd = eventfd(0, 0);
+	ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
+		This provides an eventfd for traditional IRQ interrupts.
+		IRQs will be disabled after each interrupt until the driver
+		re-enables them via the PCI COMMAND register.
+2.
+	efd = eventfd(0, 0);
+	ioctl(vfio_fd, VFIO_EVENTFD_MSI, &efd);
+		This connects MSI interrupts to an eventfd.
+3.
+	int arg[N+1];
+	arg[0] = N;
+	arg[1..N] = eventfd(0, 0);
+	ioctl(vfio_fd, VFIO_EVENTFDS_MSIX, arg);
+		This connects N MSI-X interrupts with N eventfds.
+
+Waiting and checking for interrupts is done by the user program using read,
+poll, or select on the related event file descriptors.
+
+DMA:
+
+The VFIO driver uses ioctls to allow the user level driver to get DMA
+addresses which correspond to virtual addresses.  In systems with IOMMUs,
+each PCI device will have its own address space for DMA operations, so when
+the user level driver programs the device registers, only addresses known to
+the IOMMU will be valid; any others will be rejected.  The IOMMU creates the
+illusion (to the device) that multi-page buffers are physically contiguous,
+so a single DMA operation can safely span multiple user pages.  Note that
+the VFIO driver is still useful in systems without IOMMUs, but only for
+trusted processes which can deal with DMAs which do not span pages (huge
+pages count as a single page here).
+
+If the user process desires many DMA buffers, it may be wise to do a mapping
+of a single large buffer, and then allocate the smaller buffers from the
+large one.
+
+The DMA buffers are locked into physical memory for the duration of their
+existence - until VFIO_DMA_UNMAP is called, until the user pages are
+unmapped from the user process, or until the vfio file descriptor is closed.
+The user process must have permission to lock the pages, as shown by the
+ulimit -l command, which in turn relies on settings in the
+/etc/security/limits.conf file.
+
+The vfio_dma_map structure is used as an argument to the ioctls which
+do the DMA mapping. Its vaddr, dmaaddr, and size fields must always be
+multiples of the page size. Its rdwr field is zero for read-only (outbound)
+buffers, and non-zero for read/write buffers.
+
+	struct vfio_dma_map {
+		__u64	vaddr;	  /* process virtual addr */
+		__u64	dmaaddr;  /* desired and/or returned dma address */
+		__u64	size;	  /* size in bytes */
+		int	rdwr;	  /* bool: 0 for r/o; 1 for r/w */
+	};
+
+The VFIO_DMA_MAP_ANYWHERE ioctl is called with a vfio_dma_map structure as its
+argument, and returns the structure with a valid dmaaddr field.
+
+The VFIO_DMA_MAP_IOVA ioctl is called with a vfio_dma_map structure with the
+dmaaddr field already assigned. The system will attempt to map the DMA
+buffer into the IO space at the given dmaaddr. This is expected to be
+useful if KVM or other virtualization facilities use this driver.
+
+The VFIO_DMA_UNMAP ioctl takes a fully filled vfio_dma_map structure, unmaps
+the buffer, and releases the corresponding system resources.
+
+The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
+(device dependent). It takes a single unsigned 64 bit integer as an argument.
+This call also has the side effect of enabling PCI bus mastering.
+
+Miscellaneous:
+
+The VFIO_BAR_LEN ioctl provides an easy way to determine the size of a PCI
+device's base address region. It is passed a single integer specifying which
+BAR (0-5, or 6 for the ROM BAR), and passes back the length in the same field.
diff -uprN linux-2.6.34/drivers/Kconfig vfio-linux-2.6.34/drivers/Kconfig
--- linux-2.6.34/drivers/Kconfig	2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/drivers/Kconfig	2010-05-27 17:01:02.000000000 -0700
@@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
 source "drivers/staging/Kconfig"
 
 source "drivers/platform/Kconfig"
+
+source "drivers/vfio/Kconfig"
 endmenu
diff -uprN linux-2.6.34/drivers/Makefile vfio-linux-2.6.34/drivers/Makefile
--- linux-2.6.34/drivers/Makefile	2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/drivers/Makefile	2010-05-27 17:25:33.000000000 -0700
@@ -52,6 +52,7 @@ obj-$(CONFIG_FUSION)		+= message/
 obj-$(CONFIG_FIREWIRE)		+= firewire/
 obj-y				+= ieee1394/
 obj-$(CONFIG_UIO)		+= uio/
+obj-$(CONFIG_VFIO)		+= vfio/
 obj-y				+= cdrom/
 obj-y				+= auxdisplay/
 obj-$(CONFIG_PCCARD)		+= pcmcia/
diff -uprN linux-2.6.34/drivers/vfio/Kconfig vfio-linux-2.6.34/drivers/vfio/Kconfig
--- linux-2.6.34/drivers/vfio/Kconfig	1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/Kconfig	2010-05-27 17:07:25.000000000 -0700
@@ -0,0 +1,9 @@
+menuconfig VFIO
+	tristate "Non-Priv User Space PCI drivers"
+	depends on PCI
+	help
+	  Driver to allow advanced user space drivers for PCI, PCI-X,
+	  and PCIe devices.  Requires an IOMMU to allow non-privileged
+	  processes to directly control the PCI devices.
+
+	  If you don't know what to do here, say N.
diff -uprN linux-2.6.34/drivers/vfio/Makefile vfio-linux-2.6.34/drivers/vfio/Makefile
--- linux-2.6.34/drivers/vfio/Makefile	1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/Makefile	2010-05-27 17:32:35.000000000 -0700
@@ -0,0 +1,5 @@
+obj-$(CONFIG_VFIO) := vfio.o
+
+vfio-y := vfio_main.o vfio_dma.o vfio_intrs.o \
+		vfio_pci_config.o vfio_rdwr.o vfio_sysfs.o
+
diff -uprN linux-2.6.34/drivers/vfio/vfio_dma.c vfio-linux-2.6.34/drivers/vfio/vfio_dma.c
--- linux-2.6.34/drivers/vfio/vfio_dma.c	1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_dma.c	2010-05-28 14:04:04.000000000 -0700
@@ -0,0 +1,372 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/pci.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/iommu.h>
+#include <linux/sched.h>
+
+#include <linux/vfio.h>
+
+/* Unmap DMA region */
+static void vfio_dma_unmap(struct vfio_listener *listener,
+			struct dma_map_page *mlp)
+{
+	int i;
+	struct vfio_dev *vdev = listener->vdev;
+	struct pci_dev *pdev = vdev->pdev;
+
+	mutex_lock(&vdev->gate);
+	list_del(&mlp->list);
+	if (mlp->sg) {
+		dma_unmap_sg(&pdev->dev, mlp->sg, mlp->npage,
+				mlp->rdwr ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE);
+		kfree(mlp->sg);
+	} else {
+		for (i = 0; i < mlp->npage; i++)
+			(void) iommu_unmap_range(vdev->domain,
+					mlp->daddr + i*PAGE_SIZE, PAGE_SIZE);
+	}
+	for (i = 0; i < mlp->npage; i++) {
+		if (mlp->rdwr)
+			SetPageDirty(mlp->pages[i]);
+		put_page(mlp->pages[i]);
+	}
+	listener->mm->locked_vm -= mlp->npage;
+	vdev->locked_pages -= mlp->npage;
+	kfree(mlp->pages);
+	kfree(mlp);
+	vdev->mapcount--;
+	mutex_unlock(&vdev->gate);
+}
+
+/* Unmap ALL DMA regions */
+void vfio_dma_unmapall(struct vfio_listener *listener)
+{
+	struct list_head *pos, *pos2;
+	struct dma_map_page *mlp;
+
+	list_for_each_safe(pos, pos2, &listener->dm_list) {
+		mlp = list_entry(pos, struct dma_map_page, list);
+		vfio_dma_unmap(listener, mlp);
+	}
+}
+
+int vfio_dma_unmap_dm(struct vfio_listener *listener, struct vfio_dma_map *dmp)
+{
+	unsigned long start, npage;
+	struct dma_map_page *mlp;
+	struct list_head *pos, *pos2;
+	int ret;
+
+	start = dmp->vaddr & PAGE_MASK;
+	npage = dmp->size >> PAGE_SHIFT;
+
+	ret = -ENXIO;
+	list_for_each_safe(pos, pos2, &listener->dm_list) {
+		mlp = list_entry(pos, struct dma_map_page, list);
+		if (dmp->vaddr != mlp->vaddr || mlp->npage != npage)
+			continue;
+		ret = 0;
+		vfio_dma_unmap(listener, mlp);
+		break;
+	}
+	return ret;
+}
+
+/* Handle MMU notifications - user process freed or realloced memory
+ * which may be in use in a DMA region. Clean up region if so.
+ */
+static void vfio_dma_handle_mmu_notify(struct mmu_notifier *mn,
+		unsigned long start, unsigned long end)
+{
+	struct vfio_listener *listener;
+	unsigned long myend;
+	struct list_head *pos, *pos2;
+	struct dma_map_page *mlp;
+
+	listener = container_of(mn, struct vfio_listener, mmu_notifier);
+	list_for_each_safe(pos, pos2, &listener->dm_list) {
+		mlp = list_entry(pos, struct dma_map_page, list);
+		if (mlp->vaddr >= end)
+			continue;
+		/*
+		 * Ranges overlap if they're not disjoint; and they're
+		 * disjoint if the end of one is before the start of
+		 * the other one.
+		 */
+		myend = mlp->vaddr + (mlp->npage << PAGE_SHIFT) - 1;
+		if (!(myend <= start || end <= mlp->vaddr)) {
+			printk(KERN_WARNING
+				"%s: demap start %lx end %lx va %lx pa %lx\n",
+				__func__, start, end,
+				mlp->vaddr, (long)mlp->daddr);
+			vfio_dma_unmap(listener, mlp);
+		}
+	}
+}
+
+static void vfio_dma_inval_page(struct mmu_notifier *mn,
+		struct mm_struct *mm, unsigned long addr)
+{
+	vfio_dma_handle_mmu_notify(mn, addr, addr + PAGE_SIZE);
+}
+
+static void vfio_dma_inval_range_start(struct mmu_notifier *mn,
+		struct mm_struct *mm, unsigned long start, unsigned long end)
+{
+	vfio_dma_handle_mmu_notify(mn, start, end);
+}
+
+static const struct mmu_notifier_ops vfio_dma_mmu_notifier_ops = {
+	.invalidate_page = vfio_dma_inval_page,
+	.invalidate_range_start = vfio_dma_inval_range_start,
+};
+
+/*
+ * Map usr buffer at specific IO virtual address
+ */
+static int vfio_dma_map_iova(
+		struct vfio_listener *listener,
+		unsigned long start_iova,
+		struct page **pages,
+		int npage,
+		int rdwr,
+		struct dma_map_page **mlpp)
+{
+	struct vfio_dev *vdev = listener->vdev;
+	struct pci_dev *pdev = vdev->pdev;
+	int ret;
+	int i;
+	phys_addr_t hpa;
+	struct dma_map_page *mlp;
+	unsigned long iova = start_iova;
+
+	if (vdev->domain == NULL) {
+		/* can't mix iova with anywhere */
+		if (vdev->mapcount > 0)
+			return -EINVAL;
+		if (!iommu_found())
+			return -EINVAL;
+		vdev->domain = iommu_domain_alloc();
+		if (vdev->domain == NULL)
+			return -ENXIO;
+		vdev->cachec = iommu_domain_has_cap(vdev->domain,
+					IOMMU_CAP_CACHE_COHERENCY);
+		ret = iommu_attach_device(vdev->domain, &pdev->dev);
+		if (ret) {
+			iommu_domain_free(vdev->domain);
+			vdev->domain = NULL;
+			printk(KERN_ERR "%s: device_attach failed %d\n",
+					__func__, ret);
+			return ret;
+		}
+	}
+	for (i = 0; i < npage; i++) {
+		if (iommu_iova_to_phys(vdev->domain, iova + i*PAGE_SIZE))
+			return -EBUSY;
+	}
+	rdwr = rdwr ? IOMMU_READ|IOMMU_WRITE : IOMMU_READ;
+	if (vdev->cachec)
+		rdwr |= IOMMU_CACHE;
+	for (i = 0; i < npage; i++) {
+		hpa = page_to_phys(pages[i]);
+		ret = iommu_map_range(vdev->domain, iova,
+			hpa, PAGE_SIZE, rdwr);
+		if (ret) {
+			while (--i >= 0) {
+				iova -= PAGE_SIZE;
+				(void) iommu_unmap_range(vdev->domain,
+							iova, PAGE_SIZE);
+			}
+			return ret;
+		}
+		iova += PAGE_SIZE;
+	}
+
+	mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
+	if (mlp == NULL)
+		return -ENOMEM;
+	mlp->pages = pages;
+	mlp->daddr = start_iova;
+	mlp->npage = npage;
+	*mlpp = mlp;
+	return 0;
+}
+
+/*
+ * Map user buffer - return IO virtual address
+ */
+static int vfio_dma_map_anywhere(
+		struct vfio_listener *listener,
+		struct page **pages,
+		int npage,
+		int rdwr,
+		struct dma_map_page **mlpp)
+{
+	struct vfio_dev *vdev = listener->vdev;
+	struct pci_dev *pdev = vdev->pdev;
+	struct scatterlist *sg, *nsg;
+	int i, nents;
+	struct dma_map_page *mlp;
+	unsigned long length;
+
+	if (vdev->domain) {
+		/* map anywhere and map iova don't mix */
+		if (vdev->mapcount > 0)
+			return -EINVAL;
+		iommu_domain_free(vdev->domain);
+		vdev->domain = NULL;
+	}
+	sg = kzalloc(npage * sizeof(struct scatterlist), GFP_KERNEL);
+	if (sg == NULL)
+		return -ENOMEM;
+	for (i = 0; i < npage; i++)
+		sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);
+	nents = dma_map_sg(&pdev->dev, sg, npage,
+			rdwr ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE);
+	/* The API for dma_map_sg suggests that it may squash together
+	 * adjacent pages, but noone seems to really do that. So we squash
+	 * it ourselves, because the user level wants a single buffer.
+	 * This works if (a) there is an iommu, or (b) the user allocates
+	 * large buffers from a huge page
+	 */
+	nsg = sg;
+	for (i = 1; i < nents; i++) {
+		length = sg[i].dma_length;
+		sg[i].dma_length = 0;
+		if (sg[i].dma_address == (nsg->dma_address + nsg->dma_length)) {
+			nsg->dma_length += length;
+		} else {
+			nsg++;
+			nsg->dma_address = sg[i].dma_address;
+			nsg->dma_length = length;
+		}
+	}
+	nents = 1 + (nsg - sg);
+	if (nents != 1) {
+		if (nents > 0)
+			dma_unmap_sg(&pdev->dev, sg, npage,
+					DMA_BIDIRECTIONAL);
+		for (i = 0; i < npage; i++)
+			put_page(pages[i]);
+		kfree(sg);
+		printk(KERN_ERR "%s: sequential dma mapping failed\n",
+				__func__);
+		return -EFAULT;
+	}
+
+	mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
+	if (mlp == NULL) {
+		dma_unmap_sg(&pdev->dev, sg, npage, DMA_BIDIRECTIONAL);
+		for (i = 0; i < npage; i++)
+			put_page(pages[i]);
+		kfree(sg);
+		return -ENOMEM;
+	}
+	mlp->pages = pages;
+	mlp->sg = sg;
+	mlp->daddr = sg_dma_address(sg);
+	mlp->npage = npage;
+	*mlpp = mlp;
+	return 0;
+}
+
+int vfio_dma_map_common(struct vfio_listener *listener,
+		unsigned int cmd, struct vfio_dma_map *dmp)
+{
+	int locked, lock_limit;
+	struct page **pages;
+	int npage;
+	struct dma_map_page *mlp = NULL;
+	int ret = 0;
+
+	if (dmp->vaddr & (PAGE_SIZE-1))
+		return -EINVAL;
+	if (dmp->size & (PAGE_SIZE-1))
+		return -EINVAL;
+	if (dmp->size <= 0)
+		return -EINVAL;
+	npage = dmp->size >> PAGE_SHIFT;
+
+	mutex_lock(&listener->vdev->gate);
+
+	/* account for locked pages */
+	locked = npage + current->mm->locked_vm;
+	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
+			>> PAGE_SHIFT;
+	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
+		printk(KERN_WARNING "%s: RLIMIT_MEMLOCK exceeded\n",
+			__func__);
+		ret = -ENOMEM;
+		goto out_lock;
+	}
+	/* only 1 address space per fd */
+	if (current->mm != listener->mm) {
+		if (listener->mm != NULL) {
+			ret = -EINVAL;
+			goto out_lock;
+		}
+		listener->mm = current->mm;
+		listener->mmu_notifier.ops = &vfio_dma_mmu_notifier_ops;
+		ret = mmu_notifier_register(&listener->mmu_notifier,
+						listener->mm);
+		if (ret)
+			printk(KERN_ERR "%s: mmu_notifier_register failed %d\n",
+				__func__, ret);
+		ret = 0;
+	}
+
+	pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
+	if (pages == NULL) {
+		ret = -ENOMEM;
+		goto out_lock;
+	}
+	ret = get_user_pages_fast(dmp->vaddr, npage, dmp->rdwr, pages);
+	if (ret != npage) {
+		printk(KERN_ERR "%s: get_user_pages_fast returns %d, not %d\n",
+			__func__, ret, npage);
+		while (--ret >= 0)	/* release any pages we did pin */
+			put_page(pages[ret]);
+		kfree(pages);
+		ret = -EFAULT;
+		goto out_lock;
+	}
+
+	if (cmd == VFIO_DMA_MAP_IOVA)
+		ret = vfio_dma_map_iova(listener, dmp->dmaaddr,
+				pages, npage, dmp->rdwr, &mlp);
+	else
+		ret = vfio_dma_map_anywhere(listener, pages,
+				npage, dmp->rdwr, &mlp);
+	if (ret) {
+		kfree(pages);
+		goto out_lock;
+	}
+	listener->vdev->mapcount++;
+	mlp->vaddr = dmp->vaddr;
+	mlp->rdwr = dmp->rdwr;
+	dmp->dmaaddr = mlp->daddr;
+	list_add(&mlp->list, &listener->dm_list);
+
+	current->mm->locked_vm += npage;
+	listener->vdev->locked_pages += npage;
+out_lock:
+	mutex_unlock(&listener->vdev->gate);
+	return ret;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_intrs.c vfio-linux-2.6.34/drivers/vfio/vfio_intrs.c
--- linux-2.6.34/drivers/vfio/vfio_intrs.c	1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_intrs.c	2010-05-28 14:09:15.000000000 -0700
@@ -0,0 +1,189 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ */
+
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+
+#include <linux/vfio.h>
+
+
+/*
+ * vfio_interrupt - IRQ hardware interrupt handler
+ */
+irqreturn_t vfio_interrupt(int irq, void *dev_id)
+{
+	struct vfio_dev *vdev = (struct vfio_dev *)dev_id;
+	struct pci_dev *pdev = vdev->pdev;
+	irqreturn_t ret = IRQ_NONE;
+	u32 cmd_status_dword;
+	u16 origcmd, newcmd, status;
+
+	spin_lock_irq(&vdev->lock);
+	pci_block_user_cfg_access(pdev);
+
+	/* Read both command and status registers in a single 32-bit operation.
+	 * Note: we could cache the value for command and move the status read
+	 * out of the lock if there was a way to get notified of user changes
+	 * to command register through sysfs. Should be good for shared irqs. */
+	pci_read_config_dword(pdev, PCI_COMMAND, &cmd_status_dword);
+	origcmd = cmd_status_dword;
+	status = cmd_status_dword >> 16;
+
+	/* Check interrupt status register to see whether our device
+	 * triggered the interrupt. */
+	if (!(status & PCI_STATUS_INTERRUPT))
+		goto done;
+
+	/* We triggered the interrupt, disable it. */
+	newcmd = origcmd | PCI_COMMAND_INTX_DISABLE;
+	if (newcmd != origcmd)
+		pci_write_config_word(pdev, PCI_COMMAND, newcmd);
+
+	ret = IRQ_HANDLED;
+done:
+	pci_unblock_user_cfg_access(pdev);
+	spin_unlock_irq(&vdev->lock);
+	if (ret != IRQ_HANDLED)
+		return ret;
+	if (vdev->ev_irq)
+		eventfd_signal(vdev->ev_irq, 1);
+	return ret;
+}
+
+/*
+ * MSI and MSI-X Interrupt handler.
+ * Just signal an event
+ */
+static irqreturn_t msihandler(int irq, void *arg)
+{
+	struct eventfd_ctx *ctx = arg;
+
+	eventfd_signal(ctx, 1);
+	return IRQ_HANDLED;
+}
+
+void vfio_disable_msi(struct vfio_dev *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+
+	if (vdev->ev_msi) {
+		eventfd_ctx_put(vdev->ev_msi);
+		free_irq(pdev->irq, vdev->ev_msi);
+		vdev->ev_msi = NULL;
+	}
+	pci_disable_msi(pdev);
+}
+
+int vfio_enable_msi(struct vfio_dev *vdev, int fd)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct eventfd_ctx *ctx;
+	int ret;
+
+	ctx = eventfd_ctx_fdget(fd);
+	if (IS_ERR(ctx))
+		return PTR_ERR(ctx);
+	vdev->ev_msi = ctx;
+	pci_enable_msi(pdev);
+	ret = request_irq(pdev->irq, msihandler, 0,
+			vdev->name, ctx);
+	if (ret) {
+		eventfd_ctx_put(ctx);
+		pci_disable_msi(pdev);
+		vdev->ev_msi = NULL;
+	}
+	return ret;
+}
+
+void vfio_disable_msix(struct vfio_dev *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int i;
+
+	if (vdev->ev_msix && vdev->msix) {
+		for (i = 0; i < vdev->nvec; i++) {
+			free_irq(vdev->msix[i].vector, vdev->ev_msix[i]);
+			if (vdev->ev_msix[i])
+				eventfd_ctx_put(vdev->ev_msix[i]);
+		}
+	}
+	kfree(vdev->ev_msix);
+	vdev->ev_msix = NULL;
+	kfree(vdev->msix);
+	vdev->msix = NULL;
+	vdev->nvec = 0;
+	pci_disable_msix(pdev);
+}
+
+int vfio_enable_msix(struct vfio_dev *vdev, int nvec, void __user *uarg)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	struct eventfd_ctx *ctx;
+	int ret = 0;
+	int i;
+	int fd;
+
+	vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
+				GFP_KERNEL);
+	vdev->ev_msix = kzalloc(nvec * sizeof(struct eventfd_ctx *),
+				GFP_KERNEL);
+	if (vdev->msix == NULL || vdev->ev_msix == NULL)
+		ret = -ENOMEM;
+	else {
+		for (i = 0; i < nvec; i++) {
+			if (copy_from_user(&fd, uarg, sizeof fd)) {
+				ret = -EFAULT;
+				break;
+			}
+			uarg += sizeof fd;
+			ctx = eventfd_ctx_fdget(fd);
+			if (IS_ERR(ctx)) {
+				ret = PTR_ERR(ctx);
+				break;
+			}
+			vdev->msix[i].entry = i;
+			vdev->ev_msix[i] = ctx;
+		}
+	}
+	if (!ret)
+		ret = pci_enable_msix(pdev, vdev->msix, nvec);
+	vdev->nvec = 0;
+	for (i = 0; i < nvec && !ret; i++) {
+		ret = request_irq(vdev->msix[i].vector, msihandler, 0,
+			vdev->name, vdev->ev_msix[i]);
+		if (ret)
+			break;
+		vdev->nvec = i+1;
+	}
+	if (ret)
+		vfio_disable_msix(vdev);
+	return ret;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_main.c vfio-linux-2.6.34/drivers/vfio/vfio_main.c
--- linux-2.6.34/drivers/vfio/vfio_main.c	1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_main.c	2010-05-28 14:13:38.000000000 -0700
@@ -0,0 +1,627 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/mm.h>
+#include <linux/idr.h>
+#include <linux/string.h>
+#include <linux/interrupt.h>
+#include <linux/fs.h>
+#include <linux/eventfd.h>
+#include <linux/pci.h>
+#include <linux/iommu.h>
+#include <linux/mmu_notifier.h>
+#include <linux/uaccess.h>
+
+#include <linux/vfio.h>
+
+
+#define DRIVER_VERSION	"0.1"
+#define DRIVER_AUTHOR	"Tom Lyon <pugs@cisco.com>"
+#define DRIVER_DESC	"VFIO - User Level PCI meta-driver"
+
+static int vfio_major = -1;
+DEFINE_IDR(vfio_idr);
+/* Protect idr accesses */
+DEFINE_MUTEX(vfio_minor_lock);
+
+/*
+ * Does [a1,b1) overlap [a2,b2) ?
+ */
+static inline int overlap(int a1, int b1, int a2, int b2)
+{
+	/*
+	 * Ranges overlap if they're not disjoint; and they're
+	 * disjoint if the end of one is before the start of
+	 * the other one.
+	 */
+	return !(b2 <= a1 || b1 <= a2);
+}
+
+static int vfio_open(struct inode *inode, struct file *filep)
+{
+	struct vfio_dev *vdev;
+	struct vfio_listener *listener;
+	int ret = 0;
+
+	mutex_lock(&vfio_minor_lock);
+	vdev = idr_find(&vfio_idr, iminor(inode));
+	mutex_unlock(&vfio_minor_lock);
+	if (!vdev) {
+		ret = -ENODEV;
+		goto out;
+	}
+
+	listener = kzalloc(sizeof(*listener), GFP_KERNEL);
+	if (!listener) {
+		ret = -ENOMEM;
+		goto err_alloc_listener;
+	}
+
+	listener->vdev = vdev;
+	INIT_LIST_HEAD(&listener->dm_list);
+	filep->private_data = listener;
+
+	mutex_lock(&vdev->gate);
+	if (vdev->listeners == 0) {		/* first open */
+		if (vdev->pmaster && !iommu_found() &&
+		    !capable(CAP_SYS_RAWIO)) {
+			mutex_unlock(&vdev->gate);
+			ret = -EPERM;
+			goto err_perm;
+		}
+		/* reset to known state if we can */
+		(void) pci_reset_function(vdev->pdev);
+	}
+	vdev->listeners++;
+	mutex_unlock(&vdev->gate);
+	return 0;
+
+err_perm:
+	kfree(listener);
+
+err_alloc_listener:
+out:
+	return ret;
+}
+
+static int vfio_release(struct inode *inode, struct file *filep)
+{
+	int ret = 0;
+	struct vfio_listener *listener = filep->private_data;
+	struct vfio_dev *vdev = listener->vdev;
+
+	vfio_dma_unmapall(listener);
+	if (listener->mm) {
+		mmu_notifier_unregister(&listener->mmu_notifier, listener->mm);
+		listener->mm = NULL;
+	}
+
+	mutex_lock(&vdev->gate);
+	if (--vdev->listeners <= 0) {
+		if (vdev->ev_msix)
+			vfio_disable_msix(vdev);
+		if (vdev->ev_msi)
+			vfio_disable_msi(vdev);
+		if (vdev->ev_irq) {
+			eventfd_ctx_put(vdev->ev_irq);
+			vdev->ev_irq = NULL;
+		}
+		if (vdev->domain) {
+			iommu_domain_free(vdev->domain);
+			vdev->domain = NULL;
+		}
+		/* reset to known state if we can */
+		(void) pci_reset_function(vdev->pdev);
+	}
+	mutex_unlock(&vdev->gate);
+
+	kfree(listener);
+	return ret;
+}
+
+static ssize_t vfio_read(struct file *filep, char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	struct vfio_listener *listener = filep->private_data;
+	struct vfio_dev *vdev = listener->vdev;
+	struct pci_dev *pdev = vdev->pdev;
+	int pci_space;
+
+	pci_space = vfio_offset_to_pci_space(*ppos);
+	if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
+		return vfio_config_readwrite(0, vdev, buf, count, ppos);
+	if (pci_space > PCI_ROM_RESOURCE)
+		return -EINVAL;
+	if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
+		return vfio_io_readwrite(0, vdev, buf, count, ppos);
+	if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM)
+		return vfio_mem_readwrite(0, vdev, buf, count, ppos);
+	if (pci_space == PCI_ROM_RESOURCE)
+		return vfio_mem_readwrite(0, vdev, buf, count, ppos);
+	return -EINVAL;
+}
+
+static int vfio_msix_check(struct vfio_dev *vdev, u64 start, u32 len)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u16 pos;
+	u32 table_offset;
+	u16 table_size;
+	u8 bir;
+	u32 lo, hi, startp, endp;
+
+	pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
+	if (!pos)
+		return 0;
+
+	pci_read_config_word(pdev, pos + PCI_MSIX_FLAGS, &table_size);
+	table_size = (table_size & PCI_MSIX_FLAGS_QSIZE) + 1;
+	pci_read_config_dword(pdev, pos + 4, &table_offset);
+	bir = table_offset & PCI_MSIX_FLAGS_BIRMASK;
+	table_offset &= ~PCI_MSIX_FLAGS_BIRMASK;
+	lo = table_offset >> PAGE_SHIFT;
+	hi = (table_offset + PCI_MSIX_ENTRY_SIZE * table_size + PAGE_SIZE - 1)
+		>> PAGE_SHIFT;
+	startp = start >> PAGE_SHIFT;
+	endp = (start + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (bir == vfio_offset_to_pci_space(start) &&
+	    overlap(lo, hi, startp, endp)) {
+		printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
+			__func__);
+		return -EINVAL;
+	}
+	return 0;
+}
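[Editor's note: the check above rounds both the MSI-X table and the user's request to page ranges before testing overlap; a user-space sketch of that rounding, assuming 4 KiB pages and invented offsets:]

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12	/* assume 4 KiB pages for the sketch */
#define DEMO_PAGE_SIZE (1u << DEMO_PAGE_SHIFT)

static int overlap(uint32_t a1, uint32_t b1, uint32_t a2, uint32_t b2)
{
	return !(b2 <= a1 || b1 <= a2);
}

/* Does the byte range [start, start+len) touch any page of the table
 * range [tbl_off, tbl_off + tbl_len)? Mirrors vfio_msix_check(). */
static int range_hits_table(uint64_t start, uint32_t len,
			    uint32_t tbl_off, uint32_t tbl_len)
{
	uint32_t lo = tbl_off >> DEMO_PAGE_SHIFT;
	uint32_t hi = (tbl_off + tbl_len + DEMO_PAGE_SIZE - 1)
			>> DEMO_PAGE_SHIFT;
	uint32_t startp = start >> DEMO_PAGE_SHIFT;
	uint32_t endp = (start + len + DEMO_PAGE_SIZE - 1)
			>> DEMO_PAGE_SHIFT;

	return overlap(lo, hi, startp, endp);
}
```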
+
+static ssize_t vfio_write(struct file *filep, const char __user *buf,
+			size_t count, loff_t *ppos)
+{
+	struct vfio_listener *listener = filep->private_data;
+	struct vfio_dev *vdev = listener->vdev;
+	struct pci_dev *pdev = vdev->pdev;
+	int pci_space;
+	int ret;
+
+	pci_space = vfio_offset_to_pci_space(*ppos);
+	if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
+		return vfio_config_readwrite(1, vdev,
+					(char __user *)buf, count, ppos);
+	if (pci_space > PCI_ROM_RESOURCE)
+		return -EINVAL;
+	if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
+		return vfio_io_readwrite(1, vdev,
+					(char __user *)buf, count, ppos);
+	if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) {
+		/* don't allow writes to msi-x vectors */
+		ret = vfio_msix_check(vdev, *ppos, count);
+		if (ret)
+			return ret;
+		return vfio_mem_readwrite(1, vdev,
+				(char __user *)buf, count, ppos);
+	}
+	return -EINVAL;
+}
+
+static int vfio_mmap(struct file *filep, struct vm_area_struct *vma)
+{
+	struct vfio_listener *listener = filep->private_data;
+	struct vfio_dev *vdev = listener->vdev;
+	struct pci_dev *pdev = vdev->pdev;
+	unsigned long requested, actual;
+	int pci_space;
+	u64 start;
+	u32 len;
+	unsigned long phys;
+	int ret;
+
+	if (vma->vm_end < vma->vm_start)
+		return -EINVAL;
+	if ((vma->vm_flags & VM_SHARED) == 0)
+		return -EINVAL;
+
+	pci_space = vfio_offset_to_pci_space((u64)vma->vm_pgoff << PAGE_SHIFT);
+	if (pci_space > PCI_ROM_RESOURCE)
+		return -EINVAL;
+	switch (pci_space) {
+	case PCI_ROM_RESOURCE:
+		if (vma->vm_flags & VM_WRITE)
+			return -EINVAL;
+		if (pci_resource_flags(pdev, PCI_ROM_RESOURCE) == 0)
+			return -EINVAL;
+		actual = pci_resource_len(pdev, PCI_ROM_RESOURCE) >> PAGE_SHIFT;
+		break;
+	default:
+		if ((pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) == 0)
+			return -EINVAL;
+		actual = pci_resource_len(pdev, pci_space) >> PAGE_SHIFT;
+		break;
+	}
+
+	requested = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+	if (requested > actual || actual == 0)
+		return -EINVAL;
+
+	/*
+	 * Don't allow non-privileged users to mmap the MSI-X vector
+	 * area, else they could write anywhere in physical memory.
+	 */
+	start = (u64)vma->vm_pgoff << PAGE_SHIFT;
+	len = vma->vm_end - vma->vm_start;
+	if (vma->vm_flags & VM_WRITE) {
+		ret = vfio_msix_check(vdev, start, len);
+		if (ret)
+			return ret;
+	}
+
+	vma->vm_private_data = vdev;
+	vma->vm_flags |= VM_IO | VM_RESERVED;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	phys = pci_resource_start(pdev, pci_space) >> PAGE_SHIFT;
+
+	return remap_pfn_range(vma, vma->vm_start, phys,
+			       vma->vm_end - vma->vm_start,
+			       vma->vm_page_prot);
+}
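[Editor's note: the read/write/mmap paths all decode a BAR index from the file offset via vfio_offset_to_pci_space(), which lives in the patch's linux/vfio.h and is not shown in this chunk. Assuming the convention suggested by "pos = (*ppos & 0xFFFFFFFF)" in vfio_rdwr.c below — resource index in the bits above the low 32 — a user-space sketch of such an encoding:]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical offset layout: bits 63..32 = PCI resource index,
 * bits 31..0 = byte offset within that resource. */
static uint64_t demo_space_to_offset(int space, uint32_t off)
{
	return ((uint64_t)space << 32) | off;
}

static int demo_offset_to_pci_space(uint64_t offset)
{
	return offset >> 32;
}
```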
+
+static long vfio_unl_ioctl(struct file *filep,
+			unsigned int cmd,
+			unsigned long arg)
+{
+	struct vfio_listener *listener = filep->private_data;
+	struct vfio_dev *vdev = listener->vdev;
+	void __user *uarg = (void __user *)arg;
+	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_dma_map dm;
+	int ret = 0;
+	u64 mask;
+	int fd, nfd;
+	int bar;
+
+	if (vdev == NULL)
+		return -EINVAL;
+
+	switch (cmd) {
+
+	case VFIO_DMA_MAP_ANYWHERE:
+	case VFIO_DMA_MAP_IOVA:
+		if (copy_from_user(&dm, uarg, sizeof dm))
+			return -EFAULT;
+		ret = vfio_dma_map_common(listener, cmd, &dm);
+		if (!ret && copy_to_user(uarg, &dm, sizeof dm))
+			ret = -EFAULT;
+		break;
+
+	case VFIO_DMA_UNMAP:
+		if (copy_from_user(&dm, uarg, sizeof dm))
+			return -EFAULT;
+		ret = vfio_dma_unmap_dm(listener, &dm);
+		break;
+
+	case VFIO_DMA_MASK:	/* set master mode and DMA mask */
+		if (copy_from_user(&mask, uarg, sizeof mask))
+			return -EFAULT;
+		pci_set_master(pdev);
+		ret = pci_set_dma_mask(pdev, mask);
+		break;
+
+	case VFIO_EVENTFD_IRQ:
+		if (copy_from_user(&fd, uarg, sizeof fd))
+			return -EFAULT;
+		if (vdev->ev_irq)
+			eventfd_ctx_put(vdev->ev_irq);
+		if (fd >= 0) {
+			vdev->ev_irq = eventfd_ctx_fdget(fd);
+			if (vdev->ev_irq == NULL)
+				ret = -EINVAL;
+		}
+		break;
+
+	case VFIO_EVENTFD_MSI:
+		if (copy_from_user(&fd, uarg, sizeof fd))
+			return -EFAULT;
+		if (fd >= 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
+			ret = vfio_enable_msi(vdev, fd);
+		else if (fd < 0 && vdev->ev_msi)
+			vfio_disable_msi(vdev);
+		else
+			ret = -EINVAL;
+		break;
+
+	case VFIO_EVENTFDS_MSIX:
+		if (copy_from_user(&nfd, uarg, sizeof nfd))
+			return -EFAULT;
+		uarg += sizeof nfd;
+		if (nfd > 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
+			ret = vfio_enable_msix(vdev, nfd, uarg);
+		else if (nfd == 0 && vdev->ev_msix)
+			vfio_disable_msix(vdev);
+		else
+			ret = -EINVAL;
+		break;
+
+	case VFIO_BAR_LEN:
+		if (copy_from_user(&bar, uarg, sizeof bar))
+			return -EFAULT;
+		if (bar < 0 || bar > PCI_ROM_RESOURCE)
+			return -EINVAL;
+		bar = pci_resource_len(pdev, bar);
+		if (copy_to_user(uarg, &bar, sizeof bar))
+			return -EFAULT;
+		break;
+
+	default:
+		return -EINVAL;
+	}
+	return ret;
+}
+
+static const struct file_operations vfio_fops = {
+	.owner		= THIS_MODULE,
+	.open		= vfio_open,
+	.release	= vfio_release,
+	.read		= vfio_read,
+	.write		= vfio_write,
+	.unlocked_ioctl	= vfio_unl_ioctl,
+	.mmap		= vfio_mmap,
+};
+
+static int vfio_get_devnum(struct vfio_dev *vdev)
+{
+	int retval = -ENOMEM;
+	int id;
+
+	mutex_lock(&vfio_minor_lock);
+	if (idr_pre_get(&vfio_idr, GFP_KERNEL) == 0)
+		goto exit;
+
+	retval = idr_get_new(&vfio_idr, vdev, &id);
+	if (retval < 0) {
+		if (retval == -EAGAIN)
+			retval = -ENOMEM;
+		goto exit;
+	}
+	if (id > MINORMASK) {
+		idr_remove(&vfio_idr, id);
+		retval = -ENOMEM;
+		goto exit;
+	}
+	if (vfio_major < 0) {
+		retval = register_chrdev(0, "vfio", &vfio_fops);
+		if (retval < 0)
+			goto exit;
+		vfio_major = retval;
+	}
+
+	retval = MKDEV(vfio_major, id);
+exit:
+	mutex_unlock(&vfio_minor_lock);
+	return retval;
+}
+
+static void vfio_free_minor(struct vfio_dev *vdev)
+{
+	mutex_lock(&vfio_minor_lock);
+	idr_remove(&vfio_idr, MINOR(vdev->devnum));
+	mutex_unlock(&vfio_minor_lock);
+}
+
+/*
+ * Verify that the device supports Interrupt Disable bit in command register,
+ * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
+ * in PCI 2.2.  (from uio_pci_generic)
+ */
+static int verify_pci_2_3(struct pci_dev *pdev)
+{
+	u16 orig, new;
+	int err = 0;
+	u8 line;
+
+	pci_block_user_cfg_access(pdev);
+
+	pci_read_config_byte(pdev, PCI_INTERRUPT_LINE, &line);
+	if (line == 0)
+		goto out;
+
+	pci_read_config_word(pdev, PCI_COMMAND, &orig);
+	pci_write_config_word(pdev, PCI_COMMAND,
+			      orig ^ PCI_COMMAND_INTX_DISABLE);
+	pci_read_config_word(pdev, PCI_COMMAND, &new);
+	/* There's no way to protect against
+	 * hardware bugs or detect them reliably, but as long as we know
+	 * what the value should be, let's go ahead and check it. */
+	if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
+		err = -EBUSY;
+		dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
+			"driver or HW bug?\n", orig, new);
+		goto out;
+	}
+	if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
+		dev_warn(&pdev->dev, "Device does not support "
+			 "disabling interrupts: unable to bind.\n");
+		err = -ENODEV;
+		goto out;
+	}
+	/* Now restore the original value. */
+	pci_write_config_word(pdev, PCI_COMMAND, orig);
+out:
+	pci_unblock_user_cfg_access(pdev);
+	return err;
+}
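[Editor's note: the XOR trick above detects whether the Interrupt Disable bit actually toggled without disturbing the other command bits; the classification logic in isolation, with register values invented for the demo:]

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_INTX_DISABLE 0x400	/* PCI_COMMAND_INTX_DISABLE */

/* Classify a write-then-readback of PCI_COMMAND after flipping the
 * INTx Disable bit: 0 = PCI 2.3 ok, -1 = unrelated bits changed
 * (driver or HW bug), -2 = bit is read-only (pre-2.3 device). */
static int demo_classify(uint16_t orig, uint16_t new)
{
	if ((new ^ orig) & ~DEMO_INTX_DISABLE)
		return -1;
	if (!((new ^ orig) & DEMO_INTX_DISABLE))
		return -2;
	return 0;
}
```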
+
+static int pci_is_master(struct pci_dev *pdev)
+{
+	int ret;
+	u16 orig, new;
+
+	if (pci_find_capability(pdev, PCI_CAP_ID_MSI))
+		return 1;
+	if (pci_find_capability(pdev, PCI_CAP_ID_MSIX))
+		return 1;
+
+	pci_block_user_cfg_access(pdev);
+
+	pci_read_config_word(pdev, PCI_COMMAND, &orig);
+	ret = orig & PCI_COMMAND_MASTER;
+	if (!ret) {
+		new = orig | PCI_COMMAND_MASTER;
+		pci_write_config_word(pdev, PCI_COMMAND, new);
+		pci_read_config_word(pdev, PCI_COMMAND, &new);
+		ret = new & PCI_COMMAND_MASTER;
+		pci_write_config_word(pdev, PCI_COMMAND, orig);
+	}
+
+	pci_unblock_user_cfg_access(pdev);
+	return ret;
+}
+
+static int vfio_probe(struct pci_dev *pdev, const struct pci_device_id *id)
+{
+	struct vfio_dev *vdev;
+	int err;
+
+	err = pci_enable_device(pdev);
+	if (err) {
+		dev_err(&pdev->dev, "%s: pci_enable_device failed: %d\n",
+			__func__, err);
+		return err;
+	}
+
+	err = verify_pci_2_3(pdev);
+	if (err)
+		goto err_verify;
+
+	vdev = kzalloc(sizeof(struct vfio_dev), GFP_KERNEL);
+	if (!vdev) {
+		err = -ENOMEM;
+		goto err_alloc;
+	}
+	vdev->pdev = pdev;
+	vdev->pmaster = pci_is_master(pdev);
+
+	err = vfio_class_init();
+	if (err)
+		goto err_class;
+
+	mutex_init(&vdev->gate);
+
+	err = vfio_get_devnum(vdev);
+	if (err < 0)
+		goto err_get_devnum;
+	vdev->devnum = err;
+	err = 0;
+
+	sprintf(vdev->name, "vfio%d", MINOR(vdev->devnum));
+	pci_set_drvdata(pdev, vdev);
+	vdev->dev = device_create(vfio_class->class, &pdev->dev,
+			  vdev->devnum, vdev, vdev->name);
+	if (IS_ERR(vdev->dev)) {
+		printk(KERN_ERR "VFIO: device register failed\n");
+		err = PTR_ERR(vdev->dev);
+		goto err_device_create;
+	}
+
+	err = vfio_dev_add_attributes(vdev);
+	if (err)
+		goto err_vfio_dev_add_attributes;
+
+	if (pdev->irq > 0) {
+		err = request_irq(pdev->irq, vfio_interrupt,
+				  IRQF_SHARED, "vfio", vdev);
+		if (err)
+			goto err_request_irq;
+	}
+	vdev->vinfo.bardirty = 1;
+
+	return 0;
+
+err_request_irq:
+#ifdef notdef
+	vfio_dev_del_attributes(vdev);
+#endif
+err_vfio_dev_add_attributes:
+	device_destroy(vfio_class->class, vdev->devnum);
+err_device_create:
+	vfio_free_minor(vdev);
+err_get_devnum:
+err_class:
+	kfree(vdev);
+err_alloc:
+err_verify:
+	pci_disable_device(pdev);
+	return err;
+}
+
+static void vfio_remove(struct pci_dev *pdev)
+{
+	struct vfio_dev *vdev = pci_get_drvdata(pdev);
+
+	vfio_free_minor(vdev);
+
+	if (pdev->irq > 0)
+		free_irq(pdev->irq, vdev);
+
+#ifdef notdef
+	vfio_dev_del_attributes(vdev);
+#endif
+
+	pci_set_drvdata(pdev, NULL);
+	device_destroy(vfio_class->class, vdev->devnum);
+	kfree(vdev);
+	vfio_class_destroy();
+	pci_disable_device(pdev);
+}
+
+static struct pci_driver driver = {
+	.name = "vfio",
+	.id_table = NULL, /* only dynamic id's */
+	.probe = vfio_probe,
+	.remove = vfio_remove,
+};
+
+static int __init init(void)
+{
+	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
+	return pci_register_driver(&driver);
+}
+
+static void __exit cleanup(void)
+{
+	if (vfio_major >= 0)
+		unregister_chrdev(vfio_major, "vfio");
+	pci_unregister_driver(&driver);
+}
+
+module_init(init);
+module_exit(cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff -uprN linux-2.6.34/drivers/vfio/vfio_pci_config.c vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c
--- linux-2.6.34/drivers/vfio/vfio_pci_config.c	1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c	2010-05-28 14:26:47.000000000 -0700
@@ -0,0 +1,554 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ */
+
+#include <linux/fs.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+
+#define PCI_CAP_ID_BASIC	0
+#ifndef PCI_CAP_ID_MAX
+#define	PCI_CAP_ID_MAX		PCI_CAP_ID_AF
+#endif
+
+/*
+ * Lengths of PCI Config Capabilities
+ * 0 means unknown (but at least 4)
+ * FF means special/variable
+ */
+static u8 pci_capability_length[] = {
+	[PCI_CAP_ID_BASIC]	= 64,		/* pci config header */
+	[PCI_CAP_ID_PM]		= PCI_PM_SIZEOF,
+	[PCI_CAP_ID_AGP]	= PCI_AGP_SIZEOF,
+	[PCI_CAP_ID_VPD]	= 8,
+	[PCI_CAP_ID_SLOTID]	= 4,
+	[PCI_CAP_ID_MSI]	= 0xFF,		/* 10, 14, or 24 */
+	[PCI_CAP_ID_CHSWP]	= 4,
+	[PCI_CAP_ID_PCIX]	= 0xFF,		/* 8 or 24 */
+	[PCI_CAP_ID_HT]		= 28,
+	[PCI_CAP_ID_VNDR]	= 0xFF,
+	[PCI_CAP_ID_DBG]	= 0,
+	[PCI_CAP_ID_CCRC]	= 0,
+	[PCI_CAP_ID_SHPC]	= 0,
+	[PCI_CAP_ID_SSVID]	= 0,		/* bridge only - not supp */
+	[PCI_CAP_ID_AGP3]	= 0,
+	[PCI_CAP_ID_EXP]	= 36,
+	[PCI_CAP_ID_MSIX]	= 12,
+	[PCI_CAP_ID_AF]		= 6,
+};
+
+/*
+ * Read/Write Permission Bits - one bit for each bit in a capability.
+ * Any field can be read if it exists, but what is read depends on
+ * whether the field is 'virtualized' or passed through to the hardware.
+ * Any field virtualized for reads is also virtualized for writes.
+ * Writes are permitted only for bits set to 1 here.
+ */
+struct perm_bits {
+	u32	rvirt;		/* read bits which must be virtualized */
+	u32	write;		/* writeable bits - virt if read virt */
+};
+
+static struct perm_bits pci_cap_basic_perm[] = {
+	{ 0xFFFFFFFF,	0, },		/* 0x00 vendor & device id - RO */
+	{ 0,		0xFFFFFFFC, },	/* 0x04 cmd & status except mem/io */
+	{ 0,		0xFF00FFFF, },	/* 0x08 bist, htype, lat, cache */
+	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x0c bar */
+	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x10 bar */
+	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x14 bar */
+	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x18 bar */
+	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x1c bar */
+	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x20 bar */
+	{ 0,		0, },		/* 0x24 cardbus - not yet */
+	{ 0,		0, },		/* 0x28 subsys vendor & dev */
+	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x2c rom bar */
+	{ 0,		0, },		/* 0x30 capability ptr & resv */
+	{ 0,		0, },		/* 0x34 resv */
+	{ 0,		0, },		/* 0x38 resv */
+	{ 0x000000FF,	0x000000FF, },	/* 0x3c max_lat ... irq */
+};
+
+static struct perm_bits pci_cap_pm_perm[] = {
+	{ 0,		0, },		/* 0x00 PM capabilities */
+	{ 0,		0xFFFFFFFF, },	/* 0x04 PM control/status */
+};
+
+static struct perm_bits pci_cap_vpd_perm[] = {
+	{ 0,		0xFFFF0000, },	/* 0x00 address */
+	{ 0,		0xFFFFFFFF, },	/* 0x04 data */
+};
+
+static struct perm_bits pci_cap_slotid_perm[] = {
+	{ 0,		0, },		/* 0x00 all read only */
+};
+
+static struct perm_bits pci_cap_msi_perm[] = {
+	{ 0,		0, },		/* 0x00 MSI message control */
+	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x04 MSI message address */
+	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x08 MSI message addr/data */
+	{ 0x0000FFFF,	0x0000FFFF, },	/* 0x0c MSI message data */
+	{ 0,		0xFFFFFFFF, },	/* 0x10 MSI mask bits */
+	{ 0,		0xFFFFFFFF, },	/* 0x14 MSI pending bits */
+};
+
+static struct perm_bits pci_cap_pcix_perm[] = {
+	{ 0,		0xFFFF0000, },	/* 0x00 PCI_X_CMD */
+	{ 0,		0, },		/* 0x04 PCI_X_STATUS */
+	{ 0,		0xFFFFFFFF, },	/* 0x08 ECC ctlr & status */
+	{ 0,		0, },		/* 0x0c ECC first addr */
+	{ 0,		0, },		/* 0x10 ECC second addr */
+	{ 0,		0, },		/* 0x14 ECC attr */
+};
+
+/* pci express capabilities */
+static struct perm_bits pci_cap_exp_perm[] = {
+	{ 0,		0, },		/* 0x00 PCIe capabilities */
+	{ 0,		0, },		/* 0x04 PCIe device capabilities */
+	{ 0,		0xFFFFFFFF, },	/* 0x08 PCIe device control & status */
+	{ 0,		0, },		/* 0x0c PCIe link capabilities */
+	{ 0,		0x000000FF, },	/* 0x10 PCIe link ctl/stat - SAFE? */
+	{ 0,		0, },		/* 0x14 PCIe slot capabilities */
+	{ 0,		0x00FFFFFF, },	/* 0x18 PCIe link ctl/stat - SAFE? */
+	{ 0,		0, },		/* 0x1c PCIe root port stuff */
+	{ 0,		0, },		/* 0x20 PCIe root port stuff */
+};
+
+static struct perm_bits pci_cap_msix_perm[] = {
+	{ 0,		0, },		/* 0x00 MSI-X Enable */
+	{ 0,		0, },		/* 0x04 table offset & bir */
+	{ 0,		0, },		/* 0x08 pba offset & bir */
+};
+
+static struct perm_bits pci_cap_af_perm[] = {
+	{ 0,		0, },		/* 0x00 af capability */
+	{ 0,		0x0001,	 },	/* 0x04 af flr bit */
+};
+
+static struct perm_bits *pci_cap_perms[] = {
+	[PCI_CAP_ID_BASIC]	= pci_cap_basic_perm,
+	[PCI_CAP_ID_PM]		= pci_cap_pm_perm,
+	[PCI_CAP_ID_VPD]	= pci_cap_vpd_perm,
+	[PCI_CAP_ID_SLOTID]	= pci_cap_slotid_perm,
+	[PCI_CAP_ID_MSI]	= pci_cap_msi_perm,
+	[PCI_CAP_ID_PCIX]	= pci_cap_pcix_perm,
+	[PCI_CAP_ID_EXP]	= pci_cap_exp_perm,
+	[PCI_CAP_ID_MSIX]	= pci_cap_msix_perm,
+	[PCI_CAP_ID_AF]		= pci_cap_af_perm,
+};
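[Editor's note: vfio_config_rwbyte() below indexes these tables one dword entry per four config bytes and shifts out the byte lane it needs; that extraction can be checked in user space (mask values below are illustrative, not from the driver's tables):]

```c
#include <assert.h>
#include <stdint.h>

struct perm_bits {
	uint32_t rvirt;		/* read bits which must be virtualized */
	uint32_t write;		/* writeable bits */
};

/* Demo table: one entry per config dword, as in the driver. */
static const struct perm_bits demo_tbl[2] = {
	{ 0, 0xFF00FFFF },	/* like the 0x08 header dword */
	{ 0, 0x0000FFFF },
};

/* Extract the write-permission byte for config offset 'off'. */
static uint8_t write_mask_for(const struct perm_bits *tbl, int off)
{
	const struct perm_bits *p = tbl + (off >> 2);

	return p->write >> ((off & 3) * 8);
}
```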
+
+/*
+ * We build a map of the config space that tells us where
+ * and what capabilities exist, so that we can map reads and
+ * writes back to capabilities, and thus figure out what to
+ * allow, deny, or virtualize
+ */
+int vfio_build_config_map(struct vfio_dev *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 *map;
+	int i, len;
+	u8 pos, cap, tmp;
+	u16 flags;
+	int ret;
+	int loops = 100;
+
+	map = kmalloc(pdev->cfg_size, GFP_KERNEL);
+	if (map == NULL)
+		return -ENOMEM;
+	memset(map, 0xFF, pdev->cfg_size);
+	vdev->pci_config_map = map;
+
+	/* default config space */
+	for (i = 0; i < pci_capability_length[0]; i++)
+		map[i] = 0;
+
+	/* any capabilities? */
+	ret = pci_read_config_word(pdev, PCI_STATUS, &flags);
+	if (ret < 0)
+		return ret;
+	if ((flags & PCI_STATUS_CAP_LIST) == 0)
+		return 0;
+
+	ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
+	if (ret < 0)
+		return ret;
+	while (pos && --loops > 0) {
+		ret = pci_read_config_byte(pdev, pos, &cap);
+		if (ret < 0)
+			return ret;
+		if (cap == 0) {
+			printk(KERN_WARNING "%s: cap 0\n", __func__);
+			break;
+		}
+		if (cap > PCI_CAP_ID_MAX) {
+			printk(KERN_WARNING "%s: unknown pci capability id %x\n",
+					__func__, cap);
+			len = 0;
+		} else
+			len = pci_capability_length[cap];
+		if (len == 0) {
+			printk(KERN_WARNING "%s: unknown length for pci cap %x\n",
+					__func__, cap);
+			len = 4;
+		}
+		if (len == 0xFF) {
+			switch (cap) {
+			case PCI_CAP_ID_MSI:
+				ret = pci_read_config_word(pdev,
+						pos + PCI_MSI_FLAGS, &flags);
+				if (ret < 0)
+					return ret;
+				if (flags & PCI_MSI_FLAGS_MASKBIT)
+					/* per vec masking */
+					len = 24;
+				else if (flags & PCI_MSI_FLAGS_64BIT)
+					/* 64 bit */
+					len = 14;
+				else
+					len = 10;
+				break;
+			case PCI_CAP_ID_PCIX:
+				ret = pci_read_config_word(pdev, pos + 2,
+					&flags);
+				if (ret < 0)
+					return ret;
+				if (flags & 0x3000)	/* PCI-X 2.0 */
+					len = 24;
+				else
+					len = 8;
+				break;
+			case PCI_CAP_ID_VNDR:
+				/* length follows next field */
+				ret = pci_read_config_byte(pdev, pos + 2, &tmp);
+				if (ret < 0)
+					return ret;
+				len = tmp;
+				break;
+			default:
+				len = 0;
+				break;
+			}
+		}
+
+		for (i = 0; i < len; i++) {
+			if (map[pos+i] != 0xFF)
+				printk(KERN_WARNING
+					"%s: pci config conflict at %x, "
+					"caps %x %x\n",
+					__func__, pos + i, map[pos+i], cap);
+			map[pos+i] = cap;
+		}
+		ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &pos);
+		if (ret < 0)
+			return ret;
+	}
+	if (loops <= 0)
+		printk(KERN_ERR "%s: config space loop!\n", __func__);
+	return 0;
+}
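[Editor's note: the walk above follows the standard PCI capability linked list with a loop guard against malformed chains; the same traversal over a fake config-space image in user-space C (config bytes invented for the demo):]

```c
#include <assert.h>
#include <stdint.h>

/* Walk a PCI capability list in a 256-byte config image, counting
 * capabilities, with the same 100-iteration guard as the driver.
 * cfg[0x34] holds the first pointer; each cap is {id, next, ...}. */
static int demo_count_caps(const uint8_t *cfg)
{
	int loops = 100, n = 0;
	uint8_t pos = cfg[0x34];

	while (pos && --loops > 0) {
		if (cfg[pos] == 0)
			break;
		n++;
		pos = cfg[pos + 1];
	}
	return n;
}

static int demo_run(void)
{
	uint8_t cfg[256] = { 0 };

	cfg[0x34] = 0x40;			/* first cap at 0x40 */
	cfg[0x40] = 0x01; cfg[0x41] = 0x50;	/* PM -> next at 0x50 */
	cfg[0x50] = 0x05; cfg[0x51] = 0x00;	/* MSI, end of list */
	return demo_count_caps(cfg);
}
```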
+
+static void vfio_virt_init(struct vfio_dev *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int bar;
+	u32 *lp;
+	u32 val;
+	u8 pos;
+	u16 flags;
+	int i, len;
+	int ret;
+
+	for (bar = 0; bar <= 5; bar++) {
+		lp = (u32 *)&vdev->vinfo.bar[bar * 4];
+		pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0 + 4*bar, &val);
+		*lp = val;
+	}
+	lp = (u32 *)vdev->vinfo.rombar;
+	pci_read_config_dword(pdev, PCI_ROM_ADDRESS, &val);
+	*lp = val;
+
+	vdev->vinfo.intr = pdev->irq;
+
+	pos = pci_find_capability(pdev, PCI_CAP_ID_MSI);
+	if (pos > 0) {
+		ret = pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &flags);
+		if (ret < 0)
+			return;
+		if (flags & PCI_MSI_FLAGS_MASKBIT)	/* per vec masking */
+			len = 24;
+		else if (flags & PCI_MSI_FLAGS_64BIT)	/* 64 bit */
+			len = 14;
+		else
+			len = 10;
+		for (i = 0; i < len; i++)
+			(void) pci_read_config_byte(pdev, pos + i,
+						&vdev->vinfo.msi[i]);
+	}
+}
+
+static void vfio_bar_fixup(struct vfio_dev *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int bar;
+	u32 *lp;
+	u32 len;
+
+	for (bar = 0; bar <= 5; bar++) {
+		len = pci_resource_len(pdev, bar);
+		lp = (u32 *)&vdev->vinfo.bar[bar * 4];
+		if (len == 0) {
+			*lp = 0;
+		} else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
+			*lp &= ~0x1;
+			*lp = (*lp & ~(len-1)) |
+				(*lp & ~PCI_BASE_ADDRESS_MEM_MASK);
+			if (*lp & PCI_BASE_ADDRESS_MEM_TYPE_64)
+				bar++;
+		} else if (pci_resource_flags(pdev, bar) & IORESOURCE_IO) {
+			*lp |= PCI_BASE_ADDRESS_SPACE_IO;
+			*lp = (*lp & ~(len-1)) |
+				(*lp & ~PCI_BASE_ADDRESS_IO_MASK);
+		}
+	}
+	lp = (u32 *)vdev->vinfo.rombar;
+	len = pci_resource_len(pdev, PCI_ROM_RESOURCE);
+	*lp = *lp & PCI_ROM_ADDRESS_MASK & ~(len-1);
+	vdev->vinfo.bardirty = 0;
+}
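[Editor's note: a virtualized memory BAR read must report an address aligned to the resource size while preserving the low flag bits, which is what the masking above implements; the core mask in isolation (addresses and sizes below are illustrative):]

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MEM_FLAG_BITS 0x0F	/* low BAR bits: type/prefetch flags */

/* Align a virtualized memory BAR value to the resource size while
 * keeping its flag bits, as vfio_bar_fixup() does. len must be a
 * power of two (PCI resource sizes are). */
static uint32_t demo_fix_mem_bar(uint32_t bar, uint32_t len)
{
	return (bar & ~(len - 1)) | (bar & DEMO_MEM_FLAG_BITS);
}
```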
+
+static int vfio_config_rwbyte(int write,
+				struct vfio_dev *vdev,
+				int pos,
+				char __user *buf)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u8 *map = vdev->pci_config_map;
+	u8 cap, val, newval;
+	u16 start, off;
+	int p;
+	struct perm_bits *perm;
+	u8 wr, virt;
+	int ret;
+
+	cap = map[pos];
+	if (cap == 0xFF) {	/* unknown region */
+		if (write)
+			return 0;	/* silent no-op */
+		val = 0;
+		if (pos <= pci_capability_length[0])	/* ok to read */
+			(void) pci_read_config_byte(pdev, pos, &val);
+		if (copy_to_user(buf, &val, 1))
+			return -EFAULT;
+		return 0;
+	}
+
+	/* scan back to start of cap region */
+	for (p = pos; p >= 0; p--) {
+		if (map[p] != cap)
+			break;
+		start = p;
+	}
+	off = pos - start;	/* offset within capability */
+
+	perm = pci_cap_perms[cap];
+	if (perm == NULL) {
+		wr = 0;
+		virt = 0;
+	} else {
+		perm += (off >> 2);
+		wr = perm->write >> ((off & 3) * 8);
+		virt = perm->rvirt >> ((off & 3) * 8);
+	}
+	if (write && !wr)		/* no writeable bits */
+		return 0;
+	if (!virt) {
+		if (write) {
+			if (copy_from_user(&val, buf, 1))
+				return -EFAULT;
+			val &= wr;
+			if (wr != 0xFF) {
+				u8 existing;
+
+				ret = pci_read_config_byte(pdev, pos,
+								&existing);
+				if (ret < 0)
+					return ret;
+				val |= (existing & ~wr);
+			}
+			pci_write_config_byte(pdev, pos, val);
+		} else {
+			ret = pci_read_config_byte(pdev, pos, &val);
+			if (ret < 0)
+				return ret;
+			if (copy_to_user(buf, &val, 1))
+				return -EFAULT;
+		}
+		return 0;
+	}
+
+	if (write) {
+		if (copy_from_user(&newval, buf, 1))
+			return -EFAULT;
+	}
+	/*
+	 * We get here if there are some virt bits;
+	 * handle any remaining real bits first.
+	 */
+	if (~virt) {
+		u8 rbits = (~virt) & wr;
+
+		ret = pci_read_config_byte(pdev, pos, &val);
+		if (ret < 0)
+			return ret;
+		if (write && rbits) {
+			val &= ~rbits;
+			newval &= rbits;
+			val |= newval;
+			pci_write_config_byte(pdev, pos, val);
+		}
+	}
+	/*
+	 * Now handle entirely virtual fields
+	 */
+	switch (cap) {
+	case PCI_CAP_ID_BASIC:		/* virtualize BARs */
+		switch (off) {
+		/*
+		 * vendor and device are virt because they don't
+		 * show up otherwise for sr-iov vfs
+		 */
+		case PCI_VENDOR_ID:
+			val = pdev->vendor;
+			break;
+		case PCI_VENDOR_ID + 1:
+			val = pdev->vendor >> 8;
+			break;
+		case PCI_DEVICE_ID:
+			val = pdev->device;
+			break;
+		case PCI_DEVICE_ID + 1:
+			val = pdev->device >> 8;
+			break;
+		case PCI_INTERRUPT_LINE:
+			if (write)
+				vdev->vinfo.intr = newval;
+			else
+				val = vdev->vinfo.intr;
+			break;
+		case PCI_ROM_ADDRESS:
+		case PCI_ROM_ADDRESS+1:
+		case PCI_ROM_ADDRESS+2:
+		case PCI_ROM_ADDRESS+3:
+			if (write) {
+				vdev->vinfo.rombar[off & 3] = newval;
+				vdev->vinfo.bardirty = 1;
+			} else {
+				if (vdev->vinfo.bardirty)
+					vfio_bar_fixup(vdev);
+				val = vdev->vinfo.rombar[off & 3];
+			}
+			break;
+		default:
+			if (off >= PCI_BASE_ADDRESS_0 &&
+			    off <= PCI_BASE_ADDRESS_5 + 3) {
+				int boff = off - PCI_BASE_ADDRESS_0;
+
+				if (write) {
+					vdev->vinfo.bar[boff] = newval;
+					vdev->vinfo.bardirty = 1;
+				} else {
+					if (vdev->vinfo.bardirty)
+						vfio_bar_fixup(vdev);
+					val = vdev->vinfo.bar[boff];
+				}
+			}
+			break;
+		}
+		break;
+	case PCI_CAP_ID_MSI:		/* virtualize MSI */
+		if (off >= PCI_MSI_ADDRESS_LO && off <= (PCI_MSI_DATA_64 + 2)) {
+			int moff = off - PCI_MSI_ADDRESS_LO;
+
+			if (write)
+				vdev->vinfo.msi[moff] = newval;
+			else
+				val = vdev->vinfo.msi[moff];
+			break;
+		}
+		break;
+	}
+	if (!write && copy_to_user(buf, &val, 1))
+		return -EFAULT;
+	return 0;
+}
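[Editor's note: both the non-virtualized and mixed virt/real write paths above merge the user-supplied byte into the current hardware byte under a per-bit write mask; the merge itself, in user space with invented values:]

```c
#include <assert.h>
#include <stdint.h>

/* Merge the user byte into the hardware byte, letting only bits set
 * in 'wr' change - the read-modify-write pattern in vfio_config_rwbyte. */
static uint8_t demo_merge(uint8_t hw, uint8_t user, uint8_t wr)
{
	return (hw & ~wr) | (user & wr);
}
```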
+
+ssize_t vfio_config_readwrite(int write,
+		struct vfio_dev *vdev,
+		char __user *buf,
+		size_t count,
+		loff_t *ppos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	int done = 0;
+	int ret;
+	int pos;
+
+	pci_block_user_cfg_access(pdev);
+
+	if (vdev->pci_config_map == NULL) {
+		ret = vfio_build_config_map(vdev);
+		if (ret < 0)
+			goto out;
+		vfio_virt_init(vdev);
+	}
+
+	while (count > 0) {
+		pos = *ppos;
+		if (pos == pdev->cfg_size)
+			break;
+		if (pos > pdev->cfg_size) {
+			ret = -EINVAL;
+			goto out;
+		}
+		ret = vfio_config_rwbyte(write, vdev, pos, buf);
+		if (ret < 0)
+			goto out;
+		buf++;
+		done++;
+		count--;
+		(*ppos)++;
+	}
+	ret = done;
+out:
+	pci_unblock_user_cfg_access(pdev);
+	return ret;
+}
diff -uprN linux-2.6.34/drivers/vfio/vfio_rdwr.c vfio-linux-2.6.34/drivers/vfio/vfio_rdwr.c
--- linux-2.6.34/drivers/vfio/vfio_rdwr.c	1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_rdwr.c	2010-05-28 14:27:40.000000000 -0700
@@ -0,0 +1,147 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ */
+
+#include <linux/fs.h>
+#include <linux/mmu_notifier.h>
+#include <linux/pci.h>
+#include <linux/uaccess.h>
+#include <linux/io.h>
+
+#include <linux/vfio.h>
+
+ssize_t vfio_io_readwrite(
+		int write,
+		struct vfio_dev *vdev,
+		char __user *buf,
+		size_t count,
+		loff_t *ppos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	size_t done = 0;
+	resource_size_t end;
+	void __iomem *io;
+	loff_t pos;
+	int pci_space;
+	int unit;
+
+	pci_space = vfio_offset_to_pci_space(*ppos);
+	pos = (*ppos & 0xFFFFFFFF);
+
+	if (vdev->bar[pci_space] == NULL)
+		vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
+	io = vdev->bar[pci_space];
+	if (io == NULL)
+		return -ENOMEM;
+	end = pci_resource_len(pdev, pci_space);
+	if (pos + count > end)
+		return -EINVAL;
+
+	while (count > 0) {
+		if ((pos % 4) == 0 && count >= 4) {
+			u32 val;
+
+			if (write) {
+				if (copy_from_user(&val, buf, 4))
+					return -EFAULT;
+				iowrite32(val, io + pos);
+			} else {
+				val = ioread32(io + pos);
+				if (copy_to_user(buf, &val, 4))
+					return -EFAULT;
+			}
+			unit = 4;
+		} else if ((pos % 2) == 0 && count >= 2) {
+			u16 val;
+
+			if (write) {
+				if (copy_from_user(&val, buf, 2))
+					return -EFAULT;
+				iowrite16(val, io + pos);
+			} else {
+				val = ioread16(io + pos);
+				if (copy_to_user(buf, &val, 2))
+					return -EFAULT;
+			}
+			unit = 2;
+		} else {
+			u8 val;
+
+			if (write) {
+				if (copy_from_user(&val, buf, 1))
+					return -EFAULT;
+				iowrite8(val, io + pos);
+			} else {
+				val = ioread8(io + pos);
+				if (copy_to_user(buf, &val, 1))
+					return -EFAULT;
+			}
+			unit = 1;
+		}
+		pos += unit;
+		buf += unit;
+		count -= unit;
+		done += unit;
+	}
+	*ppos += done;
+	return done;
+}
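[Editor's note: the loop above always moves with the widest naturally-aligned unit that the current position and remaining count allow; the unit-selection rule on its own, as a user-space sketch:]

```c
#include <assert.h>
#include <stddef.h>

/* Pick the access width used by vfio_io_readwrite(): 4 bytes when
 * dword-aligned with at least 4 left, else 2 when word-aligned with
 * at least 2 left, else 1. */
static int demo_unit(size_t pos, size_t count)
{
	if ((pos % 4) == 0 && count >= 4)
		return 4;
	if ((pos % 2) == 0 && count >= 2)
		return 2;
	return 1;
}
```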
+
+ssize_t vfio_mem_readwrite(
+		int write,
+		struct vfio_dev *vdev,
+		char __user *buf,
+		size_t count,
+		loff_t *ppos)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	resource_size_t end;
+	void __iomem *io;
+	loff_t pos;
+	int pci_space;
+
+	pci_space = vfio_offset_to_pci_space(*ppos);
+	pos = (*ppos & 0xFFFFFFFF);
+
+	if (vdev->bar[pci_space] == NULL)
+		vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
+	io = vdev->bar[pci_space];
+	if (io == NULL)
+		return -EINVAL;
+	end = pci_resource_len(pdev, pci_space);
+	if (pos > end)
+		return -EINVAL;
+	if (pos == end)
+		return 0;
+	if (pos + count > end)
+		count = end - pos;
+	if (write) {
+		if (copy_from_user(io + pos, buf, count))
+			return -EFAULT;
+	} else {
+		if (copy_to_user(buf, io + pos, count))
+			return -EFAULT;
+	}
+	*ppos += count;
+	return count;
+}
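
[Editorial note: the access-size selection in vfio_io_readwrite() above picks the widest naturally aligned unit that still fits the remaining byte count. That ladder can be lifted out and checked on its own; the helper names below are mine, not part of the patch:]

```c
#include <assert.h>
#include <stddef.h>

/* Widest naturally aligned access unit for the next transfer step,
 * mirroring the if/else ladder in vfio_io_readwrite(). */
static size_t access_unit(size_t pos, size_t count)
{
	if ((pos % 4) == 0 && count >= 4)
		return 4;
	if ((pos % 2) == 0 && count >= 2)
		return 2;
	return 1;
}

/* Total number of ioreadN/iowriteN operations needed for a transfer. */
static int transfer_ops(size_t pos, size_t count)
{
	int ops = 0;

	while (count > 0) {
		size_t u = access_unit(pos, count);

		pos += u;
		count -= u;
		ops++;
	}
	return ops;
}
```

For example, a 7-byte transfer starting at offset 1 splits into 1-, 2-, and 4-byte accesses, i.e. three I/O operations.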
diff -uprN linux-2.6.34/drivers/vfio/vfio_sysfs.c vfio-linux-2.6.34/drivers/vfio/vfio_sysfs.c
--- linux-2.6.34/drivers/vfio/vfio_sysfs.c	1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/drivers/vfio/vfio_sysfs.c	2010-05-28 14:04:34.000000000 -0700
@@ -0,0 +1,153 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ */
+
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/idr.h>
+#include <linux/pci.h>
+#include <linux/mmu_notifier.h>
+
+#include <linux/vfio.h>
+
+struct vfio_class *vfio_class;
+
+int vfio_class_init(void)
+{
+	int ret = 0;
+
+	if (vfio_class != NULL) {
+		kref_get(&vfio_class->kref);
+		goto exit;
+	}
+
+	vfio_class = kzalloc(sizeof(*vfio_class), GFP_KERNEL);
+	if (!vfio_class) {
+		ret = -ENOMEM;
+		goto err_kzalloc;
+	}
+
+	kref_init(&vfio_class->kref);
+	vfio_class->class = class_create(THIS_MODULE, "vfio");
+	if (IS_ERR(vfio_class->class)) {
+		ret = PTR_ERR(vfio_class->class);
+		printk(KERN_ERR "class_create failed for vfio\n");
+		goto err_class_create;
+	}
+	return 0;
+
+err_class_create:
+	kfree(vfio_class);
+	vfio_class = NULL;
+err_kzalloc:
+exit:
+	return ret;
+}
+
+static void vfio_class_release(struct kref *kref)
+{
+	/* Ok, we cheat as we know we only have one vfio_class */
+	class_destroy(vfio_class->class);
+	kfree(vfio_class);
+	vfio_class = NULL;
+}
+
+void vfio_class_destroy(void)
+{
+	if (vfio_class)
+		kref_put(&vfio_class->kref, vfio_class_release);
+}
+
+static ssize_t config_map_read(struct kobject *kobj,
+		struct bin_attribute *bin_attr,
+		char *buf, loff_t off, size_t count)
+{
+	struct vfio_dev *vdev = bin_attr->private;
+	int ret;
+
+	if (off >= 256)
+		return 0;
+	if (off + count > 256)
+		count = 256 - off;
+	if (vdev->pci_config_map == NULL) {
+		ret = vfio_build_config_map(vdev);
+		if (ret < 0)
+			return ret;
+	}
+	memcpy(buf, vdev->pci_config_map + off, count);
+	return count;
+}
+
+static ssize_t show_locked_pages(struct device *dev,
+				 struct device_attribute *attr,
+				 char *buf)
+{
+	struct vfio_dev *vdev = dev_get_drvdata(dev);
+
+	if (vdev == NULL)
+		return -ENODEV;
+	return sprintf(buf, "%u\n", vdev->locked_pages);
+}
+
+static DEVICE_ATTR(locked_pages, S_IRUGO, show_locked_pages, NULL);
+
+static struct attribute *vfio_attrs[] = {
+	&dev_attr_locked_pages.attr,
+	NULL,
+};
+
+static struct attribute_group vfio_attr_grp = {
+	.attrs = vfio_attrs,
+};
+
+static struct bin_attribute config_map_bin_attribute = {
+	.attr	= {
+		.name = "config_map",
+		.mode = S_IRUGO,
+	},
+	.size	= 256,
+	.read	= config_map_read,
+};
+
+int vfio_dev_add_attributes(struct vfio_dev *vdev)
+{
+	struct bin_attribute *bi;
+	int ret;
+
+	ret = sysfs_create_group(&vdev->dev->kobj, &vfio_attr_grp);
+	if (ret)
+		return ret;
+	bi = kmalloc(sizeof(*bi), GFP_KERNEL);
+	if (bi == NULL)
+		return -ENOMEM;
+	*bi = config_map_bin_attribute;
+	bi->private = vdev;
+	return sysfs_create_bin_file(&vdev->dev->kobj, bi);
+}
diff -uprN linux-2.6.34/include/linux/vfio.h vfio-linux-2.6.34/include/linux/vfio.h
--- linux-2.6.34/include/linux/vfio.h	1969-12-31 16:00:00.000000000 -0800
+++ vfio-linux-2.6.34/include/linux/vfio.h	2010-05-28 14:29:49.000000000 -0700
@@ -0,0 +1,193 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, pugs@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ *
+ * Portions derived from drivers/uio/uio.c:
+ * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
+ * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
+ * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
+ * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
+ *
+ * Portions derived from drivers/uio/uio_pci_generic.c:
+ * Copyright (C) 2009 Red Hat, Inc.
+ * Author: Michael S. Tsirkin <mst@redhat.com>
+ */
+
+/*
+ * VFIO driver - allow mapping and use of certain PCI devices
+ * in unprivileged user processes. (If IOMMU is present)
+ * Especially useful for Virtual Function parts of SR-IOV devices
+ */
+
+#ifdef __KERNEL__
+
+struct vfio_dev {
+	struct device	*dev;
+	struct pci_dev	*pdev;
+	u8		*pci_config_map;
+	int		pci_config_size;
+	char		name[8];
+	int		devnum;
+	int		pmaster;
+	void __iomem	*bar[PCI_ROM_RESOURCE+1];
+	spinlock_t	lock; /* guards command register accesses */
+	int		listeners;
+	int		mapcount;
+	u32		locked_pages;
+	struct mutex	gate;
+	struct msix_entry	*msix;
+	int			nvec;
+	struct iommu_domain	*domain;
+	int			cachec;
+	struct eventfd_ctx	*ev_irq;
+	struct eventfd_ctx	*ev_msi;
+	struct eventfd_ctx	**ev_msix;
+	struct {
+		u8	intr;
+		u8	bardirty;
+		u8	rombar[4];
+		u8	bar[6*4];
+		u8	msi[24];
+	} vinfo;
+};
+
+struct vfio_listener {
+	struct vfio_dev	*vdev;
+	struct list_head	dm_list;
+	struct mm_struct	*mm;
+	struct mmu_notifier	mmu_notifier;
+};
+
+/*
+ * Structure for keeping track of memory nailed down by the
+ * user for DMA
+ */
+struct dma_map_page {
+	struct list_head list;
+	struct page     **pages;
+	struct scatterlist *sg;
+	dma_addr_t      daddr;
+	unsigned long	vaddr;
+	int		npage;
+	int		rdwr;
+};
+
+/* VFIO class infrastructure */
+struct vfio_class {
+	struct kref kref;
+	struct class *class;
+};
+extern struct vfio_class *vfio_class;
+
+ssize_t vfio_io_readwrite(int, struct vfio_dev *,
+			char __user *, size_t, loff_t *);
+ssize_t vfio_mem_readwrite(int, struct vfio_dev *,
+			char __user *, size_t, loff_t *);
+ssize_t vfio_config_readwrite(int, struct vfio_dev *,
+			char __user *, size_t, loff_t *);
+
+void vfio_disable_msi(struct vfio_dev *);
+void vfio_disable_msix(struct vfio_dev *);
+int vfio_enable_msi(struct vfio_dev *, int);
+int vfio_enable_msix(struct vfio_dev *, int, void __user *);
+
+#ifndef PCI_MSIX_ENTRY_SIZE
+#define	PCI_MSIX_ENTRY_SIZE	16
+#endif
+#ifndef PCI_STATUS_INTERRUPT
+#define	PCI_STATUS_INTERRUPT	0x08
+#endif
+
+struct vfio_dma_map;
+void vfio_dma_unmapall(struct vfio_listener *);
+int vfio_dma_unmap_dm(struct vfio_listener *, struct vfio_dma_map *);
+int vfio_dma_map_common(struct vfio_listener *, unsigned int,
+			struct vfio_dma_map *);
+
+int vfio_class_init(void);
+void vfio_class_destroy(void);
+int vfio_dev_add_attributes(struct vfio_dev *);
+extern struct idr vfio_idr;
+extern struct mutex vfio_minor_lock;
+int vfio_build_config_map(struct vfio_dev *);
+
+irqreturn_t vfio_interrupt(int, void *);
+
+#endif	/* __KERNEL__ */
+
+/* Kernel & User level defines for ioctls */
+
+/*
+ * Structure for DMA mapping of user buffers
+ * vaddr, dmaaddr, and size must all be page aligned
+ * buffer may only be larger than 1 page if (a) there is
+ * an iommu in the system, or (b) buffer is part of a huge page
+ */
+struct vfio_dma_map {
+	__u64	vaddr;		/* process virtual addr */
+	__u64	dmaaddr;	/* desired and/or returned dma address */
+	__u64	size;		/* size in bytes */
+	int	rdwr;		/* bool: 0 for r/o; 1 for r/w */
+};
+
+/* map user pages at any dma address */
+#define	VFIO_DMA_MAP_ANYWHERE	_IOWR(';', 100, struct vfio_dma_map)
+
+/* map user pages at specific dma address */
+#define	VFIO_DMA_MAP_IOVA	_IOWR(';', 101, struct vfio_dma_map)
+
+/* unmap user pages */
+#define	VFIO_DMA_UNMAP		_IOW(';', 102, struct vfio_dma_map)
+
+/* set device DMA mask & master status */
+#define	VFIO_DMA_MASK		_IOW(';', 103, __u64)
+
+/* request IRQ interrupts; use given eventfd */
+#define	VFIO_EVENTFD_IRQ		_IOW(';', 104, int)
+
+/* request MSI interrupts; use given eventfd */
+#define	VFIO_EVENTFD_MSI		_IOW(';', 105, int)
+
+/* Request MSI-X interrupts: arg[0] is #, arg[1-n] are eventfds */
+#define	VFIO_EVENTFDS_MSIX	_IOW(';', 106, int)
+
+/* Get length of a BAR */
+#define	VFIO_BAR_LEN		_IOWR(';', 107, __u32)
+
+/*
+ * Reads, writes, and mmaps determine which PCI BAR (or config space)
+ * from the high level bits of the file offset
+ */
+#define	VFIO_PCI_BAR0_RESOURCE		0x0
+#define	VFIO_PCI_BAR1_RESOURCE		0x1
+#define	VFIO_PCI_BAR2_RESOURCE		0x2
+#define	VFIO_PCI_BAR3_RESOURCE		0x3
+#define	VFIO_PCI_BAR4_RESOURCE		0x4
+#define	VFIO_PCI_BAR5_RESOURCE		0x5
+#define	VFIO_PCI_ROM_RESOURCE		0x6
+#define	VFIO_PCI_CONFIG_RESOURCE	0xF
+#define	VFIO_PCI_SPACE_SHIFT	32
+#define VFIO_PCI_CONFIG_OFF vfio_pci_space_to_offset(VFIO_PCI_CONFIG_RESOURCE)
+
+static inline int vfio_offset_to_pci_space(__u64 off)
+{
+	return (off >> VFIO_PCI_SPACE_SHIFT) & 0xF;
+}
+
+static inline __u64 vfio_pci_space_to_offset(int sp)
+{
+	return (__u64)(sp) << VFIO_PCI_SPACE_SHIFT;
+}
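
[Editorial note: the two helpers above simply pack a 4-bit space index into bits 32..35 of the file offset. A userspace replica (kernel types swapped for <stdint.h>; names are mine) behaves as follows:]

```c
#include <assert.h>
#include <stdint.h>

#define VFIO_PCI_SPACE_SHIFT		32
#define VFIO_PCI_CONFIG_RESOURCE	0xF

/* Recover the BAR/config-space index from a file offset. */
static inline int offset_to_pci_space(uint64_t off)
{
	return (off >> VFIO_PCI_SPACE_SHIFT) & 0xF;
}

/* Build the base file offset for a given BAR/config-space index. */
static inline uint64_t pci_space_to_offset(int sp)
{
	return (uint64_t)sp << VFIO_PCI_SPACE_SHIFT;
}
```

A user driver would then do, e.g., `pread(vfio_fd, buf, len, pci_space_to_offset(2) + reg)` to read register `reg` of BAR 2 through one file descriptor.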
diff -uprN linux-2.6.34/MAINTAINERS vfio-linux-2.6.34/MAINTAINERS
--- linux-2.6.34/MAINTAINERS	2010-05-16 14:17:36.000000000 -0700
+++ vfio-linux-2.6.34/MAINTAINERS	2010-05-28 12:30:21.000000000 -0700
@@ -5968,6 +5968,13 @@ S:	Maintained
 F:	Documentation/fb/uvesafb.txt
 F:	drivers/video/uvesafb.*
 
+VFIO DRIVER
+M:	Tom Lyon <pugs@cisco.com>
+S:	Supported
+F:	Documentation/vfio.txt
+F:	drivers/vfio/
+F:	include/linux/vfio.h
+
 VFAT/FAT/MSDOS FILESYSTEM
 M:	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
 S:	Maintained

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-28 23:07 [PATCH] VFIO driver: Non-privileged user level PCI drivers Tom Lyon
@ 2010-05-28 23:36 ` Randy Dunlap
  2010-05-28 23:56 ` Randy Dunlap
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 66+ messages in thread
From: Randy Dunlap @ 2010-05-28 23:36 UTC (permalink / raw)
  To: Tom Lyon
  Cc: linux-kernel, kvm, chrisw, joro, hjk, mst, avi, gregkh, aafabbri,
	scofeldm

Hi,


On Fri, 28 May 2010 16:07:38 -0700 Tom Lyon wrote:

Missing diffstat -p1 -w 70:

 Documentation/vfio.txt         |  176 ++++++++
 MAINTAINERS                    |    7 
 drivers/Kconfig                |    2 
 drivers/Makefile               |    1 
 drivers/vfio/Kconfig           |    9 
 drivers/vfio/Makefile          |    5 
 drivers/vfio/vfio_dma.c        |  372 ++++++++++++++++++
 drivers/vfio/vfio_intrs.c      |  189 +++++++++
 drivers/vfio/vfio_main.c       |  627 +++++++++++++++++++++++++++++++
 drivers/vfio/vfio_pci_config.c |  554 +++++++++++++++++++++++++++
 drivers/vfio/vfio_rdwr.c       |  147 +++++++
 drivers/vfio/vfio_sysfs.c      |  153 +++++++
 include/linux/vfio.h           |  193 +++++++++
 13 files changed, 2435 insertions(+)


which shows that the patch is missing an update to
Documentation/ioctl/ioctl-number.txt for ioctl code ';'.  Please add that.


> diff -uprN linux-2.6.34/drivers/vfio/Kconfig vfio-linux-2.6.34/drivers/vfio/Kconfig
> --- linux-2.6.34/drivers/vfio/Kconfig	1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/Kconfig	2010-05-27 17:07:25.000000000 -0700
> @@ -0,0 +1,9 @@
> +menuconfig VFIO
> +	tristate "Non-Priv User Space PCI drivers"

	          Non-privileged

> +	depends on PCI
> +	help
> +	  Driver to allow advanced user space drivers for PCI, PCI-X,
> +	  and PCIe devices.  Requires IOMMU to allow non-privilged

	                                             non-privileged

> +	  processes to directly control the PCI devices.
> +
> +	  If you don't know what to do here, say N.


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-28 23:07 [PATCH] VFIO driver: Non-privileged user level PCI drivers Tom Lyon
  2010-05-28 23:36 ` Randy Dunlap
@ 2010-05-28 23:56 ` Randy Dunlap
  2010-05-29 11:55 ` Arnd Bergmann
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 66+ messages in thread
From: Randy Dunlap @ 2010-05-28 23:56 UTC (permalink / raw)
  To: Tom Lyon
  Cc: linux-kernel, kvm, chrisw, joro, hjk, mst, avi, gregkh, aafabbri,
	scofeldm

On Fri, 28 May 2010 16:07:38 -0700 Tom Lyon wrote:

> diff -uprN linux-2.6.34/Documentation/vfio.txt vfio-linux-2.6.34/Documentation/vfio.txt
> --- linux-2.6.34/Documentation/vfio.txt	1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/Documentation/vfio.txt	2010-05-28 14:03:05.000000000 -0700
> @@ -0,0 +1,176 @@
> +-------------------------------------------------------------------------------
> +The VFIO "driver" is used to allow privileged AND non-privileged processes to
> +implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> +devices.
> +
> +Why is this interesting?  Some applications, especially in the high performance
> +computing field, need access to hardware functions with as little overhead as
> +possible. Examples are in network adapters (typically non tcp/ip based) and

                                                         non-TCP/IP-based)

> +in compute accelerators - i.e., array processors, FPGA processors, etc.
> +Previous to the VFIO drivers these apps would need either a kernel-level
> +driver (with corrsponding overheads), or else root permissions to directly

                corresponding

> +access the hardware. The VFIO driver allows generic access to the hardware
> +from non-privileged apps IF the hardware is "well-behaved" enough for this
> +to be safe.
> +
> +While there have long been ways to implement user-level drivers using specific
> +corresponding drivers in the kernel, it was not until the introduction of the
> +UIO driver framework, and the uio_pci_generic driver that one could have a
> +generic kernel component supporting many types of user level drivers. However,
> +even with the uio_pci_generic driver, processes implementing the user level
> +drivers had to be trusted - they could do dangerous manipulation of DMA
> +addreses and were required to be root to write PCI configuration space
> +registers.
> +
> +Recent hardware technologies - I/O MMUs and PCI I/O Virtualization - provide
> +new hardware capabilities which the VFIO solution exploits to allow non-root
> +user level drivers. The main role of the IOMMU is to ensure that DMA accesses
> +from devices go only to the appropriate memory locations, this allows VFIO to

                                                  locations;

> +ensure that user level drivers do not corrupt inappropriate memory.  PCI I/O
> +virtualization (SR-IOV) was defined to allow "pass-through" of virtual devices
> +to guest virtual machines. VFIO in essence implements pass-through of devices
> +to user processes, not virtual machines.  SR-IOV devices implement a
> +traditional PCI device (the physical function) and a dynamic number of special
> +PCI devices (virtual functions) whose feature set is somewhat restricted - in
> +order to allow the operating system or virtual machine monitor to ensure the
> +safe operation of the system.
> +
> +Any SR-IOV virtual function meets the VFIO definition of "well-behaved", but
> +there are many other non-IOV PCI devices which also meet the defintion.
> +Elements of this definition are:
> +- The size of any memory BARs to be mmap'ed into the user process space must be
> +  a multiple of the system page size.
> +- If MSI-X interrupts are used, the device driver must not attempt to mmap or
> +  write the MSI-X vector area.
> +- If the device is a PCI device (not PCI-X or PCIe), it must conform to PCI
> +  revision 2.3 to allow its interrupts to be masked in a generic way.
> +- The device must not use the PCI configuration space in any non-standard way,
> +  i.e., the user level driver will be permitted only to read and write standard
> +  fields of the PCI config space, and only if those fields cannot cause harm to
> +  the system. In addition, some fields are "virtualized", so that the user
> +  driver can read/write them like a kernel driver, but they do not affect the
> +  real device.
> +- For now, there is no support for user access to the PCIe and PCI-X extended
> +  capabilities configuration space.
> +
> +Even with these restrictions, there are bound to be devices which are unsafe
> +for user level use - it is still up to the system admin to decide whether to
> +grant access to the device.  When the vfio module is loaded, it will have
> +access to no devices until the desired PCI devices are "bound" to the driver.
> +First, make sure the devices are not bound to another kernel driver. You can
> +unload that driver if you wish to unbind all its devices, or else enter the
> +driver's sysfs directory, and unbind a specific device:
> +	cd /sys/bus/pci/drivers/<drivername>
> +	echo 0000:06:02.00 > unbind
> +(The 0000:06:02.00 is a fully qualified PCI device name - different for each
> +device).  Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
> +write the PCI device type of the target device to the new_id file:
> +	echo 8086 10ca > new_id
> +(8086 10ca are the vendor and device type for the Intel 82576 virtual function
> +devices). A /dev/vfio<N> entry will be created for each device bound. The final
> +step is to grant users permission by changing the mode and/or owner of the /dev
> +entry - "chmod 666 /dev/vfio0".
> +
> +Reads & Writes:
> +
> +The user driver will typically use mmap to access the memory BAR(s) of a
> +device; the I/O BARs and the PCI config space may be accessed through normal
> +read and write system calls. Only 1 file descriptor is needed for all driver
> +functions -- the desired BAR for I/O, memory, or config space is indicated via
> +high-order bits of the file offset.  For instance, the following implements a
> +write to the PCI config space:
> +
> +	#include <linux/vfio.h>
> +	void pci_write_config_word(int pci_fd, u16 off, u16 wd)
> +	{
> +		off_t cfg_off = VFIO_PCI_CONFIG_OFF + off;
> +
> +		if (pwrite(pci_fd, &wd, 2, cfg_off) != 2)
> +			perror("pwrite config_dword");
> +	}
> +
> +The routines vfio_pci_space_to_offset and vfio_offset_to_pci_space are provided
> +in vfio.h to convert bar numbers to file offsets and vice-versa.

                        BAR

> +
> +Interrupts:
> +
> +Device interrupts are translated by the vfio driver into input events on event
> +notification file descriptors created by the eventfd system call. The user
> +program must one or more event descriptors and pass them to the vfio driver

           must ___ ?  missing word?

> +via ioctls to arrange for the interrupt mapping:
> +1.
> +	efd = eventfd(0, 0);
> +	ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
> +		This provides an eventfd for traditional IRQ interrupts.
> +		IRQs will be disable after each interrupt until the driver

		             disabled

> +		re-enables them via the PCI COMMAND register.
> +2.
> +	efd = eventfd(0, 0);
> +	ioctl(vfio_fd, VFIO_EVENTFD_MSI, &efd);
> +		This connects MSI interrupts to an eventfd.
> +3.
> + 	int arg[N+1];
> +	arg[0] = N;
> +	arg[1..N] = eventfd(0, 0);
> +	ioctl(vfio_fd, VFIO_EVENTFDS_MSIX, arg);
> +		This connects N MSI-X interrupts with N eventfds.
> +
> +Waiting and checking for interrupts is done by the user program by reads,
> +polls, or selects on the related event file descriptors.
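
[Editorial note: as a concrete sketch of the wait loop described above — each read() on the eventfd returns an 8-byte counter of events coalesced since the previous read. The helper name is mine, and here the "interrupt" is simulated via the eventfd's initial count rather than a real device:]

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Block until at least one interrupt has fired on efd; return how
 * many were coalesced since the previous read (0 on error). */
static uint64_t wait_for_irq(int efd)
{
	uint64_t count = 0;

	if (read(efd, &count, sizeof(count)) != sizeof(count))
		return 0;
	return count;
}
```

In a real driver the same fd would also be handed to poll()/select() alongside other event sources.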
> +
> +DMA:
> +
> +The VFIO driver uses ioctls to allow the user level driver to get DMA
> +addresses which correspond to virtual addresses.  In systems with IOMMUs,
> +each PCI device will have its own address space for DMA operations, so when
> +the user level driver programs the device registers, only addresses known to
> +the IOMMU will be valid, any others will be rejected.  The IOMMU creates the
> +illusion (to the device) that multi-page buffers are physically contiguous,
> +so a single DMA operation can safely span multiple user pages.  Note that
> +the VFIO driver is still useful in systems without IOMMUs, but only for
> +trusted processes which can deal with DMAs which do not span pages (Huge
> +pages count as a single page also).
> +
> +If the user process desires many DMA buffers, it may be wise to do a mapping
> +of a single large buffer, and then allocate the smaller buffers from the
> +large one.
> +
> +The DMA buffers are locked into physical memory for the duration of their
> +existence - until VFIO_DMA_UNMAP is called, until the user pages are
> +unmapped from the user process, or until the vfio file descriptor is closed.
> +The user process must have permission to lock the pages given by the ulimit(-l)
> +command, which in turn relies on settings in the /etc/security/limits.conf
> +file.
> +
> +The vfio_dma_map structure is used as an argument to the ioctls which
> +do the DMA mapping. Its vaddr, dmaaddr, and size fields must always be a
> +multiple of a page. Its rdwr field is zero for read-only (outbound), and
> +non-zero for read/write buffers.
> +
> +	struct vfio_dma_map {
> +		__u64	vaddr;	  /* process virtual addr */
> +		__u64	dmaaddr;  /* desired and/or returned dma address */
> +		__u64	size;	  /* size in bytes */
> +		int	rdwr;	  /* bool: 0 for r/o; 1 for r/w */
> +	};
> +
> +The VFIO_DMA_MAP_ANYWHERE is called with a vfio_dma_map structure as its
> +argument, and returns the structure with a valid dmaaddr field.
> +
> +The VFIO_DMA_MAP_IOVA is called with a vfio_dma_map structure with the
> +dmaaddr field already assigned. The system will attempt to map the DMA
> +buffer into the IO space at the givne dmaaddr. This is expected to be

                                   given

> +useful if KVM or other virtualization facilities use this driver.
> +
> +The VFIO_DMA_UNMAP takes a fully filled vfio_dma_map structure and unmaps
> +the buffer and releases the corresponding system resources.
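
[Editorial note: a hedged userspace sketch of preparing such a mapping request. The struct copy and helper name are mine; the actual `ioctl(vfio_fd, VFIO_DMA_MAP_ANYWHERE, &dm)` call needs a device bound to vfio, so it is only indicated in the comment:]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Local copy of the ioctl argument from the patch. */
struct vfio_dma_map {
	uint64_t vaddr;		/* process virtual addr */
	uint64_t dmaaddr;	/* desired and/or returned dma address */
	uint64_t size;		/* size in bytes */
	int	 rdwr;		/* bool: 0 for r/o; 1 for r/w */
};

/* Allocate a page-aligned buffer rounded up to a page multiple and
 * fill in the request; caller then issues VFIO_DMA_MAP_ANYWHERE. */
static int prep_dma_map(struct vfio_dma_map *dm, size_t bytes, int rdwr)
{
	long pg = sysconf(_SC_PAGESIZE);
	size_t sz = ((bytes + pg - 1) / pg) * pg;
	void *buf;

	if (posix_memalign(&buf, pg, sz))
		return -1;
	memset(buf, 0, sz);
	memset(dm, 0, sizeof(*dm));
	dm->vaddr = (uint64_t)(uintptr_t)buf;
	dm->size = sz;
	dm->rdwr = rdwr;
	return 0;
}
```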
> +
> +The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
> +(device dependent). It takes a single unsigned 64 bit integer as an argument.
> +This call also has the side effect on enabled PCI bus mastership.

eh?  I don't get that last sentence...

> +
> +Miscellaneous:
> +
> +The VFIO_BAR_LEN ioctl provides an easy way to determine the size of a PCI
> +device's base address region. It is passed a single integer specifying which
> +BAR (0-5 or 6 for ROM bar), and passes back the length in the same field.


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-28 23:07 [PATCH] VFIO driver: Non-privileged user level PCI drivers Tom Lyon
  2010-05-28 23:36 ` Randy Dunlap
  2010-05-28 23:56 ` Randy Dunlap
@ 2010-05-29 11:55 ` Arnd Bergmann
  2010-05-29 12:16   ` Avi Kivity
  2010-05-30 12:19 ` Michael S. Tsirkin
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 66+ messages in thread
From: Arnd Bergmann @ 2010-05-29 11:55 UTC (permalink / raw)
  To: Tom Lyon
  Cc: linux-kernel, kvm, chrisw, joro, hjk, mst, avi, gregkh, aafabbri,
	scofeldm

On Saturday 29 May 2010, Tom Lyon wrote:
> +/*
> + * Structure for DMA mapping of user buffers
> + * vaddr, dmaaddr, and size must all be page aligned
> + * buffer may only be larger than 1 page if (a) there is
> + * an iommu in the system, or (b) buffer is part of a huge page
> + */
> +struct vfio_dma_map {
> +       __u64   vaddr;          /* process virtual addr */
> +       __u64   dmaaddr;        /* desired and/or returned dma address */
> +       __u64   size;           /* size in bytes */
> +       int     rdwr;           /* bool: 0 for r/o; 1 for r/w */
> +};

Please add a 32-bit padding word at the end of this, otherwise the
size of the data structure is incompatible between 32-bit x86
applications and 64-bit kernels.

	Arnd

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-29 11:55 ` Arnd Bergmann
@ 2010-05-29 12:16   ` Avi Kivity
  0 siblings, 0 replies; 66+ messages in thread
From: Avi Kivity @ 2010-05-29 12:16 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, mst, gregkh,
	aafabbri, scofeldm

On 05/29/2010 02:55 PM, Arnd Bergmann wrote:
> On Saturday 29 May 2010, Tom Lyon wrote:
>    
>> +/*
>> + * Structure for DMA mapping of user buffers
>> + * vaddr, dmaaddr, and size must all be page aligned
>> + * buffer may only be larger than 1 page if (a) there is
>> + * an iommu in the system, or (b) buffer is part of a huge page
>> + */
>> +struct vfio_dma_map {
>> +       __u64   vaddr;          /* process virtual addr */
>> +       __u64   dmaaddr;        /* desired and/or returned dma address */
>> +       __u64   size;           /* size in bytes */
>> +       int     rdwr;           /* bool: 0 for r/o; 1 for r/w */
>> +};
>>      
> Please add a 32 bit padding word at the end of this, otherwise the
> size of the data structure is incompatible between 32 x86 applications
> and 64 bit kernels.
>    

Might as well call it 'flags' and reserve a bit more space (keeping 
64-bit aligned size) for future expansion.

rdwr can be folded into it.
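
[Editorial note: the ABI point can be made concrete. In the posted layout the trailing int leaves tail padding on 64-bit ABIs (while 32-bit x86, where __u64 is only 4-byte aligned, disagrees on sizeof); folding rdwr into a 64-bit flags word gives an identical 32-byte layout on both. A sketch, not the final ABI:]

```c
#include <assert.h>
#include <stdint.h>

/* As posted: sizeof differs between 32-bit x86 and 64-bit ABIs. */
struct vfio_dma_map_v1 {
	uint64_t vaddr, dmaaddr, size;
	int	 rdwr;
};

/* With the suggested change: rdwr becomes bit 0 of a flags word,
 * leaving room for future flags and a fixed 32-byte layout. */
#define VFIO_DMA_MAP_FLAG_WRITE	(1ULL << 0)
struct vfio_dma_map_v2 {
	uint64_t vaddr, dmaaddr, size;
	uint64_t flags;
};
```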

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-28 23:07 [PATCH] VFIO driver: Non-privileged user level PCI drivers Tom Lyon
                   ` (2 preceding siblings ...)
  2010-05-29 11:55 ` Arnd Bergmann
@ 2010-05-30 12:19 ` Michael S. Tsirkin
  2010-05-30 12:27   ` Avi Kivity
  2010-05-30 12:59 ` Avi Kivity
  2010-05-31 17:17 ` Alan Cox
  5 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-05-30 12:19 UTC (permalink / raw)
  To: Tom Lyon
  Cc: linux-kernel, kvm, chrisw, joro, hjk, avi, gregkh, aafabbri, scofeldm

On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote:
> The VFIO "driver" is used to allow privileged AND non-privileged processes to 
> implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> devices.
> 	Signed-off-by: Tom Lyon <pugs@cisco.com>
> ---
> This patch is the evolution of code which was first proposed as a patch to
> uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
> out of the uio framework, and things seem much cleaner. Of course, there is
> a lot of functional overlap with uio, but the previous version just seemed
> like a giant mode switch in the uio code that did not lead to clarity for
> either the new or old code.

IMO this was because this driver does two things: programming the iommu
and handling interrupts, while uio does only interrupt handling.
We could have moved the iommu / DMA programming to
a separate driver, and have uio work with it.
This would solve the limitation of the current driver
that it needs an iommu domain per device.

> [a pony for avi...]
> The major new functionality in this version is the ability to deal with
> PCI config space accesses (through read & write calls) - but includes table
> driven code to determine whats safe to write and what is not.

I don't really see why this is helpful: a driver written correctly
will not access these addresses, and we need an iommu anyway to protect
us against misbehaving drivers.

> Also, some
> virtualization of the config space to allow drivers to think they're writing
> some registers when they're not.

This emulation seems unlikely to be complete.
It can be done in userspace. What's the need for it in the kernel?

> Also, IO space accesses are also allowed.
> Drivers for devices which use MSI-X are now prevented from directly writing
> the MSI-X vector area.
> 
> All interrupts are now handled using eventfds, which makes things very simple.
> 
> The name VFIO refers to the Virtual Function capabilities of SR-IOV devices
> but the driver does support many more types of devices.  I was none too sure
> what driver directory this should live in, so for now I made up my own under
> drivers/vfio. As a new driver/new directory, who makes the commit decision?
> 
> I currently have user level drivers working for 3 different network adapters
> - the Cisco "Palo" enic, the Intel 82599 VF, and the Intel 82576 VF (but the
> whole user level framework is a long ways from release).  This driver could
> also clearly replace a number of other drivers written just to give user
> access to certain devices - but that will take time.
> 
> diff -uprN linux-2.6.34/Documentation/vfio.txt vfio-linux-2.6.34/Documentation/vfio.txt
> --- linux-2.6.34/Documentation/vfio.txt	1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/Documentation/vfio.txt	2010-05-28 14:03:05.000000000 -0700
> @@ -0,0 +1,176 @@
> +-------------------------------------------------------------------------------
> +The VFIO "driver" is used to allow privileged AND non-privileged processes to
> +implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> +devices.
> +
> +Why is this interesting?  Some applications, especially in the high performance
> +computing field, need access to hardware functions with as little overhead as
> +possible. Examples are in network adapters (typically non tcp/ip based) and
> +in compute accelerators - i.e., array processors, FPGA processors, etc.
> +Previous to the VFIO drivers these apps would need either a kernel-level
> +driver (with corrsponding overheads), or else root permissions to directly
> +access the hardware. The VFIO driver allows generic access to the hardware
> +from non-privileged apps IF the hardware is "well-behaved" enough for this
> +to be safe.
> +
> +While there have long been ways to implement user-level drivers using specific
> +corresponding drivers in the kernel, it was not until the introduction of the
> +UIO driver framework, and the uio_pci_generic driver that one could have a
> +generic kernel component supporting many types of user level drivers. However,
> +even with the uio_pci_generic driver, processes implementing the user level
> +drivers had to be trusted - they could do dangerous manipulation of DMA
> +addresses and were required to be root to write PCI configuration space
> +registers.
> +
> +Recent hardware technologies - I/O MMUs and PCI I/O Virtualization - provide
> +new hardware capabilities which the VFIO solution exploits to allow non-root
> +user level drivers. The main role of the IOMMU is to ensure that DMA accesses
> +from devices go only to the appropriate memory locations; this allows VFIO to
> +ensure that user level drivers do not corrupt inappropriate memory.  PCI I/O
> +virtualization (SR-IOV) was defined to allow "pass-through" of virtual devices
> +to guest virtual machines. VFIO in essence implements pass-through of devices
> +to user processes, not virtual machines.  SR-IOV devices implement a
> +traditional PCI device (the physical function) and a dynamic number of special
> +PCI devices (virtual functions) whose feature set is somewhat restricted - in
> +order to allow the operating system or virtual machine monitor to ensure the
> +safe operation of the system.
> +
> +Any SR-IOV virtual function meets the VFIO definition of "well-behaved", but
> +there are many other non-IOV PCI devices which also meet the definition.
> +Elements of this definition are:
> +- The size of any memory BARs to be mmap'ed into the user process space must be
> +  a multiple of the system page size.
> +- If MSI-X interrupts are used, the device driver must not attempt to mmap or
> +  write the MSI-X vector area.
> +- If the device is a PCI device (not PCI-X or PCIe), it must conform to PCI
> +  revision 2.3 to allow its interrupts to be masked in a generic way.
> +- The device must not use the PCI configuration space in any non-standard way,
> +  i.e., the user level driver will be permitted only to read and write standard
> +  fields of the PCI config space, and only if those fields cannot cause harm to
> +  the system. In addition, some fields are "virtualized", so that the user
> +  driver can read/write them like a kernel driver, but they do not affect the
> +  real device.
> +- For now, there is no support for user access to the PCIe and PCI-X extended
> +  capabilities configuration space.
> +
> +Even with these restrictions, there are bound to be devices which are unsafe
> +for user level use - it is still up to the system admin to decide whether to
> +grant access to the device.  When the vfio module is loaded, it will have
> +access to no devices until the desired PCI devices are "bound" to the driver.
> +First, make sure the devices are not bound to another kernel driver. You can
> +unload that driver if you wish to unbind all its devices, or else enter the
> +driver's sysfs directory, and unbind a specific device:
> +	cd /sys/bus/pci/drivers/<drivername>
> +	echo 0000:06:02.0 > unbind
> +(0000:06:02.0 is a fully qualified PCI device name - different for each
> +device).  Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
> +write the vendor and device IDs of the target device to the new_id file:
> +	echo 8086 10ca > new_id
> +(8086 10ca are the vendor and device IDs for the Intel 82576 virtual function
> +devices). A /dev/vfio<N> entry will be created for each device bound. The final
> +step is to grant users permission by changing the mode and/or owner of the /dev
> +entry - "chmod 666 /dev/vfio0".
> +
> +Reads & Writes:
> +
> +The user driver will typically use mmap to access the memory BAR(s) of a
> +device; the I/O BARs and the PCI config space may be accessed through normal
> +read and write system calls. Only one file descriptor is needed for all driver
> +functions -- the desired BAR for I/O, memory, or config space is indicated via
> +high-order bits of the file offset.  For instance, the following implements a
> +write to the PCI config space:
> +
> +	#include <linux/vfio.h>
> +	void pci_write_config_word(int pci_fd, u16 off, u16 wd)
> +	{
> +		off_t cfg_off = VFIO_PCI_CONFIG_OFF + off;
> +
> +		if (pwrite(pci_fd, &wd, 2, cfg_off) != 2)
> +			perror("pwrite config_word");
> +	}
> +
> +The routines vfio_pci_space_to_offset and vfio_offset_to_pci_space are provided
> +in vfio.h to convert BAR numbers to file offsets and vice versa.
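For illustration, the space-in-the-high-bits encoding could be sketched as below. The shift value here is an assumption made up for this sketch; the real constants and conversion routines come from linux/vfio.h.

```c
#include <stdint.h>

/* Hypothetical encoding: the PCI space selector (BAR number or config
 * space) lives in the high-order bits of the 64-bit file offset; the
 * low bits hold the offset within that space.  The shift is assumed. */
#define VFIO_PCI_SPACE_SHIFT 32ULL

static uint64_t pci_space_to_offset(int space)
{
	/* build a file offset addressing the start of the given space */
	return (uint64_t)space << VFIO_PCI_SPACE_SHIFT;
}

static int offset_to_pci_space(uint64_t offset)
{
	/* recover which space a file offset refers to */
	return (int)(offset >> VFIO_PCI_SPACE_SHIFT);
}
```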
> +
> +Interrupts:
> +
> +Device interrupts are translated by the vfio driver into input events on event
> +notification file descriptors created by the eventfd system call. The user
> +program must create one or more event descriptors and pass them to the vfio
> +driver via ioctls to arrange for the interrupt mapping:
> +1.
> +	efd = eventfd(0, 0);
> +	ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
> +		This provides an eventfd for traditional IRQ interrupts.
> +		IRQs will be disabled after each interrupt until the driver
> +		re-enables them via the PCI COMMAND register.
> +2.
> +	efd = eventfd(0, 0);
> +	ioctl(vfio_fd, VFIO_EVENTFD_MSI, &efd);
> +		This connects MSI interrupts to an eventfd.
> +3.
> + 	int arg[N+1];
> +	arg[0] = N;
> +	arg[1..N] = eventfd(0, 0);
> +	ioctl(vfio_fd, VFIO_EVENTFDS_MSIX, arg);
> +		This connects N MSI-X interrupts with N eventfds.
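A sketch of step 3's array packing, assuming a hypothetical helper name (pack_msix_eventfds is not part of vfio.h):

```c
#include <sys/eventfd.h>

/* Pack the argument array for the quoted VFIO_EVENTFDS_MSIX ioctl:
 * arg[0] holds N, arg[1..N] hold freshly created eventfds. */
static int pack_msix_eventfds(int *arg, int nvec)
{
	int i;

	arg[0] = nvec;
	for (i = 1; i <= nvec; i++) {
		arg[i] = eventfd(0, 0);
		if (arg[i] < 0)
			return -1;	/* eventfd creation failed */
	}
	return 0;
}
```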
> +
> +Waiting and checking for interrupts is done by the user program by reads,
> +polls, or selects on the related event file descriptors.
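For example, a blocking wait reduces to a plain read of the 8-byte eventfd counter (a sketch; wait_for_interrupts is a made-up helper name):

```c
#include <stdint.h>
#include <unistd.h>

/* Block until at least one interrupt has fired, then return how many
 * were signalled since the last read (eventfd counter semantics). */
static uint64_t wait_for_interrupts(int efd)
{
	uint64_t count;

	if (read(efd, &count, sizeof(count)) != sizeof(count))
		return 0;
	return count;
}
```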
> +
> +DMA:
> +
> +The VFIO driver uses ioctls to allow the user level driver to get DMA
> +addresses which correspond to virtual addresses.  In systems with IOMMUs,
> +each PCI device will have its own address space for DMA operations, so when
> +the user level driver programs the device registers, only addresses known to
> +the IOMMU will be valid; any others will be rejected.  The IOMMU creates the
> +illusion (to the device) that multi-page buffers are physically contiguous,
> +so a single DMA operation can safely span multiple user pages.  Note that
> +the VFIO driver is still useful in systems without IOMMUs, but only for
> +trusted processes which can deal with DMAs which do not span pages (huge
> +pages also count as a single page).

IMO we definitely do not want to enable this non-IOMMU mode of operation.
If you are writing a driver that can DMA into arbitrary memory,
you should do it in the kernel.


> +If the user process desires many DMA buffers, it may be wise to do a mapping
> +of a single large buffer, and then allocate the smaller buffers from the
> +large one.
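One way to carve small buffers out of a single large mapping is a trivial bump allocator; this sketch (dma_pool_alloc is a hypothetical helper, nothing like it is provided by vfio) keeps sub-buffers aligned:

```c
#include <stddef.h>
#include <stdint.h>

/* Trivial bump allocator over one large DMA-mapped region. */
struct dma_pool {
	uint8_t *base;	/* start of the large mapped buffer */
	size_t size;	/* total bytes in the region */
	size_t used;	/* bytes handed out so far */
};

static void *dma_pool_alloc(struct dma_pool *p, size_t len, size_t align)
{
	/* round the current watermark up to the requested alignment
	 * (align must be a power of two) */
	size_t off = (p->used + align - 1) & ~(align - 1);

	if (off + len > p->size)
		return NULL;	/* region exhausted */
	p->used = off + len;
	return p->base + off;
}
```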
> +
> +The DMA buffers are locked into physical memory for the duration of their
> +existence - until VFIO_DMA_UNMAP is called, until the user pages are
> +unmapped from the user process, or until the vfio file descriptor is closed.
> +The user process must have permission to lock the pages, given by the "ulimit -l"
> +command, which in turn relies on settings in the /etc/security/limits.conf
> +file.

I think it's better to have userspace handle this by calling mlock.
To protect against buggy userspace,
you can detect when a page is unmapped from a process and
remove it from the IOMMU.


> +
> +The vfio_dma_map structure is used as an argument to the ioctls which
> +do the DMA mapping. Its vaddr, dmaaddr, and size fields must always be
> +multiples of the page size. Its rdwr field is zero for read-only (outbound)
> +buffers, and non-zero for read/write buffers.
> +
> +	struct vfio_dma_map {
> +		__u64	vaddr;	  /* process virtual addr */
> +		__u64	dmaaddr;  /* desired and/or returned dma address */
> +		__u64	size;	  /* size in bytes */
> +		int	rdwr;	  /* bool: 0 for r/o; 1 for r/w */
> +	};
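Since the ioctls reject arguments that are not page-multiples, a caller can pre-validate them; a minimal check (hypothetical helper, using the system page size):

```c
#include <stdint.h>
#include <unistd.h>

/* Pre-validate vfio_dma_map arguments: vaddr and size must be
 * page-aligned and size must be non-zero. */
static int dma_map_args_valid(uint64_t vaddr, uint64_t size)
{
	uint64_t pg = (uint64_t)sysconf(_SC_PAGESIZE);

	return size != 0 && (vaddr % pg) == 0 && (size % pg) == 0;
}
```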
> +
> +The VFIO_DMA_MAP_ANYWHERE ioctl is called with a vfio_dma_map structure as its
> +argument, and returns the structure with a valid dmaaddr field.
> +
> +The VFIO_DMA_MAP_IOVA ioctl is called with a vfio_dma_map structure with the
> +dmaaddr field already assigned. The system will attempt to map the DMA
> +buffer into the IO space at the given dmaaddr. This is expected to be
> +useful if KVM or other virtualization facilities use this driver.
> +
> +The VFIO_DMA_UNMAP ioctl takes a fully filled vfio_dma_map structure and unmaps
> +the buffer and releases the corresponding system resources.
> +
> +The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
> +(device dependent). It takes a single unsigned 64-bit integer as an argument.
> +This call also has the side effect of enabling PCI bus mastership.
> +
> +Miscellaneous:
> +
> +The VFIO_BAR_LEN ioctl provides an easy way to determine the size of a PCI
> +device's base address region. It is passed a single integer specifying which
> +BAR (0-5, or 6 for the ROM BAR), and returns the length in the same field.
> diff -uprN linux-2.6.34/drivers/Kconfig vfio-linux-2.6.34/drivers/Kconfig
> --- linux-2.6.34/drivers/Kconfig	2010-05-16 14:17:36.000000000 -0700
> +++ vfio-linux-2.6.34/drivers/Kconfig	2010-05-27 17:01:02.000000000 -0700
> @@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
>  source "drivers/staging/Kconfig"
>  
>  source "drivers/platform/Kconfig"
> +
> +source "drivers/vfio/Kconfig"
>  endmenu
> diff -uprN linux-2.6.34/drivers/Makefile vfio-linux-2.6.34/drivers/Makefile
> --- linux-2.6.34/drivers/Makefile	2010-05-16 14:17:36.000000000 -0700
> +++ vfio-linux-2.6.34/drivers/Makefile	2010-05-27 17:25:33.000000000 -0700
> @@ -52,6 +52,7 @@ obj-$(CONFIG_FUSION)		+= message/
>  obj-$(CONFIG_FIREWIRE)		+= firewire/
>  obj-y				+= ieee1394/
>  obj-$(CONFIG_UIO)		+= uio/
> +obj-$(CONFIG_VFIO)		+= vfio/
>  obj-y				+= cdrom/
>  obj-y				+= auxdisplay/
>  obj-$(CONFIG_PCCARD)		+= pcmcia/
> diff -uprN linux-2.6.34/drivers/vfio/Kconfig vfio-linux-2.6.34/drivers/vfio/Kconfig
> --- linux-2.6.34/drivers/vfio/Kconfig	1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/Kconfig	2010-05-27 17:07:25.000000000 -0700
> @@ -0,0 +1,9 @@
> +menuconfig VFIO
> +	tristate "Non-Priv User Space PCI drivers"
> +	depends on PCI
> +	help
> +	  Driver to allow advanced user space drivers for PCI, PCI-X,
> +	  and PCIe devices.  Requires an IOMMU to allow non-privileged
> +	  processes to directly control the PCI devices.
> +
> +	  If you don't know what to do here, say N.
> diff -uprN linux-2.6.34/drivers/vfio/Makefile vfio-linux-2.6.34/drivers/vfio/Makefile
> --- linux-2.6.34/drivers/vfio/Makefile	1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/Makefile	2010-05-27 17:32:35.000000000 -0700
> @@ -0,0 +1,5 @@
> +obj-$(CONFIG_VFIO) := vfio.o
> +
> +vfio-y := vfio_main.o vfio_dma.o vfio_intrs.o \
> +		vfio_pci_config.o vfio_rdwr.o vfio_sysfs.o
> +
> diff -uprN linux-2.6.34/drivers/vfio/vfio_dma.c vfio-linux-2.6.34/drivers/vfio/vfio_dma.c
> --- linux-2.6.34/drivers/vfio/vfio_dma.c	1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/vfio_dma.c	2010-05-28 14:04:04.000000000 -0700
> @@ -0,0 +1,372 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/pci.h>
> +#include <linux/mm.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/iommu.h>
> +#include <linux/sched.h>
> +
> +#include <linux/vfio.h>
> +
> +/* Unmap DMA region */
> +static void vfio_dma_unmap(struct vfio_listener *listener,
> +			struct dma_map_page *mlp)
> +{
> +	int i;
> +	struct vfio_dev *vdev = listener->vdev;
> +	struct pci_dev *pdev = vdev->pdev;
> +
> +	mutex_lock(&vdev->gate);
> +	list_del(&mlp->list);
> +	if (mlp->sg) {
> +		dma_unmap_sg(&pdev->dev, mlp->sg, mlp->npage,
> +				DMA_BIDIRECTIONAL);
> +		kfree(mlp->sg);
> +	} else {
> +		for (i = 0; i < mlp->npage; i++)
> +			(void) iommu_unmap_range(vdev->domain,
> +					mlp->daddr + i*PAGE_SIZE, PAGE_SIZE);
> +	}
> +	for (i = 0; i < mlp->npage; i++) {
> +		if (mlp->rdwr)
> +			SetPageDirty(mlp->pages[i]);
> +		put_page(mlp->pages[i]);
> +	}
> +	listener->mm->locked_vm -= mlp->npage;
> +	vdev->locked_pages -= mlp->npage;
> +	kfree(mlp->pages);
> +	kfree(mlp);
> +	vdev->mapcount--;
> +	mutex_unlock(&vdev->gate);
> +}
> +
> +/* Unmap ALL DMA regions */
> +void vfio_dma_unmapall(struct vfio_listener *listener)
> +{
> +	struct list_head *pos, *pos2;
> +	struct dma_map_page *mlp;
> +
> +	list_for_each_safe(pos, pos2, &listener->dm_list) {
> +		mlp = list_entry(pos, struct dma_map_page, list);
> +		vfio_dma_unmap(listener, mlp);
> +	}
> +}
> +
> +int vfio_dma_unmap_dm(struct vfio_listener *listener, struct vfio_dma_map *dmp)
> +{
> +	unsigned long start, npage;
> +	struct dma_map_page *mlp;
> +	struct list_head *pos, *pos2;
> +	int ret;
> +
> +	start = dmp->vaddr & PAGE_MASK;
> +	npage = dmp->size >> PAGE_SHIFT;
> +
> +	ret = -ENXIO;
> +	list_for_each_safe(pos, pos2, &listener->dm_list) {
> +		mlp = list_entry(pos, struct dma_map_page, list);
> +		if (dmp->vaddr != mlp->vaddr || mlp->npage != npage)
> +			continue;
> +		ret = 0;
> +		vfio_dma_unmap(listener, mlp);
> +		break;
> +	}
> +	return ret;
> +}
> +
> +/* Handle MMU notifications - user process freed or realloced memory
> + * which may be in use in a DMA region. Clean up region if so.
> + */
> +static void vfio_dma_handle_mmu_notify(struct mmu_notifier *mn,
> +		unsigned long start, unsigned long end)
> +{
> +	struct vfio_listener *listener;
> +	unsigned long myend;
> +	struct list_head *pos, *pos2;
> +	struct dma_map_page *mlp;
> +
> +	listener = container_of(mn, struct vfio_listener, mmu_notifier);
> +	list_for_each_safe(pos, pos2, &listener->dm_list) {
> +		mlp = list_entry(pos, struct dma_map_page, list);
> +		if (mlp->vaddr >= end)
> +			continue;
> +		/*
> +		 * Ranges overlap if they're not disjoint; and they're
> +		 * disjoint if the end of one is before the start of
> +		 * the other one.
> +		 */
> +		myend = mlp->vaddr + (mlp->npage << PAGE_SHIFT) - 1;
> +		if (!(myend <= start || end <= mlp->vaddr)) {
> +			printk(KERN_WARNING
> +				"%s: demap start %lx end %lx va %lx pa %lx\n",
> +				__func__, start, end,
> +				mlp->vaddr, (long)mlp->daddr);
> +			vfio_dma_unmap(listener, mlp);
> +		}
> +	}
> +}
> +
> +static void vfio_dma_inval_page(struct mmu_notifier *mn,
> +		struct mm_struct *mm, unsigned long addr)
> +{
> +	vfio_dma_handle_mmu_notify(mn, addr, addr + PAGE_SIZE);
> +}
> +
> +static void vfio_dma_inval_range_start(struct mmu_notifier *mn,
> +		struct mm_struct *mm, unsigned long start, unsigned long end)
> +{
> +	vfio_dma_handle_mmu_notify(mn, start, end);
> +}
> +
> +static const struct mmu_notifier_ops vfio_dma_mmu_notifier_ops = {
> +	.invalidate_page = vfio_dma_inval_page,
> +	.invalidate_range_start = vfio_dma_inval_range_start,
> +};
> +
> +/*
> + * Map usr buffer at specific IO virtual address
> + */
> +static int vfio_dma_map_iova(
> +		struct vfio_listener *listener,
> +		unsigned long start_iova,
> +		struct page **pages,
> +		int npage,
> +		int rdwr,
> +		struct dma_map_page **mlpp)
> +{
> +	struct vfio_dev *vdev = listener->vdev;
> +	struct pci_dev *pdev = vdev->pdev;
> +	int ret;
> +	int i;
> +	phys_addr_t hpa;
> +	struct dma_map_page *mlp;
> +	unsigned long iova = start_iova;
> +
> +	if (vdev->domain == NULL) {
> +		/* can't mix iova with anywhere */
> +		if (vdev->mapcount > 0)
> +			return -EINVAL;
> +		if (!iommu_found())
> +			return -EINVAL;
> +		vdev->domain = iommu_domain_alloc();

So there's a domain per device? Since a domain uses resources,
and since a single application is likely to use multiple devices,
I think it is better to enable sharing a domain between them.

> +		if (vdev->domain == NULL)
> +			return -ENXIO;

An ioctl to attach/detach from domain explicitly would be cleaner.

> +		vdev->cachec = iommu_domain_has_cap(vdev->domain,
> +					IOMMU_CAP_CACHE_COHERENCY);
> +		ret = iommu_attach_device(vdev->domain, &pdev->dev);
> +		if (ret) {
> +			iommu_domain_free(vdev->domain);
> +			vdev->domain = NULL;
> +			printk(KERN_ERR "%s: device_attach failed %d\n",
> +					__func__, ret);
> +			return ret;
> +		}
> +	}
> +	for (i = 0; i < npage; i++) {
> +		if (iommu_iova_to_phys(vdev->domain, iova + i*PAGE_SIZE))
> +			return -EBUSY;
> +	}
> +	rdwr = rdwr ? IOMMU_READ|IOMMU_WRITE : IOMMU_READ;
> +	if (vdev->cachec)
> +		rdwr |= IOMMU_CACHE;
> +	for (i = 0; i < npage; i++) {
> +		hpa = page_to_phys(pages[i]);
> +		ret = iommu_map_range(vdev->domain, iova,
> +			hpa, PAGE_SIZE, rdwr);
> +		if (ret) {
> +			while (--i >= 0) {
> +				iova -= PAGE_SIZE;
> +				(void) iommu_unmap_range(vdev->domain,
> +							iova, PAGE_SIZE);
> +			}
> +			return ret;
> +		}
> +		iova += PAGE_SIZE;
> +	}
> +
> +	mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> +	mlp->pages = pages;
> +	mlp->daddr = start_iova;
> +	mlp->npage = npage;
> +	*mlpp = mlp;
> +	return 0;
> +}
> +
> +/*
> + * Map user buffer - return IO virtual address
> + */
> +static int vfio_dma_map_anywhere(
> +		struct vfio_listener *listener,
> +		struct page **pages,
> +		int npage,
> +		int rdwr,
> +		struct dma_map_page **mlpp)
> +{
> +	struct vfio_dev *vdev = listener->vdev;
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct scatterlist *sg, *nsg;
> +	int i, nents;
> +	struct dma_map_page *mlp;
> +	unsigned long length;
> +
> +	if (vdev->domain) {
> +		/* map anywhere and map iova don't mix */
> +		if (vdev->mapcount > 0)
> +			return -EINVAL;
> +		iommu_domain_free(vdev->domain);
> +		vdev->domain = NULL;
> +	}
> +	sg = kzalloc(npage * sizeof(struct scatterlist), GFP_KERNEL);
> +	if (sg == NULL)
> +		return -ENOMEM;
> +	for (i = 0; i < npage; i++)
> +		sg_set_page(&sg[i], pages[i], PAGE_SIZE, 0);
> +	nents = dma_map_sg(&pdev->dev, sg, npage,
> +			rdwr ? DMA_BIDIRECTIONAL : DMA_TO_DEVICE);
> +	/* The API for dma_map_sg suggests that it may squash together
> +	 * adjacent pages, but no one seems to really do that. So we squash
> +	 * it ourselves, because the user level wants a single buffer.
> +	 * This works if (a) there is an iommu, or (b) the user allocates
> +	 * large buffers from a huge page
> +	 */
> +	nsg = sg;
> +	for (i = 1; i < nents; i++) {
> +		length = sg[i].dma_length;
> +		sg[i].dma_length = 0;
> +		if (sg[i].dma_address == (nsg->dma_address + nsg->dma_length)) {
> +			nsg->dma_length += length;
> +		} else {
> +			nsg++;
> +			nsg->dma_address = sg[i].dma_address;
> +			nsg->dma_length = length;
> +		}
> +	}
> +	nents = 1 + (nsg - sg);
> +	if (nents != 1) {
> +		if (nents > 0)
> +			dma_unmap_sg(&pdev->dev, sg, npage,
> +					DMA_BIDIRECTIONAL);
> +		for (i = 0; i < npage; i++)
> +			put_page(pages[i]);
> +		kfree(sg);
> +		printk(KERN_ERR "%s: sequential dma mapping failed\n",
> +				__func__);
> +		return -EFAULT;
> +	}
> +
> +	mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> +	mlp->pages = pages;
> +	mlp->sg = sg;
> +	mlp->daddr = sg_dma_address(sg);
> +	mlp->npage = npage;
> +	*mlpp = mlp;
> +	return 0;
> +}
> +
> +int vfio_dma_map_common(struct vfio_listener *listener,
> +		unsigned int cmd, struct vfio_dma_map *dmp)
> +{
> +	int locked, lock_limit;
> +	struct page **pages;
> +	int npage;
> +	struct dma_map_page *mlp = NULL;
> +	int ret = 0;
> +
> +	if (dmp->vaddr & (PAGE_SIZE-1))
> +		return -EINVAL;
> +	if (dmp->size & (PAGE_SIZE-1))
> +		return -EINVAL;
> +	if (dmp->size <= 0)
> +		return -EINVAL;
> +	npage = dmp->size >> PAGE_SHIFT;
> +
> +	mutex_lock(&listener->vdev->gate);
> +
> +	/* account for locked pages */
> +	locked = npage + current->mm->locked_vm;
> +	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
> +			>> PAGE_SHIFT;
> +	if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> +		printk(KERN_WARNING "%s: RLIMIT_MEMLOCK exceeded\n",
> +			__func__);
> +		ret = -ENOMEM;
> +		goto out_lock;
> +	}

It says 'account' but does not seem to change anything?
Also, this seems racy: userspace
can do mlock after you checked the limit.

> +	/* only 1 address space per fd */
> +	if (current->mm != listener->mm) {
> +		if (listener->mm != NULL)
> +			return -EINVAL;

return with lock held

> +		listener->mm = current->mm;
> +		listener->mmu_notifier.ops = &vfio_dma_mmu_notifier_ops;
> +		ret = mmu_notifier_register(&listener->mmu_notifier,
> +						listener->mm);
> +		if (ret)
> +			printk(KERN_ERR "%s: mmu_notifier_register failed %d\n",
> +				__func__, ret);

debugging code?

> +		ret = 0;
> +	}
> +
> +	pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
> +	if (pages == NULL) {
> +		ret = -ENOMEM;
> +		goto out_lock;
> +	}
> +	ret = get_user_pages_fast(dmp->vaddr, npage, dmp->rdwr, pages);
> +	if (ret != npage) {
> +		printk(KERN_ERR "%s: get_user_pages_fast returns %d, not %d\n",
> +			__func__, ret, npage);

above too.

> +		kfree(pages);
> +		ret = -EFAULT;
> +		goto out_lock;
> +	}
> +
> +	if (cmd == VFIO_DMA_MAP_IOVA)
> +		ret = vfio_dma_map_iova(listener, dmp->dmaaddr,
> +				pages, npage, dmp->rdwr, &mlp);
> +	else
> +		ret = vfio_dma_map_anywhere(listener, pages,
> +				npage, dmp->rdwr, &mlp);
> +	if (ret) {
> +		kfree(pages);
> +		goto out_lock;
> +	}
> +	listener->vdev->mapcount++;
> +	mlp->vaddr = dmp->vaddr;
> +	mlp->rdwr = dmp->rdwr;
> +	dmp->dmaaddr = mlp->daddr;
> +	list_add(&mlp->list, &listener->dm_list);
> +
> +	current->mm->locked_vm += npage;
> +	listener->vdev->locked_pages += npage;
> +out_lock:
> +	mutex_unlock(&listener->vdev->gate);
> +	return ret;
> +}
> diff -uprN linux-2.6.34/drivers/vfio/vfio_intrs.c vfio-linux-2.6.34/drivers/vfio/vfio_intrs.c
> --- linux-2.6.34/drivers/vfio/vfio_intrs.c	1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/vfio_intrs.c	2010-05-28 14:09:15.000000000 -0700
> @@ -0,0 +1,189 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + */
> +
> +#include <linux/device.h>
> +#include <linux/interrupt.h>
> +#include <linux/eventfd.h>
> +#include <linux/pci.h>
> +#include <linux/mmu_notifier.h>
> +
> +#include <linux/vfio.h>
> +
> +
> +/*
> + * vfio_interrupt - IRQ hardware interrupt handler
> + */
> +irqreturn_t vfio_interrupt(int irq, void *dev_id)
> +{
> +	struct vfio_dev *vdev = (struct vfio_dev *)dev_id;
> +	struct pci_dev *pdev = vdev->pdev;
> +	irqreturn_t ret = IRQ_NONE;
> +	u32 cmd_status_dword;
> +	u16 origcmd, newcmd, status;
> +
> +	spin_lock_irq(&vdev->lock);
> +	pci_block_user_cfg_access(pdev);
> +
> +	/* Read both command and status registers in a single 32-bit operation.
> +	 * Note: we could cache the value for command and move the status read
> +	 * out of the lock if there was a way to get notified of user changes
> +	 * to command register through sysfs. Should be good for shared irqs. */
> +	pci_read_config_dword(pdev, PCI_COMMAND, &cmd_status_dword);
> +	origcmd = cmd_status_dword;
> +	status = cmd_status_dword >> 16;
> +
> +	/* Check interrupt status register to see whether our device
> +	 * triggered the interrupt. */
> +	if (!(status & PCI_STATUS_INTERRUPT))
> +		goto done;
> +
> +	/* We triggered the interrupt, disable it. */
> +	newcmd = origcmd | PCI_COMMAND_INTX_DISABLE;
> +	if (newcmd != origcmd)
> +		pci_write_config_word(pdev, PCI_COMMAND, newcmd);
> +
> +	ret = IRQ_HANDLED;
> +done:
> +	pci_unblock_user_cfg_access(pdev);
> +	spin_unlock_irq(&vdev->lock);
> +	if (ret != IRQ_HANDLED)
> +		return ret;
> +	if (vdev->ev_irq)
> +		eventfd_signal(vdev->ev_irq, 1);
> +	return ret;
> +}
> +
> +/*
> + * MSI and MSI-X Interrupt handler.
> + * Just signal an event
> + */
> +static irqreturn_t msihandler(int irq, void *arg)
> +{
> +	struct eventfd_ctx *ctx = arg;
> +
> +	eventfd_signal(ctx, 1);
> +	return IRQ_HANDLED;
> +}
> +
> +void vfio_disable_msi(struct vfio_dev *vdev)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +
> +	if (vdev->ev_msi) {
> +		eventfd_ctx_put(vdev->ev_msi);
> +		free_irq(pdev->irq, vdev->ev_msi);
> +		vdev->ev_msi = NULL;
> +	}
> +	pci_disable_msi(pdev);
> +}
> +
> +int vfio_enable_msi(struct vfio_dev *vdev, int fd)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct eventfd_ctx *ctx;
> +	int ret;
> +
> +	ctx = eventfd_ctx_fdget(fd);
> +	if (IS_ERR(ctx))
> +		return PTR_ERR(ctx);
> +	vdev->ev_msi = ctx;
> +	pci_enable_msi(pdev);
> +	ret = request_irq(pdev->irq, msihandler, 0,
> +			vdev->name, ctx);
> +	if (ret) {
> +		eventfd_ctx_put(ctx);
> +		pci_disable_msi(pdev);
> +		vdev->ev_msi = NULL;
> +	}
> +	return ret;
> +}
> +
> +void vfio_disable_msix(struct vfio_dev *vdev)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	int i;
> +
> +	if (vdev->ev_msix && vdev->msix) {
> +		for (i = 0; i < vdev->nvec; i++) {
> +			free_irq(vdev->msix[i].vector, vdev->ev_msix[i]);
> +			if (vdev->ev_msix[i])
> +				eventfd_ctx_put(vdev->ev_msix[i]);
> +		}
> +	}
> +	kfree(vdev->ev_msix);
> +	vdev->ev_msix = NULL;
> +	kfree(vdev->msix);
> +	vdev->msix = NULL;
> +	vdev->nvec = 0;
> +	pci_disable_msix(pdev);
> +}
> +
> +int vfio_enable_msix(struct vfio_dev *vdev, int nvec, void __user *uarg)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct eventfd_ctx *ctx;
> +	int ret = 0;
> +	int i;
> +	int fd;
> +
> +	vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
> +				GFP_KERNEL);
> +	vdev->ev_msix = kzalloc(nvec * sizeof(struct eventfd_ctx *),
> +				GFP_KERNEL);
> +	if (vdev->msix == NULL || vdev->ev_msix == NULL)
> +		ret = -ENOMEM;
> +	else {
> +		for (i = 0; i < nvec; i++) {
> +			if (copy_from_user(&fd, uarg, sizeof fd)) {
> +				ret = -EFAULT;
> +				break;
> +			}
> +			uarg += sizeof fd;
> +			ctx = eventfd_ctx_fdget(fd);
> +			if (IS_ERR(ctx)) {
> +				ret = PTR_ERR(ctx);
> +				break;
> +			}
> +			vdev->msix[i].entry = i;
> +			vdev->ev_msix[i] = ctx;
> +		}
> +	}
> +	if (!ret)
> +		ret = pci_enable_msix(pdev, vdev->msix, nvec);
> +	vdev->nvec = 0;
> +	for (i = 0; i < nvec && !ret; i++) {
> +		ret = request_irq(vdev->msix[i].vector, msihandler, 0,
> +			vdev->name, vdev->ev_msix[i]);
> +		if (ret)
> +			break;
> +		vdev->nvec = i+1;
> +	}
> +	if (ret)
> +		vfio_disable_msix(vdev);
> +	return ret;
> +}
> diff -uprN linux-2.6.34/drivers/vfio/vfio_main.c vfio-linux-2.6.34/drivers/vfio/vfio_main.c
> --- linux-2.6.34/drivers/vfio/vfio_main.c	1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/vfio_main.c	2010-05-28 14:13:38.000000000 -0700
> @@ -0,0 +1,627 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/mm.h>
> +#include <linux/idr.h>
> +#include <linux/string.h>
> +#include <linux/interrupt.h>
> +#include <linux/fs.h>
> +#include <linux/eventfd.h>
> +#include <linux/pci.h>
> +#include <linux/iommu.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/uaccess.h>
> +
> +#include <linux/vfio.h>
> +
> +
> +#define DRIVER_VERSION	"0.1"
> +#define DRIVER_AUTHOR	"Tom Lyon <pugs@cisco.com>"
> +#define DRIVER_DESC	"VFIO - User Level PCI meta-driver"
> +
> +static int vfio_major = -1;
> +DEFINE_IDR(vfio_idr);
> +/* Protect idr accesses */
> +DEFINE_MUTEX(vfio_minor_lock);
> +
> +/*
> + * Does [a1,b1) overlap [a2,b2) ?
> + */
> +static inline int overlap(int a1, int b1, int a2, int b2)
> +{
> +	/*
> +	 * Ranges overlap if they're not disjoint; and they're
> +	 * disjoint if the end of one is before the start of
> +	 * the other one.
> +	 */
> +	return !(b2 <= a1 || b1 <= a2);
> +}
> +
> +static int vfio_open(struct inode *inode, struct file *filep)
> +{
> +	struct vfio_dev *vdev;
> +	struct vfio_listener *listener;
> +	int ret = 0;
> +
> +	mutex_lock(&vfio_minor_lock);
> +	vdev = idr_find(&vfio_idr, iminor(inode));
> +	mutex_unlock(&vfio_minor_lock);
> +	if (!vdev) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	listener = kzalloc(sizeof(*listener), GFP_KERNEL);
> +	if (!listener) {
> +		ret = -ENOMEM;
> +		goto err_alloc_listener;
> +	}
> +
> +	listener->vdev = vdev;
> +	INIT_LIST_HEAD(&listener->dm_list);
> +	filep->private_data = listener;
> +
> +	mutex_lock(&vdev->gate);
> +	if (vdev->listeners == 0) {		/* first open */
> +		if (vdev->pmaster && !iommu_found() &&
> +		    !capable(CAP_SYS_RAWIO)) {
> +			mutex_unlock(&vdev->gate);
> +			ret = -EPERM;
> +			goto err_perm;
> +		}

root already can control devices through sysfs.
Let's not add more unsafe ways to do this:
if there's no iommu, just fail the open.


> +		/* reset to known state if we can */
> +		(void) pci_reset_function(vdev->pdev);
> +	}
> +	vdev->listeners++;
> +	mutex_unlock(&vdev->gate);
> +	return 0;
> +
> +err_perm:
> +	kfree(listener);
> +
> +err_alloc_listener:
> +out:
> +	return ret;
> +}
> +
> +static int vfio_release(struct inode *inode, struct file *filep)
> +{
> +	int ret = 0;
> +	struct vfio_listener *listener = filep->private_data;
> +	struct vfio_dev *vdev = listener->vdev;
> +
> +	vfio_dma_unmapall(listener);
> +	if (listener->mm) {
> +		mmu_notifier_unregister(&listener->mmu_notifier, listener->mm);
> +		listener->mm = NULL;
> +	}
> +
> +	mutex_lock(&vdev->gate);
> +	if (--vdev->listeners <= 0) {
> +		if (vdev->ev_msix)
> +			vfio_disable_msix(vdev);
> +		if (vdev->ev_msi)
> +			vfio_disable_msi(vdev);
> +		if (vdev->ev_irq) {
> +			eventfd_ctx_put(vdev->ev_irq);
> +			vdev->ev_irq = NULL;
> +		}
> +		if (vdev->domain) {
> +			iommu_domain_free(vdev->domain);
> +			vdev->domain = NULL;
> +		}
> +		/* reset to known state if we can */
> +		(void) pci_reset_function(vdev->pdev);
> +	}
> +	mutex_unlock(&vdev->gate);
> +
> +	kfree(listener);
> +	return ret;
> +}
> +
> +static ssize_t vfio_read(struct file *filep, char __user *buf,
> +			size_t count, loff_t *ppos)
> +{
> +	struct vfio_listener *listener = filep->private_data;
> +	struct vfio_dev *vdev = listener->vdev;
> +	struct pci_dev *pdev = vdev->pdev;
> +	int pci_space;
> +
> +	pci_space = vfio_offset_to_pci_space(*ppos);
> +	if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
> +		return vfio_config_readwrite(0, vdev, buf, count, ppos);
> +	if (pci_space > PCI_ROM_RESOURCE)
> +		return -EINVAL;
> +	if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
> +		return vfio_io_readwrite(0, vdev, buf, count, ppos);
> +	if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM)
> +		return vfio_mem_readwrite(0, vdev, buf, count, ppos);
> +	if (pci_space == PCI_ROM_RESOURCE)
> +		return vfio_mem_readwrite(0, vdev, buf, count, ppos);
> +	return -EINVAL;
> +}
> +
> +static int vfio_msix_check(struct vfio_dev *vdev, u64 start, u32 len)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	u16 pos;
> +	u32 table_offset;
> +	u16 table_size;
> +	u8 bir;
> +	u32 lo, hi, startp, endp;
> +
> +	pos = pci_find_capability(pdev, PCI_CAP_ID_MSIX);
> +	if (!pos)
> +		return 0;
> +
> +	pci_read_config_word(pdev, pos + PCI_MSIX_FLAGS, &table_size);
> +	table_size = (table_size & PCI_MSIX_FLAGS_QSIZE) + 1;
> +	pci_read_config_dword(pdev, pos + 4, &table_offset);
> +	bir = table_offset & PCI_MSIX_FLAGS_BIRMASK;
> +	table_offset &= ~(u32)PCI_MSIX_FLAGS_BIRMASK;
> +	lo = table_offset >> PAGE_SHIFT;
> +	hi = (table_offset + PCI_MSIX_ENTRY_SIZE * table_size + PAGE_SIZE - 1)
> +		>> PAGE_SHIFT;
> +	startp = start >> PAGE_SHIFT;
> +	endp = (start + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
> +	if (bir == vfio_offset_to_pci_space(start) &&
> +	    overlap(lo, hi, startp, endp)) {
> +		printk(KERN_WARNING "%s: cannot write msi-x vectors\n",
> +			__func__);
> +		return -EINVAL;
> +	}
> +	return 0;
> +}
> +
> +static ssize_t vfio_write(struct file *filep, const char __user *buf,
> +			size_t count, loff_t *ppos)
> +{
> +	struct vfio_listener *listener = filep->private_data;
> +	struct vfio_dev *vdev = listener->vdev;
> +	struct pci_dev *pdev = vdev->pdev;
> +	int pci_space;
> +	int ret;
> +
> +	pci_space = vfio_offset_to_pci_space(*ppos);
> +	if (pci_space == VFIO_PCI_CONFIG_RESOURCE)
> +		return vfio_config_readwrite(1, vdev,
> +					(char __user *)buf, count, ppos);
> +	if (pci_space > PCI_ROM_RESOURCE)
> +		return -EINVAL;
> +	if (pci_resource_flags(pdev, pci_space) & IORESOURCE_IO)
> +		return vfio_io_readwrite(1, vdev,
> +					(char __user *)buf, count, ppos);
> +	if (pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) {
> +		/* don't allow writes to msi-x vectors */
> +		ret = vfio_msix_check(vdev, *ppos, count);
> +		if (ret)
> +			return ret;
> +		return vfio_mem_readwrite(1, vdev,
> +				(char __user *)buf, count, ppos);
> +	}
> +	return -EINVAL;
> +}
> +
> +static int vfio_mmap(struct file *filep, struct vm_area_struct *vma)
> +{
> +	struct vfio_listener *listener = filep->private_data;
> +	struct vfio_dev *vdev = listener->vdev;
> +	struct pci_dev *pdev = vdev->pdev;
> +	unsigned long requested, actual;
> +	int pci_space;
> +	u64 start;
> +	u32 len;
> +	unsigned long phys;
> +	int ret;
> +
> +	if (vma->vm_end < vma->vm_start)
> +		return -EINVAL;
> +	if ((vma->vm_flags & VM_SHARED) == 0)
> +		return -EINVAL;
> +
> +
> +	pci_space = vfio_offset_to_pci_space((u64)vma->vm_pgoff << PAGE_SHIFT);
> +	if (pci_space > PCI_ROM_RESOURCE)
> +		return -EINVAL;
> +	switch (pci_space) {
> +	case PCI_ROM_RESOURCE:
> +		if (vma->vm_flags & VM_WRITE)
> +			return -EINVAL;
> +		if (pci_resource_flags(pdev, PCI_ROM_RESOURCE) == 0)
> +			return -EINVAL;
> +		actual = pci_resource_len(pdev, PCI_ROM_RESOURCE) >> PAGE_SHIFT;
> +		break;
> +	default:
> +		if ((pci_resource_flags(pdev, pci_space) & IORESOURCE_MEM) == 0)
> +			return -EINVAL;
> +		actual = pci_resource_len(pdev, pci_space) >> PAGE_SHIFT;
> +		break;
> +	}
> +
> +	requested = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
> +	if (requested > actual || actual == 0)
> +		return -EINVAL;
> +
> +	/*
> +	 * Can't allow non-priv users to mmap MSI-X vectors
> +	 * else they can write anywhere in phys memory
> +	 */

I don't get the above comment. If the device can do DMA,
then with BAR access you can make it DMA to an arbitrary address
anyway. Doesn't this driver enforce an iommu to protect against
this?

> +	start = vma->vm_pgoff << PAGE_SHIFT;
> +	len = vma->vm_end - vma->vm_start;
> +	if (vma->vm_flags & VM_WRITE) {
> +		ret = vfio_msix_check(vdev, start, len);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	vma->vm_private_data = vdev;
> +	vma->vm_flags |= VM_IO | VM_RESERVED;
> +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +	phys = pci_resource_start(pdev, pci_space) >> PAGE_SHIFT;
> +
> +	return remap_pfn_range(vma, vma->vm_start, phys,
> +			       vma->vm_end - vma->vm_start,
> +			       vma->vm_page_prot);

This seems to map the PCI BAR directly into userspace, but does not
seem to do any accounting.
What prevents userspace from hanging on to the mapped
address after closing the fd?
If this happens, the iommu won't protect us.

> +}
> +
> +static long vfio_unl_ioctl(struct file *filep,
> +			unsigned int cmd,
> +			unsigned long arg)
> +{
> +	struct vfio_listener *listener = filep->private_data;
> +	struct vfio_dev *vdev = listener->vdev;
> +	void __user *uarg = (void __user *)arg;
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct vfio_dma_map dm;
> +	int ret = 0;
> +	u64 mask;
> +	int fd, nfd;
> +	int bar;
> +
> +	if (vdev == NULL)
> +		return -EINVAL;
> +
> +	switch (cmd) {
> +
> +	case VFIO_DMA_MAP_ANYWHERE:
> +	case VFIO_DMA_MAP_IOVA:
> +		if (copy_from_user(&dm, uarg, sizeof dm))
> +			return -EFAULT;
> +		ret = vfio_dma_map_common(listener, cmd, &dm);
> +		if (!ret && copy_to_user(uarg, &dm, sizeof dm))
> +			ret = -EFAULT;
> +		break;
> +
> +	case VFIO_DMA_UNMAP:
> +		if (copy_from_user(&dm, uarg, sizeof dm))
> +			return -EFAULT;
> +		ret = vfio_dma_unmap_dm(listener, &dm);
> +		break;
> +
> +	case VFIO_DMA_MASK:	/* set master mode and DMA mask */
> +		if (copy_from_user(&mask, uarg, sizeof mask))
> +			return -EFAULT;
> +		pci_set_master(pdev);
> +		ret = pci_set_dma_mask(pdev, mask);
> +		break;
> +
> +	case VFIO_EVENTFD_IRQ:
> +		if (copy_from_user(&fd, uarg, sizeof fd))
> +			return -EFAULT;
> +		if (vdev->ev_irq)
> +			eventfd_ctx_put(vdev->ev_irq);
> +		if (fd >= 0) {
> +			vdev->ev_irq = eventfd_ctx_fdget(fd);
> +			if (vdev->ev_irq == NULL)
> +				ret = -EINVAL;
> +		}
> +		break;
> +
> +	case VFIO_EVENTFD_MSI:
> +		if (copy_from_user(&fd, uarg, sizeof fd))
> +			return -EFAULT;
> +		if (fd >= 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
> +			ret = vfio_enable_msi(vdev, fd);
> +		else if (fd < 0 && vdev->ev_msi)
> +			vfio_disable_msi(vdev);
> +		else
> +			ret = -EINVAL;
> +		break;
> +
> +	case VFIO_EVENTFDS_MSIX:
> +		if (copy_from_user(&nfd, uarg, sizeof nfd))
> +			return -EFAULT;
> +		uarg += sizeof nfd;
> +		if (nfd > 0 && vdev->ev_msi == NULL && vdev->ev_msix == NULL)
> +			ret = vfio_enable_msix(vdev, nfd, uarg);
> +		else if (nfd == 0 && vdev->ev_msix)
> +			vfio_disable_msix(vdev);
> +		else
> +			ret = -EINVAL;
> +		break;
> +
> +	case VFIO_BAR_LEN:
> +		if (copy_from_user(&bar, uarg, sizeof bar))
> +			return -EFAULT;
> +		if (bar < 0 || bar > PCI_ROM_RESOURCE)
> +			return -EINVAL;
> +		bar = pci_resource_len(pdev, bar);
> +		if (copy_to_user(uarg, &bar, sizeof bar))
> +			return -EFAULT;
> +		break;
> +
> +	default:
> +		return -EINVAL;
> +	}
> +	return ret;
> +}
> +
> +static const struct file_operations vfio_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vfio_open,
> +	.release	= vfio_release,
> +	.read		= vfio_read,
> +	.write		= vfio_write,
> +	.unlocked_ioctl	= vfio_unl_ioctl,
> +	.mmap		= vfio_mmap,
> +};
> +
> +static int vfio_get_devnum(struct vfio_dev *vdev)
> +{
> +	int retval = -ENOMEM;
> +	int id;
> +
> +	mutex_lock(&vfio_minor_lock);
> +	if (idr_pre_get(&vfio_idr, GFP_KERNEL) == 0)
> +		goto exit;
> +
> +	retval = idr_get_new(&vfio_idr, vdev, &id);
> +	if (retval < 0) {
> +		if (retval == -EAGAIN)
> +			retval = -ENOMEM;
> +		goto exit;
> +	}
> +	if (id > MINORMASK) {
> +		idr_remove(&vfio_idr, id);
> +		retval = -ENOMEM;
> +		goto exit;
> +	}
> +	if (vfio_major < 0) {
> +		retval = register_chrdev(0, "vfio", &vfio_fops);
> +		if (retval < 0)
> +			goto exit;
> +		vfio_major = retval;
> +	}
> +
> +	retval = MKDEV(vfio_major, id);
> +exit:
> +	mutex_unlock(&vfio_minor_lock);
> +	return retval;
> +}
> +
> +static void vfio_free_minor(struct vfio_dev *vdev)
> +{
> +	mutex_lock(&vfio_minor_lock);
> +	idr_remove(&vfio_idr, MINOR(vdev->devnum));
> +	mutex_unlock(&vfio_minor_lock);
> +}
> +
> +/*
> + * Verify that the device supports Interrupt Disable bit in command register,
> + * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
> + * in PCI 2.2.  (from uio_pci_generic)
> + */
> +static int verify_pci_2_3(struct pci_dev *pdev)
> +{
> +	u16 orig, new;
> +	int err = 0;
> +	u8 line;
> +
> +	pci_block_user_cfg_access(pdev);
> +
> +	pci_read_config_byte(pdev, PCI_INTERRUPT_LINE, &line);
> +	if (line == 0)
> +		goto out;
> +
> +	pci_read_config_word(pdev, PCI_COMMAND, &orig);
> +	pci_write_config_word(pdev, PCI_COMMAND,
> +			      orig ^ PCI_COMMAND_INTX_DISABLE);
> +	pci_read_config_word(pdev, PCI_COMMAND, &new);
> +	/* There's no way to protect against
> +	 * hardware bugs or detect them reliably, but as long as we know
> +	 * what the value should be, let's go ahead and check it. */
> +	if ((new ^ orig) & ~PCI_COMMAND_INTX_DISABLE) {
> +		err = -EBUSY;
> +		dev_err(&pdev->dev, "Command changed from 0x%x to 0x%x: "
> +			"driver or HW bug?\n", orig, new);
> +		goto out;
> +	}
> +	if (!((new ^ orig) & PCI_COMMAND_INTX_DISABLE)) {
> +		dev_warn(&pdev->dev, "Device does not support "
> +			 "disabling interrupts: unable to bind.\n");
> +		err = -ENODEV;
> +		goto out;
> +	}
> +	/* Now restore the original value. */
> +	pci_write_config_word(pdev, PCI_COMMAND, orig);
> +out:
> +	pci_unblock_user_cfg_access(pdev);
> +	return err;
> +}
> +
> +static int pci_is_master(struct pci_dev *pdev)
> +{
> +	int ret;
> +	u16 orig, new;
> +
> +	if (pci_find_capability(pdev, PCI_CAP_ID_MSI))
> +		return 1;
> +	if (pci_find_capability(pdev, PCI_CAP_ID_MSIX))
> +		return 1;
> +
> +	pci_block_user_cfg_access(pdev);
> +
> +	pci_read_config_word(pdev, PCI_COMMAND, &orig);
> +	ret = orig & PCI_COMMAND_MASTER;
> +	if (!ret) {
> +		new = orig | PCI_COMMAND_MASTER;
> +		pci_write_config_word(pdev, PCI_COMMAND, new);
> +		pci_read_config_word(pdev, PCI_COMMAND, &new);
> +		ret = new & PCI_COMMAND_MASTER;
> +		pci_write_config_word(pdev, PCI_COMMAND, orig);
> +	}
> +
> +	pci_unblock_user_cfg_access(pdev);
> +	return ret;
> +}
> +
> +static int vfio_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> +{
> +	struct vfio_dev *vdev;
> +	int err;
> +
> +	err = pci_enable_device(pdev);
> +	if (err) {
> +		dev_err(&pdev->dev, "%s: pci_enable_device failed: %d\n",
> +			__func__, err);
> +		return err;
> +	}
> +
> +	err = verify_pci_2_3(pdev);
> +	if (err)
> +		goto err_verify;
> +
> +	vdev = kzalloc(sizeof(struct vfio_dev), GFP_KERNEL);
> +	if (!vdev) {
> +		err = -ENOMEM;
> +		goto err_alloc;
> +	}
> +	vdev->pdev = pdev;
> +	vdev->pmaster = pci_is_master(pdev);
> +
> +	err = vfio_class_init();
> +	if (err)
> +		goto err_class;
> +
> +	mutex_init(&vdev->gate);
> +
> +	err = vfio_get_devnum(vdev);
> +	if (err < 0)
> +		goto err_get_devnum;
> +	vdev->devnum = err;
> +	err = 0;
> +
> +	sprintf(vdev->name, "vfio%d", MINOR(vdev->devnum));
> +	pci_set_drvdata(pdev, vdev);
> +	vdev->dev = device_create(vfio_class->class, &pdev->dev,
> +			  vdev->devnum, vdev, vdev->name);
> +	if (IS_ERR(vdev->dev)) {
> +		printk(KERN_ERR "VFIO: device register failed\n");
> +		err = PTR_ERR(vdev->dev);
> +		goto err_device_create;
> +	}
> +
> +	err = vfio_dev_add_attributes(vdev);
> +	if (err)
> +		goto err_vfio_dev_add_attributes;
> +
> +
> +	if (pdev->irq > 0) {
> +		err = request_irq(pdev->irq, vfio_interrupt,
> +				  IRQF_SHARED, "vfio", vdev);
> +		if (err)
> +			goto err_request_irq;
> +	}
> +	vdev->vinfo.bardirty = 1;
> +
> +	return 0;
> +
> +err_request_irq:
> +#ifdef notdef
> +	vfio_dev_del_attributes(vdev);
> +#endif
> +err_vfio_dev_add_attributes:
> +	device_destroy(vfio_class->class, vdev->devnum);
> +err_device_create:
> +	vfio_free_minor(vdev);
> +err_get_devnum:
> +err_class:
> +	kfree(vdev);
> +err_alloc:
> +err_verify:
> +	pci_disable_device(pdev);
> +	return err;
> +}
> +
> +static void vfio_remove(struct pci_dev *pdev)
> +{
> +	struct vfio_dev *vdev = pci_get_drvdata(pdev);
> +
> +	vfio_free_minor(vdev);
> +
> +	if (pdev->irq > 0)
> +		free_irq(pdev->irq, vdev);
> +
> +#ifdef notdef
> +	vfio_dev_del_attributes(vdev);
> +#endif
> +
> +	pci_set_drvdata(pdev, NULL);
> +	device_destroy(vfio_class->class, vdev->devnum);
> +	kfree(vdev);
> +	vfio_class_destroy();
> +	pci_disable_device(pdev);
> +}
> +
> +static struct pci_driver driver = {
> +	.name = "vfio",
> +	.id_table = NULL, /* only dynamic id's */
> +	.probe = vfio_probe,
> +	.remove = vfio_remove,
> +};
> +
> +static int __init init(void)
> +{
> +	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
> +	return pci_register_driver(&driver);
> +}
> +
> +static void __exit cleanup(void)
> +{
> +	if (vfio_major >= 0)
> +		unregister_chrdev(vfio_major, "vfio");
> +	pci_unregister_driver(&driver);
> +}
> +
> +module_init(init);
> +module_exit(cleanup);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff -uprN linux-2.6.34/drivers/vfio/vfio_pci_config.c vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c
> --- linux-2.6.34/drivers/vfio/vfio_pci_config.c	1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/vfio_pci_config.c	2010-05-28 14:26:47.000000000 -0700
> @@ -0,0 +1,554 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/pci.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +
> +#define PCI_CAP_ID_BASIC	0
> +#ifndef PCI_CAP_ID_MAX
> +#define	PCI_CAP_ID_MAX		PCI_CAP_ID_AF
> +#endif
> +
> +/*
> + * Lengths of PCI Config Capabilities
> + * 0 means unknown (but at least 4)
> + * FF means special/variable
> + */
> +static u8 pci_capability_length[] = {
> +	[PCI_CAP_ID_BASIC]	= 64,		/* pci config header */
> +	[PCI_CAP_ID_PM]		= PCI_PM_SIZEOF,
> +	[PCI_CAP_ID_AGP]	= PCI_AGP_SIZEOF,
> +	[PCI_CAP_ID_VPD]	= 8,
> +	[PCI_CAP_ID_SLOTID]	= 4,
> +	[PCI_CAP_ID_MSI]	= 0xFF,		/* 10, 14, or 24 */
> +	[PCI_CAP_ID_CHSWP]	= 4,
> +	[PCI_CAP_ID_PCIX]	= 0xFF,		/* 8 or 24 */
> +	[PCI_CAP_ID_HT]		= 28,
> +	[PCI_CAP_ID_VNDR]	= 0xFF,
> +	[PCI_CAP_ID_DBG]	= 0,
> +	[PCI_CAP_ID_CCRC]	= 0,
> +	[PCI_CAP_ID_SHPC]	= 0,
> +	[PCI_CAP_ID_SSVID]	= 0,		/* bridge only - not supp */
> +	[PCI_CAP_ID_AGP3]	= 0,
> +	[PCI_CAP_ID_EXP]	= 36,
> +	[PCI_CAP_ID_MSIX]	= 12,
> +	[PCI_CAP_ID_AF]		= 6,
> +};
> +
> +/*
> + * Read/Write Permission Bits - one bit for each bit in capability
> + * Any field can be read if it exists,
> + * but what is read depends on whether the field
> + * is 'virtualized', or just pass thru to the hardware.
> + * Any virtualized field is also virtualized for writes.
> + * Writes are only permitted if they have a 1 bit here.
> + */
> +struct perm_bits {
> +	u32	rvirt;		/* read bits which must be virtualized */
> +	u32	write;		/* writeable bits - virt if read virt */
> +};
> +
> +static struct perm_bits pci_cap_basic_perm[] = {
> +	{ 0xFFFFFFFF,	0, },		/* 0x00 vendor & device id - RO */
> +	{ 0,		0xFFFFFFFC, },	/* 0x04 cmd & status except mem/io */
> +	{ 0,		0xFF00FFFF, },	/* 0x08 bist, htype, lat, cache */
> +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x0c bar */
> +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x10 bar */
> +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x14 bar */
> +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x18 bar */
> +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x1c bar */
> +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x20 bar */
> +	{ 0,		0, },		/* 0x24 cardbus - not yet */
> +	{ 0,		0, },		/* 0x28 subsys vendor & dev */
> +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x2c rom bar */
> +	{ 0,		0, },		/* 0x30 capability ptr & resv */
> +	{ 0,		0, },		/* 0x34 resv */
> +	{ 0,		0, },		/* 0x38 resv */
> +	{ 0x000000FF,	0x000000FF, },	/* 0x3c max_lat ... irq */
> +};
> +
> +static struct perm_bits pci_cap_pm_perm[] = {
> +	{ 0,		0, },		/* 0x00 PM capabilities */
> +	{ 0,		0xFFFFFFFF, },	/* 0x04 PM control/status */
> +};
> +
> +static struct perm_bits pci_cap_vpd_perm[] = {
> +	{ 0,		0xFFFF0000, },	/* 0x00 address */
> +	{ 0,		0xFFFFFFFF, },	/* 0x04 data */
> +};
> +
> +static struct perm_bits pci_cap_slotid_perm[] = {
> +	{ 0,		0, },		/* 0x00 all read only */
> +};
> +
> +static struct perm_bits pci_cap_msi_perm[] = {
> +	{ 0,		0, },		/* 0x00 MSI message control */
> +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x04 MSI message address */
> +	{ 0xFFFFFFFF,	0xFFFFFFFF, },	/* 0x08 MSI message addr/data */
> +	{ 0x0000FFFF,	0x0000FFFF, },	/* 0x0c MSI message data */
> +	{ 0,		0xFFFFFFFF, },	/* 0x10 MSI mask bits */
> +	{ 0,		0xFFFFFFFF, },	/* 0x14 MSI pending bits */
> +};
> +
> +static struct perm_bits pci_cap_pcix_perm[] = {
> +	{ 0,		0xFFFF0000, },	/* 0x00 PCI_X_CMD */
> +	{ 0,		0, },		/* 0x04 PCI_X_STATUS */
> +	{ 0,		0xFFFFFFFF, },	/* 0x08 ECC ctlr & status */
> +	{ 0,		0, },		/* 0x0c ECC first addr */
> +	{ 0,		0, },		/* 0x10 ECC second addr */
> +	{ 0,		0, },		/* 0x14 ECC attr */
> +};
> +
> +/* pci express capabilities */
> +static struct perm_bits pci_cap_exp_perm[] = {
> +	{ 0,		0, },		/* 0x00 PCIe capabilities */
> +	{ 0,		0, },		/* 0x04 PCIe device capabilities */
> +	{ 0,		0xFFFFFFFF, },	/* 0x08 PCIe device control & status */
> +	{ 0,		0, },		/* 0x0c PCIe link capabilities */
> +	{ 0,		0x000000FF, },	/* 0x10 PCIe link ctl/stat - SAFE? */
> +	{ 0,		0, },		/* 0x14 PCIe slot capabilities */
> +	{ 0,		0x00FFFFFF, },	/* 0x18 PCIe link ctl/stat - SAFE? */
> +	{ 0,		0, },		/* 0x1c PCIe root port stuff */
> +	{ 0,		0, },		/* 0x20 PCIe root port stuff */
> +};
> +
> +static struct perm_bits pci_cap_msix_perm[] = {
> +	{ 0,		0, },		/* 0x00 MSI-X Enable */
> +	{ 0,		0, },		/* 0x04 table offset & bir */
> +	{ 0,		0, },		/* 0x08 pba offset & bir */
> +};
> +
> +static struct perm_bits pci_cap_af_perm[] = {
> +	{ 0,		0, },		/* 0x00 af capability */
> +	{ 0,		0x0001,	 },	/* 0x04 af flr bit */
> +};
> +
> +static struct perm_bits *pci_cap_perms[] = {
> +	[PCI_CAP_ID_BASIC]	= pci_cap_basic_perm,
> +	[PCI_CAP_ID_PM]		= pci_cap_pm_perm,
> +	[PCI_CAP_ID_VPD]	= pci_cap_vpd_perm,
> +	[PCI_CAP_ID_SLOTID]	= pci_cap_slotid_perm,
> +	[PCI_CAP_ID_MSI]	= pci_cap_msi_perm,
> +	[PCI_CAP_ID_PCIX]	= pci_cap_pcix_perm,
> +	[PCI_CAP_ID_EXP]	= pci_cap_exp_perm,
> +	[PCI_CAP_ID_MSIX]	= pci_cap_msix_perm,
> +	[PCI_CAP_ID_AF]		= pci_cap_af_perm,
> +};
> +
> +/*
> + * We build a map of the config space that tells us where
> + * and what capabilities exist, so that we can map reads and
> + * writes back to capabilities, and thus figure out what to
> + * allow, deny, or virtualize
> + */
> +int vfio_build_config_map(struct vfio_dev *vdev)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	u8 *map;
> +	int i, len;
> +	u8 pos, cap, tmp;
> +	u16 flags;
> +	int ret;
> +	int loops = 100;
> +
> +	map = kmalloc(pdev->cfg_size, GFP_KERNEL);
> +	if (map == NULL)
> +		return -ENOMEM;
> +	for (i = 0; i < pdev->cfg_size; i++)
> +		map[i] = 0xFF;
> +	vdev->pci_config_map = map;
> +
> +	/* default config space */
> +	for (i = 0; i < pci_capability_length[0]; i++)
> +		map[i] = 0;
> +
> +	/* any capabilities? */
> +	ret = pci_read_config_word(pdev, PCI_STATUS, &flags);
> +	if (ret < 0)
> +		return ret;
> +	if ((flags & PCI_STATUS_CAP_LIST) == 0)
> +		return 0;
> +
> +	ret = pci_read_config_byte(pdev, PCI_CAPABILITY_LIST, &pos);
> +	if (ret < 0)
> +		return ret;
> +	while (pos && --loops > 0) {
> +		ret = pci_read_config_byte(pdev, pos, &cap);
> +		if (ret < 0)
> +			return ret;
> +		if (cap == 0) {
> +			printk(KERN_WARNING "%s: cap 0\n", __func__);
> +			break;
> +		}
> +		if (cap > PCI_CAP_ID_MAX) {
> +			printk(KERN_WARNING "%s: unknown pci capability id %x\n",
> +					__func__, cap);
> +			len = 0;
> +		} else
> +			len = pci_capability_length[cap];
> +		if (len == 0) {
> +			printk(KERN_WARNING "%s: unknown length for pci cap %x\n",
> +					__func__, cap);
> +			len = 4;
> +		}
> +		if (len == 0xFF) {
> +			switch (cap) {
> +			case PCI_CAP_ID_MSI:
> +				ret = pci_read_config_word(pdev,
> +						pos + PCI_MSI_FLAGS, &flags);
> +				if (ret < 0)
> +					return ret;
> +				if (flags & PCI_MSI_FLAGS_MASKBIT)
> +					/* per vec masking */
> +					len = 24;
> +				else if (flags & PCI_MSI_FLAGS_64BIT)
> +					/* 64 bit */
> +					len = 14;
> +				else
> +					len = 10;
> +				break;
> +			case PCI_CAP_ID_PCIX:
> +				ret = pci_read_config_word(pdev, pos + 2,
> +					&flags);
> +				if (ret < 0)
> +					return ret;
> +				if (flags & 0x3000)
> +					len = 24;
> +				else
> +					len = 8;
> +				break;
> +			case PCI_CAP_ID_VNDR:
> +				/* length follows next field */
> +				ret = pci_read_config_byte(pdev, pos + 2, &tmp);
> +				if (ret < 0)
> +					return ret;
> +				len = tmp;
> +				break;
> +			default:
> +				len = 0;
> +				break;
> +			}
> +		}
> +
> +		for (i = 0; i < len; i++) {
> +			if (map[pos+i] != 0xFF)
> +				printk(KERN_WARNING
> +					"%s: pci config conflict at %x, "
> +					"caps %x %x\n",
> +					__func__, pos + i, map[pos+i], cap);
> +			map[pos+i] = cap;
> +		}
> +		ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &pos);
> +		if (ret < 0)
> +			return ret;
> +	}
> +	if (loops <= 0)
> +		printk(KERN_ERR "%s: config space loop!\n", __func__);
> +	return 0;
> +}
> +
> +static void vfio_virt_init(struct vfio_dev *vdev)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	int bar;
> +	u32 *lp;
> +	u32 val;
> +	u8 *map, pos;
> +	u16 flags;
> +	int i, len;
> +	int ret;
> +
> +	for (bar = 0; bar <= 5; bar++) {
> +		lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> +		pci_read_config_dword(pdev, PCI_BASE_ADDRESS_0 + 4*bar, &val);
> +		*lp++ = val;
> +	}
> +	lp = (u32 *)vdev->vinfo.rombar;
> +	pci_read_config_dword(pdev, PCI_ROM_ADDRESS, &val);
> +	*lp = val;
> +
> +	vdev->vinfo.intr = pdev->irq;
> +
> +	pos = pci_find_capability(pdev, PCI_CAP_ID_MSI);
> +	map = vdev->pci_config_map + pos;
> +	if (pos > 0) {
> +		ret = pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &flags);
> +		if (ret < 0)
> +			return;
> +		if (flags & PCI_MSI_FLAGS_MASKBIT)	/* per vec masking */
> +			len = 24;
> +		else if (flags & PCI_MSI_FLAGS_64BIT)	/* 64 bit */
> +			len = 14;
> +		else
> +			len = 10;
> +		for (i = 0; i < len; i++)
> +			(void) pci_read_config_byte(pdev, pos + i,
> +						&vdev->vinfo.msi[i]);
> +	}
> +}
> +
> +static void vfio_bar_fixup(struct vfio_dev *vdev)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	int bar;
> +	u32 *lp;
> +	u32 len;
> +
> +	for (bar = 0; bar <= 5; bar++) {
> +		len = pci_resource_len(pdev, bar);
> +		lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> +		if (len == 0) {
> +			*lp = 0;
> +		} else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
> +			*lp &= ~0x1;
> +			*lp = (*lp & ~(len-1)) |
> +				(*lp & ~PCI_BASE_ADDRESS_MEM_MASK);
> +			if (*lp & PCI_BASE_ADDRESS_MEM_TYPE_64)
> +				bar++;
> +		} else if (pci_resource_flags(pdev, bar) & IORESOURCE_IO) {
> +			*lp |= PCI_BASE_ADDRESS_SPACE_IO;
> +			*lp = (*lp & ~(len-1)) |
> +				(*lp & ~PCI_BASE_ADDRESS_IO_MASK);
> +		}
> +	}
> +	lp = (u32 *)vdev->vinfo.rombar;
> +	len = pci_resource_len(pdev, PCI_ROM_RESOURCE);
> +	*lp = *lp & PCI_ROM_ADDRESS_MASK & ~(len-1);
> +	vdev->vinfo.bardirty = 0;
> +}
> +
> +static int vfio_config_rwbyte(int write,
> +				struct vfio_dev *vdev,
> +				int pos,
> +				char __user *buf)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	u8 *map = vdev->pci_config_map;
> +	u8 cap, val, newval;
> +	u16 start, off;
> +	int p;
> +	struct perm_bits *perm;
> +	u8 wr, virt;
> +	int ret;
> +
> +	cap = map[pos];
> +	if (cap == 0xFF) {	/* unknown region */
> +		if (write)
> +			return 0;	/* silent no-op */
> +		val = 0;
> +		if (pos <= pci_capability_length[0])	/* ok to read */
> +			(void) pci_read_config_byte(pdev, pos, &val);
> +		if (copy_to_user(buf, &val, 1))
> +			return -EFAULT;
> +		return 0;
> +	}
> +
> +	/* scan back to start of cap region */
> +	for (p = pos; p >= 0; p--) {
> +		if (map[p] != cap)
> +			break;
> +		start = p;
> +	}
> +	off = pos - start;	/* offset within capability */
> +
> +	perm = pci_cap_perms[cap];
> +	if (perm == NULL) {
> +		wr = 0;
> +		virt = 0;
> +	} else {
> +		perm += (off >> 2);
> +		wr = perm->write >> ((off & 3) * 8);
> +		virt = perm->rvirt >> ((off & 3) * 8);
> +	}
> +	if (write && !wr)		/* no writeable bits */
> +		return 0;
> +	if (!virt) {
> +		if (write) {
> +			if (copy_from_user(&val, buf, 1))
> +				return -EFAULT;
> +			val &= wr;
> +			if (wr != 0xFF) {
> +				u8 existing;
> +
> +				ret = pci_read_config_byte(pdev, pos,
> +								&existing);
> +				if (ret < 0)
> +					return ret;
> +				val |= (existing & ~wr);
> +			}
> +			pci_write_config_byte(pdev, pos, val);
> +		} else {
> +			ret = pci_read_config_byte(pdev, pos, &val);
> +			if (ret < 0)
> +				return ret;
> +			if (copy_to_user(buf, &val, 1))
> +				return -EFAULT;
> +		}
> +		return 0;
> +	}
> +
> +	if (write) {
> +		if (copy_from_user(&newval, buf, 1))
> +			return -EFAULT;
> +	}
> +	/*
> +	 * We get here if there are some virt bits
> +	 * handle remaining real bits, if any
> +	 */
> +	if (~virt) {
> +		u8 rbits = (~virt) & wr;
> +
> +		ret = pci_read_config_byte(pdev, pos, &val);
> +		if (ret < 0)
> +			return ret;
> +		if (write && rbits) {
> +			val &= ~rbits;
> +			newval &= rbits;
> +			val |= newval;
> +			pci_write_config_byte(pdev, pos, val);
> +		}
> +	}
> +	/*
> +	 * Now handle entirely virtual fields
> +	 */
> +	switch (cap) {
> +	case PCI_CAP_ID_BASIC:		/* virtualize BARs */
> +		switch (off) {
> +		/*
> +		 * vendor and device are virt because they don't
> +		 * show up otherwise for sr-iov vfs
> +		 */
> +		case PCI_VENDOR_ID:
> +			val = pdev->vendor;
> +			break;
> +		case PCI_VENDOR_ID + 1:
> +			val = pdev->vendor >> 8;
> +			break;
> +		case PCI_DEVICE_ID:
> +			val = pdev->device;
> +			break;
> +		case PCI_DEVICE_ID + 1:
> +			val = pdev->device >> 8;
> +			break;
> +		case PCI_INTERRUPT_LINE:
> +			if (write)
> +				vdev->vinfo.intr = newval;
> +			else
> +				val = vdev->vinfo.intr;
> +			break;
> +		case PCI_ROM_ADDRESS:
> +		case PCI_ROM_ADDRESS+1:
> +		case PCI_ROM_ADDRESS+2:
> +		case PCI_ROM_ADDRESS+3:
> +			if (write) {
> +				vdev->vinfo.rombar[off & 3] = newval;
> +				vdev->vinfo.bardirty = 1;
> +			} else {
> +				if (vdev->vinfo.bardirty)
> +					vfio_bar_fixup(vdev);
> +				val = vdev->vinfo.rombar[off & 3];
> +			}
> +			break;
> +		default:
> +			if (off >= PCI_BASE_ADDRESS_0 &&
> +			    off <= PCI_BASE_ADDRESS_5 + 3) {
> +				int boff = off - PCI_BASE_ADDRESS_0;
> +
> +				if (write) {
> +					vdev->vinfo.bar[boff] = newval;
> +					vdev->vinfo.bardirty = 1;
> +				} else {
> +					if (vdev->vinfo.bardirty)
> +						vfio_bar_fixup(vdev);
> +					val = vdev->vinfo.bar[boff];
> +				}
> +			}
> +			break;
> +		}
> +		break;
> +	case PCI_CAP_ID_MSI:		/* virtualize MSI */
> +		if (off >= PCI_MSI_ADDRESS_LO && off <= (PCI_MSI_DATA_64 + 2)) {
> +			int moff = off - PCI_MSI_ADDRESS_LO;
> +
> +			if (write)
> +				vdev->vinfo.msi[moff] = newval;
> +			else
> +				val = vdev->vinfo.msi[moff];
> +			break;
> +		}
> +		break;
> +	}
> +	if (!write && copy_to_user(buf, &val, 1))
> +		return -EFAULT;
> +	return 0;
> +}
> +
> +ssize_t vfio_config_readwrite(int write,
> +		struct vfio_dev *vdev,
> +		char __user *buf,
> +		size_t count,
> +		loff_t *ppos)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	int done = 0;
> +	int ret;
> +	int pos;
> +
> +	pci_block_user_cfg_access(pdev);
> +
> +	if (vdev->pci_config_map == NULL) {
> +		ret = vfio_build_config_map(vdev);
> +		if (ret < 0)
> +			goto out;
> +		vfio_virt_init(vdev);
> +	}
> +
> +	while (count > 0) {
> +		pos = *ppos;
> +		if (pos == pdev->cfg_size)
> +			break;
> +		if (pos > pdev->cfg_size) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		ret = vfio_config_rwbyte(write, vdev, pos, buf);
> +		if (ret < 0)
> +			goto out;
> +		buf++;
> +		done++;
> +		count--;
> +		(*ppos)++;
> +	}
> +	ret = done;
> +out:
> +	pci_unblock_user_cfg_access(pdev);
> +	return ret;
> +}
> diff -uprN linux-2.6.34/drivers/vfio/vfio_rdwr.c vfio-linux-2.6.34/drivers/vfio/vfio_rdwr.c
> --- linux-2.6.34/drivers/vfio/vfio_rdwr.c	1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/vfio_rdwr.c	2010-05-28 14:27:40.000000000 -0700
> @@ -0,0 +1,147 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + */
> +
> +#include <linux/fs.h>
> +#include <linux/mmu_notifier.h>
> +#include <linux/pci.h>
> +#include <linux/uaccess.h>
> +#include <linux/io.h>
> +
> +#include <linux/vfio.h>
> +
> +ssize_t vfio_io_readwrite(
> +		int write,
> +		struct vfio_dev *vdev,
> +		char __user *buf,
> +		size_t count,
> +		loff_t *ppos)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	size_t done = 0;
> +	resource_size_t end;
> +	void __iomem *io;
> +	loff_t pos;
> +	int pci_space;
> +	int unit;
> +
> +	pci_space = vfio_offset_to_pci_space(*ppos);
> +	pos = (*ppos & 0xFFFFFFFF);
> +
> +	if (vdev->bar[pci_space] == NULL)
> +		vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
> +	io = vdev->bar[pci_space];
> +	end = pci_resource_len(pdev, pci_space);
> +	if (pos + count > end)
> +		return -EINVAL;
> +
> +	while (count > 0) {
> +		if ((pos % 4) == 0 && count >= 4) {
> +			u32 val;
> +
> +			if (write) {
> +				if (copy_from_user(&val, buf, 4))
> +					return -EFAULT;
> +				iowrite32(val, io + pos);
> +			} else {
> +				val = ioread32(io + pos);
> +				if (copy_to_user(buf, &val, 4))
> +					return -EFAULT;
> +			}
> +			unit = 4;
> +		} else if ((pos % 2) == 0 && count >= 2) {
> +			u16 val;
> +
> +			if (write) {
> +				if (copy_from_user(&val, buf, 2))
> +					return -EFAULT;
> +				iowrite16(val, io + pos);
> +			} else {
> +				val = ioread16(io + pos);
> +				if (copy_to_user(buf, &val, 2))
> +					return -EFAULT;
> +			}
> +			unit = 2;
> +		} else {
> +			u8 val;
> +
> +			if (write) {
> +				if (copy_from_user(&val, buf, 1))
> +					return -EFAULT;
> +				iowrite8(val, io + pos);
> +			} else {
> +				val = ioread8(io + pos);
> +				if (copy_to_user(buf, &val, 1))
> +					return -EFAULT;
> +			}
> +			unit = 1;
> +		}
> +		pos += unit;
> +		buf += unit;
> +		count -= unit;
> +		done += unit;
> +	}
> +	*ppos += done;
> +	return done;
> +}
> +
> +ssize_t vfio_mem_readwrite(
> +		int write,
> +		struct vfio_dev *vdev,
> +		char __user *buf,
> +		size_t count,
> +		loff_t *ppos)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	resource_size_t end;
> +	void __iomem *io;
> +	loff_t pos;
> +	int pci_space;
> +
> +	pci_space = vfio_offset_to_pci_space(*ppos);
> +	pos = (*ppos & 0xFFFFFFFF);
> +
> +	if (vdev->bar[pci_space] == NULL)
> +		vdev->bar[pci_space] = pci_iomap(pdev, pci_space, 0);
> +	io = vdev->bar[pci_space];
> +	end = pci_resource_len(pdev, pci_space);
> +	if (pos > end)
> +		return -EINVAL;
> +	if (pos == end)
> +		return 0;
> +	if (pos + count > end)
> +		count = end - pos;
> +	if (write) {
> +		if (copy_from_user(io + pos, buf, count))
> +			return -EFAULT;
> +	} else {
> +		if (copy_to_user(buf, io + pos, count))
> +			return -EFAULT;
> +	}
> +	*ppos += count;
> +	return count;
> +}
> diff -uprN linux-2.6.34/drivers/vfio/vfio_sysfs.c vfio-linux-2.6.34/drivers/vfio/vfio_sysfs.c
> --- linux-2.6.34/drivers/vfio/vfio_sysfs.c	1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/drivers/vfio/vfio_sysfs.c	2010-05-28 14:04:34.000000000 -0700
> @@ -0,0 +1,153 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/device.h>
> +#include <linux/kobject.h>
> +#include <linux/sysfs.h>
> +#include <linux/mm.h>
> +#include <linux/fs.h>
> +#include <linux/idr.h>
> +#include <linux/pci.h>
> +#include <linux/mmu_notifier.h>
> +
> +#include <linux/vfio.h>
> +
> +struct vfio_class *vfio_class;
> +
> +int vfio_class_init(void)
> +{
> +	int ret = 0;
> +
> +	if (vfio_class != NULL) {
> +		kref_get(&vfio_class->kref);
> +		goto exit;
> +	}
> +
> +	vfio_class = kzalloc(sizeof(*vfio_class), GFP_KERNEL);
> +	if (!vfio_class) {
> +		ret = -ENOMEM;
> +		goto err_kzalloc;
> +	}
> +
> +	kref_init(&vfio_class->kref);
> +	vfio_class->class = class_create(THIS_MODULE, "vfio");
> +	if (IS_ERR(vfio_class->class)) {
> +		ret = PTR_ERR(vfio_class->class);
> +		printk(KERN_ERR "class_create failed for vfio\n");
> +		goto err_class_create;
> +	}
> +	return 0;
> +
> +err_class_create:
> +	kfree(vfio_class);
> +	vfio_class = NULL;
> +err_kzalloc:
> +exit:
> +	return ret;
> +}
> +
> +static void vfio_class_release(struct kref *kref)
> +{
> +	/* Ok, we cheat as we know we only have one vfio_class */
> +	class_destroy(vfio_class->class);
> +	kfree(vfio_class);
> +	vfio_class = NULL;
> +}
> +
> +void vfio_class_destroy(void)
> +{
> +	if (vfio_class)
> +		kref_put(&vfio_class->kref, vfio_class_release);
> +}
> +
> +ssize_t config_map_read(struct kobject *kobj, struct bin_attribute *bin_attr,
> +		char *buf, loff_t off, size_t count)
> +{
> +	struct vfio_dev *vdev = bin_attr->private;
> +	int ret;
> +
> +	if (off >= 256)
> +		return 0;
> +	if (off + count > 256)
> +		count = 256 - off;
> +	if (vdev->pci_config_map == NULL) {
> +		ret = vfio_build_config_map(vdev);
> +		if (ret < 0)
> +			return ret;
> +	}
> +	memcpy(buf, vdev->pci_config_map + off, count);
> +	return count;
> +}
> +
> +static ssize_t show_locked_pages(struct device *dev,
> +				 struct device_attribute *attr,
> +				 char *buf)
> +{
> +	struct vfio_dev *vdev = dev_get_drvdata(dev);
> +
> +	if (vdev == NULL)
> +		return -ENODEV;
> +	return sprintf(buf, "%u\n", vdev->locked_pages);
> +}
> +
> +static DEVICE_ATTR(locked_pages, S_IRUGO, show_locked_pages, NULL);
> +
> +static struct attribute *vfio_attrs[] = {
> +	&dev_attr_locked_pages.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group vfio_attr_grp = {
> +	.attrs = vfio_attrs,
> +};
> +
> +struct bin_attribute config_map_bin_attribute = {
> +	.attr	= {
> +		.name = "config_map",
> +		.mode = S_IRUGO,
> +	},
> +	.size	= 256,
> +	.read	= config_map_read,
> +};
> +
> +int vfio_dev_add_attributes(struct vfio_dev *vdev)
> +{
> +	struct bin_attribute *bi;
> +	int ret;
> +
> +	ret = sysfs_create_group(&vdev->dev->kobj, &vfio_attr_grp);
> +	if (ret)
> +		return ret;
> +	bi = kmalloc(sizeof(*bi), GFP_KERNEL);
> +	if (bi == NULL)
> +		return -ENOMEM;
> +	*bi = config_map_bin_attribute;
> +	bi->private = vdev;
> +	return sysfs_create_bin_file(&vdev->dev->kobj, bi);
> +}
> diff -uprN linux-2.6.34/include/linux/vfio.h vfio-linux-2.6.34/include/linux/vfio.h
> --- linux-2.6.34/include/linux/vfio.h	1969-12-31 16:00:00.000000000 -0800
> +++ vfio-linux-2.6.34/include/linux/vfio.h	2010-05-28 14:29:49.000000000 -0700
> @@ -0,0 +1,193 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + */
> +
> +/*
> + * VFIO driver - allow mapping and use of certain PCI devices
> + * in unprivileged user processes. (If IOMMU is present)
> + * Especially useful for Virtual Function parts of SR-IOV devices
> + */
> +
> +#ifdef __KERNEL__
> +
> +struct vfio_dev {
> +	struct device	*dev;
> +	struct pci_dev	*pdev;
> +	u8		*pci_config_map;
> +	int		pci_config_size;
> +	char		name[8];
> +	int		devnum;
> +	int		pmaster;
> +	void __iomem	*bar[PCI_ROM_RESOURCE+1];
> +	spinlock_t	lock; /* guards command register accesses */
> +	int		listeners;
> +	int		mapcount;
> +	u32		locked_pages;
> +	struct mutex	gate;
> +	struct msix_entry	*msix;
> +	int			nvec;
> +	struct iommu_domain	*domain;
> +	int			cachec;
> +	struct eventfd_ctx	*ev_irq;
> +	struct eventfd_ctx	*ev_msi;
> +	struct eventfd_ctx	**ev_msix;
> +	struct {
> +		u8	intr;
> +		u8	bardirty;
> +		u8	rombar[4];
> +		u8	bar[6*4];
> +		u8	msi[24];
> +	} vinfo;

Add some comments + named constants above?

> +};
> +
> +struct vfio_listener {
> +	struct vfio_dev	*vdev;
> +	struct list_head	dm_list;
> +	struct mm_struct	*mm;
> +	struct mmu_notifier	mmu_notifier;
> +};
> +
> +/*
> + * Structure for keeping track of memory nailed down by the
> + * user for DMA
> + */
> +struct dma_map_page {
> +	struct list_head list;
> +	struct page     **pages;
> +	struct scatterlist *sg;
> +	dma_addr_t      daddr;
> +	unsigned long	vaddr;
> +	int		npage;
> +	int		rdwr;
> +};
> +
> +/* VFIO class infrastructure */
> +struct vfio_class {
> +	struct kref kref;
> +	struct class *class;
> +};
> +extern struct vfio_class *vfio_class;
> +
> +ssize_t vfio_io_readwrite(int, struct vfio_dev *,
> +			char __user *, size_t, loff_t *);
> +ssize_t vfio_mem_readwrite(int, struct vfio_dev *,
> +			char __user *, size_t, loff_t *);
> +ssize_t vfio_config_readwrite(int, struct vfio_dev *,
> +			char __user *, size_t, loff_t *);
> +
> +void vfio_disable_msi(struct vfio_dev *);
> +void vfio_disable_msix(struct vfio_dev *);
> +int vfio_enable_msi(struct vfio_dev *, int);
> +int vfio_enable_msix(struct vfio_dev *, int, void __user *);
> +
> +#ifndef PCI_MSIX_ENTRY_SIZE
> +#define	PCI_MSIX_ENTRY_SIZE	16
> +#endif
> +#ifndef PCI_STATUS_INTERRUPT
> +#define	PCI_STATUS_INTERRUPT	0x08
> +#endif
> +
> +struct vfio_dma_map;
> +void vfio_dma_unmapall(struct vfio_listener *);
> +int vfio_dma_unmap_dm(struct vfio_listener *, struct vfio_dma_map *);
> +int vfio_dma_map_common(struct vfio_listener *, unsigned int,
> +			struct vfio_dma_map *);
> +
> +int vfio_class_init(void);
> +void vfio_class_destroy(void);
> +int vfio_dev_add_attributes(struct vfio_dev *);
> +extern struct idr vfio_idr;
> +extern struct mutex vfio_minor_lock;
> +int vfio_build_config_map(struct vfio_dev *);
> +
> +irqreturn_t vfio_interrupt(int, void *);
> +
> +#endif	/* __KERNEL__ */
> +
> +/* Kernel & User level defines for ioctls */
> +
> +/*
> + * Structure for DMA mapping of user buffers
> + * vaddr, dmaaddr, and size must all be page aligned
> + * buffer may only be larger than 1 page if (a) there is
> + * an iommu in the system, or (b) buffer is part of a huge page
> + */
> +struct vfio_dma_map {
> +	__u64	vaddr;		/* process virtual addr */
> +	__u64	dmaaddr;	/* desired and/or returned dma address */
> +	__u64	size;		/* size in bytes */
> +	int	rdwr;		/* bool: 0 for r/o; 1 for r/w */
> +};
> +
> +/* map user pages at any dma address */
> +#define	VFIO_DMA_MAP_ANYWHERE	_IOWR(';', 100, struct vfio_dma_map)
> +
> +/* map user pages at specific dma address */
> +#define	VFIO_DMA_MAP_IOVA	_IOWR(';', 101, struct vfio_dma_map)
> +
> +/* unmap user pages */
> +#define	VFIO_DMA_UNMAP		_IOW(';', 102, struct vfio_dma_map)
> +
> +/* set device DMA mask & master status */
> +#define	VFIO_DMA_MASK		_IOW(';', 103, __u64)
> +
> +/* request IRQ interrupts; use given eventfd */
> +#define	VFIO_EVENTFD_IRQ		_IOW(';', 104, int)
> +
> +/* request MSI interrupts; use given eventfd */
> +#define	VFIO_EVENTFD_MSI		_IOW(';', 105, int)
> +
> +/* Request MSI-X interrupts: arg[0] is #, arg[1-n] are eventfds */
> +#define	VFIO_EVENTFDS_MSIX	_IOW(';', 106, int)
> +
> +/* Get length of a BAR */
> +#define	VFIO_BAR_LEN		_IOWR(';', 107, __u32)
> +
> +/*
> + * Reads, writes, and mmaps determine which PCI BAR (or config space)
> + * from the high level bits of the file offset
> + */
> +#define	VFIO_PCI_BAR0_RESOURCE		0x0
> +#define	VFIO_PCI_BAR1_RESOURCE		0x1
> +#define	VFIO_PCI_BAR2_RESOURCE		0x2
> +#define	VFIO_PCI_BAR3_RESOURCE		0x3
> +#define	VFIO_PCI_BAR4_RESOURCE		0x4
> +#define	VFIO_PCI_BAR5_RESOURCE		0x5
> +#define	VFIO_PCI_ROM_RESOURCE		0x6
> +#define	VFIO_PCI_CONFIG_RESOURCE	0xF
> +#define	VFIO_PCI_SPACE_SHIFT	32
> +#define VFIO_PCI_CONFIG_OFF vfio_pci_space_to_offset(VFIO_PCI_CONFIG_RESOURCE)
> +
> +static inline int vfio_offset_to_pci_space(__u64 off)
> +{
> +	return (off >> VFIO_PCI_SPACE_SHIFT) & 0xF;
> +}
> +
> +static __u64 vfio_pci_space_to_offset(int sp)
> +{
> +	return (__u64)(sp) << VFIO_PCI_SPACE_SHIFT;
> +}
> diff -uprN linux-2.6.34/MAINTAINERS vfio-linux-2.6.34/MAINTAINERS
> --- linux-2.6.34/MAINTAINERS	2010-05-16 14:17:36.000000000 -0700
> +++ vfio-linux-2.6.34/MAINTAINERS	2010-05-28 12:30:21.000000000 -0700
> @@ -5968,6 +5968,13 @@ S:	Maintained
>  F:	Documentation/fb/uvesafb.txt
>  F:	drivers/video/uvesafb.*
>  
> +VFIO DRIVER
> +M:	Tom Lyon <pugs@cisco.com>
> +S:	Supported
> +F:	Documentation/vfio.txt
> +F:	drivers/vfio/
> +F:	include/linux/vfio.h
> +
>  VFAT/FAT/MSDOS FILESYSTEM
>  M:	OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
>  S:	Maintained

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-30 12:19 ` Michael S. Tsirkin
@ 2010-05-30 12:27   ` Avi Kivity
  2010-05-30 12:49     ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-05-30 12:27 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, gregkh, aafabbri,
	scofeldm

On 05/30/2010 03:19 PM, Michael S. Tsirkin wrote:
> On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote:
>    
>> The VFIO "driver" is used to allow privileged AND non-privileged processes to
>> implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
>> devices.
>> 	Signed-off-by: Tom Lyon <pugs@cisco.com>
>> ---
>> This patch is the evolution of code which was first proposed as a patch to
>> uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
>> out of the uio framework, and things seem much cleaner. Of course, there is
>> a lot of functional overlap with uio, but the previous version just seemed
>> like a giant mode switch in the uio code that did not lead to clarity for
>> either the new or old code.
>>      
> IMO this was because this driver does two things: programming iommu and
> handling interrupts. uio does interrupt handling.
> We could have moved iommu / DMA programming to
> a separate driver, and have uio work with it.
> This would solve limitation of the current driver
> that is needs an iommu domain per device.
>    

How do we enforce security then?  We need to ensure that unprivileged 
users can only use the device with an iommu.

>> [a pony for avi...]
>> The major new functionality in this version is the ability to deal with
>> PCI config space accesses (through read & write calls) - but includes table
>> driven code to determine what's safe to write and what is not.
>>      
> I don't really see why this is helpful: a driver written correctly
> will not access these addresses, and we need an iommu anyway to protect
> us against a bad driver.
>    

Haven't reviewed the code (yet) but things like the BARs, MSI, and 
interrupt disable need to be protected from the guest regardless of the 
iommu.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-30 12:27   ` Avi Kivity
@ 2010-05-30 12:49     ` Michael S. Tsirkin
  2010-05-30 13:01       ` Avi Kivity
  0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-05-30 12:49 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, gregkh, aafabbri,
	scofeldm

On Sun, May 30, 2010 at 03:27:05PM +0300, Avi Kivity wrote:
> On 05/30/2010 03:19 PM, Michael S. Tsirkin wrote:
>> On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote:
>>    
>>> The VFIO "driver" is used to allow privileged AND non-privileged processes to
>>> implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
>>> devices.
>>> 	Signed-off-by: Tom Lyon <pugs@cisco.com>
>>> ---
>>> This patch is the evolution of code which was first proposed as a patch to
>>> uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
>>> out of the uio framework, and things seem much cleaner. Of course, there is
>>> a lot of functional overlap with uio, but the previous version just seemed
>>> like a giant mode switch in the uio code that did not lead to clarity for
>>> either the new or old code.
>>>      
>> IMO this was because this driver does two things: programming iommu and
>> handling interrupts. uio does interrupt handling.
>> We could have moved iommu / DMA programming to
>> a separate driver, and have uio work with it.
>> This would solve limitation of the current driver
>> that is needs an iommu domain per device.
>>    
>
> How do we enforce security then?  We need to ensure that unprivileged  
> users can only use the device with an iommu.

Force assigning to iommu before we allow any other operation?

>>> [a pony for avi...]
>>> The major new functionality in this version is the ability to deal with
>>> PCI config space accesses (through read & write calls) - but includes table
>>> driven code to determine what's safe to write and what is not.
>>>      
>> I don't really see why this is helpful: a driver written correctly
>> will not access these addresses, and we need an iommu anyway to protect
>> us against a bad driver.
>>    
>
> Haven't reviewed the code (yet) but things like the BARs, MSI, and  
> interrupt disable need to be protected from the guest regardless of the  
> iommu.

Yes but userspace can do this. As long as userspace can not
crash the kernel, no reason to put this policy into kernel.

>
> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-28 23:07 [PATCH] VFIO driver: Non-privileged user level PCI drivers Tom Lyon
                   ` (3 preceding siblings ...)
  2010-05-30 12:19 ` Michael S. Tsirkin
@ 2010-05-30 12:59 ` Avi Kivity
  2010-05-31 17:17 ` Alan Cox
  5 siblings, 0 replies; 66+ messages in thread
From: Avi Kivity @ 2010-05-30 12:59 UTC (permalink / raw)
  To: Tom Lyon
  Cc: linux-kernel, kvm, chrisw, joro, hjk, mst, gregkh, aafabbri, scofeldm

On 05/29/2010 02:07 AM, Tom Lyon wrote:
> The VFIO "driver" is used to allow privileged AND non-privileged processes to
> implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
> devices.
>    

> +
> +Why is this interesting?  Some applications, especially in the high performance
> +computing field, need access to hardware functions with as little overhead as
> +possible. Examples are in network adapters (typically non tcp/ip based) and
> +in compute accelerators - i.e., array processors, FPGA processors, etc.
> +Previous to the VFIO drivers these apps would need either a kernel-level
> +driver (with corresponding overheads), or else root permissions to directly
> +access the hardware. The VFIO driver allows generic access to the hardware
> +from non-privileged apps IF the hardware is "well-behaved" enough for this
> +to be safe.
>    


> +
> +Any SR-IOV virtual function meets the VFIO definition of "well-behaved", but
> +there are many other non-IOV PCI devices which also meet the definition.
> +Elements of this definition are:
> +- The size of any memory BARs to be mmap'ed into the user process space must be
> +  a multiple of the system page size.
>    

You can relax this.
  - smaller than page size can be mapped if the rest of the page is unused 
and if the platform tolerates writes to unused areas
  - if the rest of the page is used, we can relocate the BAR
  - otherwise, we can prevent mmap() but still allow mediated access via 
a syscall

> +- If MSI-X interrupts are used, the device driver must not attempt to mmap or
> +  write the MSI-X vector area.
>    

We can allow mediated access (that's what qemu-kvm does).  I guess the 
ioctls for setting up msi interrupts are equivalent to this mediated access.

(later I see you do provide mediated access via pwrite - please confirm)

> +- The device must not use the PCI configuration space in any non-standard way,
> +  i.e., the user level driver will be permitted only to read and write standard
> +  fields of the PCI config space, and only if those fields cannot cause harm to
> +  the system. In addition, some fields are "virtualized", so that the user
> +  driver can read/write them like a kernel driver, but they do not affect the
> +  real device.
>    

What's wrong with nonstandard fields?

> +
> +Even with these restrictions, there are bound to be devices which are unsafe
> +for user level use - it is still up to the system admin to decide whether to
> +grant access to the device.  When the vfio module is loaded, it will have
> +access to no devices until the desired PCI devices are "bound" to the driver.
> +First, make sure the devices are not bound to another kernel driver. You can
> +unload that driver if you wish to unbind all its devices, or else enter the
> +driver's sysfs directory, and unbind a specific device:
> +	cd /sys/bus/pci/drivers/<drivername>
> +	echo 0000:06:02.00>  unbind
> +(The 0000:06:02.00 is a fully qualified PCI device name - different for each
> +device).  Now, to bind to the vfio driver, go to /sys/bus/pci/drivers/vfio and
> +write the PCI device type of the target device to the new_id file:
> +	echo 8086 10ca>  new_id
> +(8086 10ca are the vendor and device type for the Intel 82576 virtual function
> +devices). A /dev/vfio<N> entry will be created for each device bound. The final
> +step is to grant users permission by changing the mode and/or owner of the /dev
> +entry - "chmod 666 /dev/vfio0".
>    

What if I have several such devices?  Isn't it better to bind by topology 
(device address)?

> +
> +Reads & Writes:
> +
> +The user driver will typically use mmap to access the memory BAR(s) of a
> +device; the I/O BARs and the PCI config space may be accessed through normal
> +read and write system calls. Only 1 file descriptor is needed for all driver
> +functions -- the desired BAR for I/O, memory, or config space is indicated via
> +high-order bits of the file offset.

My preference would be one fd per BAR, but that's a matter of personal 
taste.

> For instance, the following implements a
> +write to the PCI config space:
> +
> +	#include <linux/vfio.h>
> +	void pci_write_config_word(int pci_fd, u16 off, u16 wd)
> +	{
> +		off_t cfg_off = VFIO_PCI_CONFIG_OFF + off;
> +
> +		if (pwrite(pci_fd, &wd, 2, cfg_off) != 2)
> +			perror("pwrite config_word");
> +	}
> +
>    

Nice, has the benefit of avoiding endianness issues in the interface.

> +The routines vfio_pci_space_to_offset and vfio_offset_to_pci_space are provided
> +in vfio.h to convert bar numbers to file offsets and vice-versa.
> +
> +Interrupts:
> +
> +Device interrupts are translated by the vfio driver into input events on event
> +notification file descriptors created by the eventfd system call. The user
> +program must one or more event descriptors and pass them to the vfio driver
> +via ioctls to arrange for the interrupt mapping:
> +1.
> +	efd = eventfd(0, 0);
> +	ioctl(vfio_fd, VFIO_EVENTFD_IRQ, &efd);
> +		This provides an eventfd for traditional IRQ interrupts.
> +		IRQs will be disabled after each interrupt until the driver
> +		re-enables them via the PCI COMMAND register.
>    

My thinking was to emulate a level-triggered interrupt but I think your 
way is better.  For virtualization, it becomes the responsibility of 
user space to multiplex between the guest writing PCI COMMAND and 
userspace writing PCI COMMAND to re-enable interrupts, but that's fine.

> +2.
> +	efd = eventfd(0, 0);
> +	ioctl(vfio_fd, VFIO_EVENTFD_MSI, &efd);
> +		This connects MSI interrupts to an eventfd.
> +3.
> + 	int arg[N+1];
> +	arg[0] = N;
> +	arg[1..N] = eventfd(0, 0);
> +	ioctl(vfio_fd, VFIO_EVENTFDS_MSIX, arg);
> +		This connects N MSI-X interrupts with N eventfds.
> +
> +Waiting and checking for interrupts is done by the user program by reads,
> +polls, or selects on the related event file descriptors.
>    

This all looks nice and clean.

> +
> +DMA:
> +
> +The VFIO driver uses ioctls to allow the user level driver to get DMA
> +addresses which correspond to virtual addresses.  In systems with IOMMUs,
> +each PCI device will have its own address space for DMA operations, so when
> +the user level driver programs the device registers, only addresses known to
> +the IOMMU will be valid, any others will be rejected.  The IOMMU creates the
> +illusion (to the device) that multi-page buffers are physically contiguous,
> +so a single DMA operation can safely span multiple user pages.  Note that
> +the VFIO driver is still useful in systems without IOMMUs, but only for
> +trusted processes which can deal with DMAs which do not span pages (Huge
> +pages count as a single page also).
> +
> +If the user process desires many DMA buffers, it may be wise to do a mapping
> +of a single large buffer, and then allocate the smaller buffers from the
> +large one.
>    

Or use scatter/gather, if the device supports it.

> +
> +The DMA buffers are locked into physical memory for the duration of their
> +existence - until VFIO_DMA_UNMAP is called, until the user pages are
> +unmapped from the user process, or until the vfio file descriptor is closed.
> +The user process must have permission to lock the pages given by the ulimit(-l)
> +command, which in turn relies on settings in the /etc/security/limits.conf
> +file.
> +
> +The vfio_dma_map structure is used as an argument to the ioctls which
> +do the DMA mapping. Its vaddr, dmaaddr, and size fields must always be a
> +multiple of a page. Its rdwr field is zero for read-only (outbound), and
> +non-zero for read/write buffers.
> +
> +	struct vfio_dma_map {
> +		__u64	vaddr;	  /* process virtual addr */
> +		__u64	dmaaddr;  /* desired and/or returned dma address */
> +		__u64	size;	  /* size in bytes */
> +		int	rdwr;	  /* bool: 0 for r/o; 1 for r/w */
> +	};
> +
> +The VFIO_DMA_MAP_ANYWHERE is called with a vfio_dma_map structure as its
> +argument, and returns the structure with a valid dmaaddr field.
> +
> +The VFIO_DMA_MAP_IOVA is called with a vfio_dma_map structure with the
> +dmaaddr field already assigned. The system will attempt to map the DMA
> +buffer into the IO space at the given dmaaddr. This is expected to be
> +useful if KVM or other virtualization facilities use this driver.
> +
> +The VFIO_DMA_UNMAP takes a fully filled vfio_dma_map structure and unmaps
> +the buffer and releases the corresponding system resources.
> +
> +The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA address
> +(device dependent). It takes a single unsigned 64 bit integer as an argument.
> +This call also has the side effect of enabling PCI bus mastering.
>    


How many such mappings can be mapped simultaneously?

Note that you need privileges (RLIMIT_MEMLOCK) to lock memory; this should be 
accounted for.

> +	/* account for locked pages */
> +	locked = npage + current->mm->locked_vm;
> +	lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur
> +			>>  PAGE_SHIFT;
>    

Ah, you already do.

> +/* Kernel & User level defines for ioctls */
> +
> +/*
> + * Structure for DMA mapping of user buffers
> + * vaddr, dmaaddr, and size must all be page aligned
> + * buffer may only be larger than 1 page if (a) there is
> + * an iommu in the system, or (b) buffer is part of a huge page
> + */
> +struct vfio_dma_map {
> +	__u64	vaddr;		/* process virtual addr */
> +	__u64	dmaaddr;	/* desired and/or returned dma address */
> +	__u64	size;		/* size in bytes */
> +	int	rdwr;		/* bool: 0 for r/o; 1 for r/w */
> +};
>    

As noted before, align, add flags, and reserve space.
> +
> +/* Get length of a BAR */
> +#define	VFIO_BAR_LEN		_IOWR(';', 107, __u32)
>    

A 64-bit BAR will overflow on a 32-bit system.

> +
> +/*
> + * Reads, writes, and mmaps determine which PCI BAR (or config space)
> + * from the high level bits of the file offset
> + */
> +#define	VFIO_PCI_BAR0_RESOURCE		0x0
> +#define	VFIO_PCI_BAR1_RESOURCE		0x1
> +#define	VFIO_PCI_BAR2_RESOURCE		0x2
> +#define	VFIO_PCI_BAR3_RESOURCE		0x3
> +#define	VFIO_PCI_BAR4_RESOURCE		0x4
> +#define	VFIO_PCI_BAR5_RESOURCE		0x5
> +#define	VFIO_PCI_ROM_RESOURCE		0x6
> +#define	VFIO_PCI_CONFIG_RESOURCE	0xF
> +#define	VFIO_PCI_SPACE_SHIFT	32
>    

64-bit BARs break this.  51 would be a good value for x86 systems (the 
PTE format makes bits 52:62 available to software, so the address space 
cannot grow beyond 2PB).

> +#define VFIO_PCI_CONFIG_OFF vfio_pci_space_to_offset(VFIO_PCI_CONFIG_RESOURCE)
> +
> +static inline int vfio_offset_to_pci_space(__u64 off)
> +{
> +	return (off >> VFIO_PCI_SPACE_SHIFT) & 0xF;
> +}
> +
> +static __u64 vfio_pci_space_to_offset(int sp)
> +{
> +	return (__u64)(sp) << VFIO_PCI_SPACE_SHIFT;
> +}
>    


Needs to be inline too.

Suggest the last function also take the offset, and add a function to 
extract the offset from a space/offset combo.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-30 12:49     ` Michael S. Tsirkin
@ 2010-05-30 13:01       ` Avi Kivity
  2010-05-30 13:03         ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-05-30 13:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, gregkh, aafabbri,
	scofeldm

On 05/30/2010 03:49 PM, Michael S. Tsirkin wrote:
> On Sun, May 30, 2010 at 03:27:05PM +0300, Avi Kivity wrote:
>    
>> On 05/30/2010 03:19 PM, Michael S. Tsirkin wrote:
>>      
>>> On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote:
>>>
>>>        
>>>> The VFIO "driver" is used to allow privileged AND non-privileged processes to
>>>> implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
>>>> devices.
>>>> 	Signed-off-by: Tom Lyon <pugs@cisco.com>
>>>> ---
>>>> This patch is the evolution of code which was first proposed as a patch to
>>>> uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
>>>> out of the uio framework, and things seem much cleaner. Of course, there is
>>>> a lot of functional overlap with uio, but the previous version just seemed
>>>> like a giant mode switch in the uio code that did not lead to clarity for
>>>> either the new or old code.
>>>>
>>>>          
>>> IMO this was because this driver does two things: programming iommu and
>>> handling interrupts. uio does interrupt handling.
>>> We could have moved iommu / DMA programming to
>>> a separate driver, and have uio work with it.
>>> This would solve a limitation of the current driver:
>>> that it needs an iommu domain per device.
>>>
>>>        
>> How do we enforce security then?  We need to ensure that unprivileged
>> users can only use the device with an iommu.
>>      
> Force assigning to iommu before we allow any other operation?
>    

That means the driver must be aware of the iommu.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-30 13:01       ` Avi Kivity
@ 2010-05-30 13:03         ` Michael S. Tsirkin
  2010-05-30 13:13           ` Avi Kivity
  0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-05-30 13:03 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, gregkh, aafabbri,
	scofeldm

On Sun, May 30, 2010 at 04:01:53PM +0300, Avi Kivity wrote:
> On 05/30/2010 03:49 PM, Michael S. Tsirkin wrote:
>> On Sun, May 30, 2010 at 03:27:05PM +0300, Avi Kivity wrote:
>>    
>>> On 05/30/2010 03:19 PM, Michael S. Tsirkin wrote:
>>>      
>>>> On Fri, May 28, 2010 at 04:07:38PM -0700, Tom Lyon wrote:
>>>>
>>>>        
>>>>> The VFIO "driver" is used to allow privileged AND non-privileged processes to
>>>>> implement user-level device drivers for any well-behaved PCI, PCI-X, and PCIe
>>>>> devices.
>>>>> 	Signed-off-by: Tom Lyon <pugs@cisco.com>
>>>>> ---
>>>>> This patch is the evolution of code which was first proposed as a patch to
>>>>> uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
>>>>> out of the uio framework, and things seem much cleaner. Of course, there is
>>>>> a lot of functional overlap with uio, but the previous version just seemed
>>>>> like a giant mode switch in the uio code that did not lead to clarity for
>>>>> either the new or old code.
>>>>>
>>>>>          
>>>> IMO this was because this driver does two things: programming iommu and
>>>> handling interrupts. uio does interrupt handling.
>>>> We could have moved iommu / DMA programming to
>>>> a separate driver, and have uio work with it.
>>>> This would solve a limitation of the current driver:
>>>> that it needs an iommu domain per device.
>>>>
>>>>        
>>> How do we enforce security then?  We need to ensure that unprivileged
>>> users can only use the device with an iommu.
>>>      
>> Force assigning to iommu before we allow any other operation?
>>    
>
> That means the driver must be aware of the iommu.

The userspace driver? Yes. And it is a good thing to be explicit
there anyway, since this lets userspace map a non-contiguous
virtual address list into a contiguous bus address range.
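As a rough illustration of that point: userspace can place arbitrarily scattered virtual chunks back to back in bus space. The struct fields below are modeled loosely on the patch's dma-map ioctl argument and are otherwise hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-chunk mapping request: user virtual address,
 * length, and the bus (IO virtual) address it should appear at. */
struct dma_map_req {
	uint64_t vaddr;
	uint64_t size;
	uint64_t dmaaddr;
};

/* Lay out non-contiguous virtual chunks back to back in bus space,
 * starting at iova_base.  Returns the bus address one past the end. */
static uint64_t build_contiguous_iova(struct dma_map_req *reqs,
				      const uint64_t *vaddrs,
				      const uint64_t *sizes,
				      size_t n, uint64_t iova_base)
{
	uint64_t iova = iova_base;
	size_t i;

	for (i = 0; i < n; i++) {
		reqs[i].vaddr = vaddrs[i];
		reqs[i].size = sizes[i];
		reqs[i].dmaaddr = iova;	/* contiguous in bus space */
		iova += sizes[i];
	}
	return iova;
}
```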

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-30 13:03         ` Michael S. Tsirkin
@ 2010-05-30 13:13           ` Avi Kivity
  2010-05-30 14:53             ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-05-30 13:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, gregkh, aafabbri,
	scofeldm

On 05/30/2010 04:03 PM, Michael S. Tsirkin wrote:
>
>    
>>>>> IMO this was because this driver does two things: programming iommu and
>>>>> handling interrupts. uio does interrupt handling.
>>>>> We could have moved iommu / DMA programming to
>>>>> a separate driver, and have uio work with it.
>>>>>> This would solve a limitation of the current driver:
>>>>>> that it needs an iommu domain per device.
>>>>>
>>>>>
>>>>>            
>>>> How do we enforce security then?  We need to ensure that unprivileged
>>>> users can only use the device with an iommu.
>>>>
>>>>          
>>> Force assigning to iommu before we allow any other operation?
>>>
>>>        
>> That means the driver must be aware of the iommu.
>>      
> The userspace driver? Yes. And it is a good thing to be explicit
> there anyway, since this lets userspace map a non-contiguous
> virtual address list into a contiguous bus address range.
>    

No, the kernel driver.  It cannot allow userspace to enable bus 
mastering unless it knows the iommu is enabled for the device and remaps 
dma to user pages.


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-30 13:13           ` Avi Kivity
@ 2010-05-30 14:53             ` Michael S. Tsirkin
  2010-05-31 11:50               ` Avi Kivity
  0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-05-30 14:53 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, gregkh, aafabbri,
	scofeldm

On Sun, May 30, 2010 at 04:13:59PM +0300, Avi Kivity wrote:
> On 05/30/2010 04:03 PM, Michael S. Tsirkin wrote:
>>
>>    
>>>>>> IMO this was because this driver does two things: programming iommu and
>>>>>> handling interrupts. uio does interrupt handling.
>>>>>> We could have moved iommu / DMA programming to
>>>>>> a separate driver, and have uio work with it.
>>>>>> This would solve a limitation of the current driver:
>>>>>> that it needs an iommu domain per device.
>>>>>>
>>>>>>
>>>>>>            
>>>>> How do we enforce security then?  We need to ensure that unprivileged
>>>>> users can only use the device with an iommu.
>>>>>
>>>>>          
>>>> Force assigning to iommu before we allow any other operation?
>>>>
>>>>        
>>> That means the driver must be aware of the iommu.
>>>      
>> The userspace driver? Yes. And it is a good thing to be explicit
>> there anyway, since this lets userspace map a non-contiguous
>> virtual address list into a contiguous bus address range.
>>    
>
> No, the kernel driver.  It cannot allow userspace to enable bus  
> mastering unless it knows the iommu is enabled for the device and remaps  
> dma to user pages.

So what I suggested is failing any kind of access until iommu
is assigned.
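A minimal sketch of such a gate (the struct and field here are stand-ins, not the patch's actual types): every file operation on the device would bail out until a domain has been assigned.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-in for the device state; 'domain' is a hypothetical marker
 * for "an IOMMU domain has been assigned to this device". */
struct vfio_dev_state {
	void *domain;
};

/* Called at the top of every mmap/read/write/ioctl path: refuse all
 * access until userspace has bound the device to an IOMMU domain. */
static int vfio_check_iommu_assigned(const struct vfio_dev_state *vdev)
{
	if (!vdev->domain)
		return -EINVAL;
	return 0;
}
```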

>
> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-30 14:53             ` Michael S. Tsirkin
@ 2010-05-31 11:50               ` Avi Kivity
  2010-05-31 17:10                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-05-31 11:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, gregkh, aafabbri,
	scofeldm

On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote:
>
> So what I suggested is failing any kind of access until iommu
> is assigned.
>    

So, the kernel driver must be aware of the iommu.  In which case it may 
as well program it.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-31 11:50               ` Avi Kivity
@ 2010-05-31 17:10                 ` Michael S. Tsirkin
  2010-06-01  8:10                   ` Avi Kivity
  0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-05-31 17:10 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, gregkh, aafabbri,
	scofeldm

On Mon, May 31, 2010 at 02:50:29PM +0300, Avi Kivity wrote:
> On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote:
>>
>> So what I suggested is failing any kind of access until iommu
>> is assigned.
>>    
>
> So, the kernel driver must be aware of the iommu.  In which case it may  
> as well program it.

It's a kernel driver anyway. Point is that
the *device* driver is better off not programming iommu,
this way we do not need to reprogram it for each device.

> -- 
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-28 23:07 [PATCH] VFIO driver: Non-privileged user level PCI drivers Tom Lyon
                   ` (4 preceding siblings ...)
  2010-05-30 12:59 ` Avi Kivity
@ 2010-05-31 17:17 ` Alan Cox
  2010-06-01 21:29   ` Tom Lyon
  5 siblings, 1 reply; 66+ messages in thread
From: Alan Cox @ 2010-05-31 17:17 UTC (permalink / raw)
  To: Tom Lyon
  Cc: linux-kernel, kvm, chrisw, joro, hjk, mst, avi, gregkh, aafabbri,
	scofeldm


> +/*
> + * Map usr buffer at specific IO virtual address
> + */
> +static int vfio_dma_map_iova(

> +	mlp = kzalloc(sizeof *mlp, GFP_KERNEL);

Not good at that point. I think you need to allocate it first, error if
it can't be allocated and then do the work and free it on error ?


> +	mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> +	mlp->pages = pages;

Ditto


> +int vfio_enable_msix(struct vfio_dev *vdev, int nvec, void __user *uarg)
> +{
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct eventfd_ctx *ctx;
> +	int ret = 0;
> +	int i;
> +	int fd;
> +
> +	vdev->msix = kzalloc(nvec * sizeof(struct msix_entry),
> +				GFP_KERNEL);
> +	vdev->ev_msix = kzalloc(nvec * sizeof(struct eventfd_ctx *),
> +				GFP_KERNEL);

These don't seem to get freed on the error path - or indeed protected
against being allocated twice (eg two parallel ioctls ?)
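One way to address both points, sketched as userspace C with calloc/free standing in for kzalloc/kfree and the element size chosen arbitrarily in place of sizeof(struct msix_entry); in the driver this whole path would also need to run under a per-device lock.

```c
#include <stdlib.h>

struct msix_state {
	void *msix;
	void *ev_msix;
};

/* Allocate everything up front, attempt the fallible work, and on any
 * failure free whatever was allocated and reset the pointers so the
 * state stays clean.  The up-front NULL check refuses a second,
 * overlapping allocation. */
static int enable_msix_sketch(struct msix_state *s, int nvec,
			      int (*do_work)(struct msix_state *))
{
	int ret = -1;

	if (s->msix || s->ev_msix)	/* already allocated: refuse */
		return -1;
	s->msix = calloc(nvec, 16);	/* 16 ~ sizeof(struct msix_entry) */
	s->ev_msix = calloc(nvec, sizeof(void *));
	if (!s->msix || !s->ev_msix)
		goto err;
	ret = do_work(s);
	if (ret == 0)
		return 0;
err:
	free(s->msix);
	free(s->ev_msix);
	s->msix = NULL;
	s->ev_msix = NULL;
	return ret;
}
```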



> +	case VFIO_DMA_MAP_ANYWHERE:
> +	case VFIO_DMA_MAP_IOVA:
> +		if (copy_from_user(&dm, uarg, sizeof dm))
> +			return -EFAULT;
> +		ret = vfio_dma_map_common(listener, cmd, &dm);
> +		if (!ret && copy_to_user(uarg, &dm, sizeof dm))

So the vfio_dma_map is untrusted. That seems to be checked ok later but
the dma_map_common code then plays in current->mm-> without apparently
holding any locks to stop the values getting corrupted by a parallel
mlock ?

Actually no, I take that back:

	dmp->size is 64bit

	So npage can end up with values like 0xFFFFFFFF and cause 32bit
	boxes to go kerblam
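A sketch of the kind of check that avoids the 32-bit blow-up: derive the page count from the untrusted 64-bit size without overflowing, and cap it (both PAGE_SIZE and the cap value here are illustrative).

```c
#include <stdint.h>

#define PAGE_SIZE 4096u
#define MAX_DMA_PAGES (1u << 20)	/* illustrative cap: 4GB of 4K pages */

/* Derive a page count from the untrusted 64-bit dmp->size without
 * letting values like 0xFFFFFFFFFFFFFFFF wrap a 32-bit npage. */
static int size_to_npage(uint64_t size, uint32_t *npage)
{
	uint64_t pages;

	if (size == 0)
		return -1;
	/* round up without the classic size + PAGE_SIZE - 1 overflow */
	pages = size / PAGE_SIZE + (size % PAGE_SIZE ? 1 : 0);
	if (pages > MAX_DMA_PAGES)
		return -1;
	*npage = (uint32_t)pages;
	return 0;
}
```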

> +
> +	case VFIO_EVENTFD_IRQ:
> +		if (copy_from_user(&fd, uarg, sizeof fd))
> +			return -EFAULT;
> +		if (vdev->ev_irq)
> +			eventfd_ctx_put(vdev->ev_irq);

These paths need locking - suppose two EVENTFD irq ioctls occur at once
(in general these paths seem not to be covered)
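The usual fix is to swap the context under a mutex and put the old one outside the lock; below is a userspace sketch with pthreads standing in for a kernel mutex and void pointers standing in for struct eventfd_ctx.

```c
#include <pthread.h>
#include <stddef.h>

struct irq_state {
	pthread_mutex_t lock;
	void *ev_irq;
};

/* Two racing VFIO_EVENTFD_IRQ ioctls could otherwise both see the
 * old context and put it twice, or leak one.  Swap under the lock;
 * the caller does the eventfd_ctx_put() on the returned old context
 * outside the lock. */
static void *irq_ctx_replace(struct irq_state *s, void *new_ctx)
{
	void *old;

	pthread_mutex_lock(&s->lock);
	old = s->ev_irq;
	s->ev_irq = new_ctx;
	pthread_mutex_unlock(&s->lock);
	return old;
}
```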

>
> +	case VFIO_BAR_LEN:
> +		if (copy_from_user(&bar, uarg, sizeof bar))
> +			return -EFAULT;
> +		if (bar < 0 || bar > PCI_ROM_RESOURCE)
> +			return -EINVAL;
> +		bar = pci_resource_len(pdev, bar);
> +		if (copy_to_user(uarg, &bar, sizeof bar))
> +			return -EFAULT;

How does this all work out if the device is a bridge ?

> +	pci_read_config_byte(pdev, PCI_INTERRUPT_LINE, &line);
> +	if (line == 0)
> +		goto out;

That may produce some interestingly wrong answers. Firstly the platform
has interrupt abstraction so dev->irq may not match PCI_INTERRUPT_LINE,
secondly you have devices that report their IRQ via other paths as per
spec (notably IDE class devices in non-native mode)

So that would also want extra checks.


> +	pci_read_config_word(pdev, PCI_COMMAND, &orig);
> +	ret = orig & PCI_COMMAND_MASTER;
> +	if (!ret) {
> +		new = orig | PCI_COMMAND_MASTER;
> +		pci_write_config_word(pdev, PCI_COMMAND, new);
> +		pci_read_config_word(pdev, PCI_COMMAND, &new);
> +		ret = new & PCI_COMMAND_MASTER;
> +		pci_write_config_word(pdev, PCI_COMMAND, orig);

The master bit on some devices can be turned on but not off. Not sure it
matters here.

> +	vdev->pdev = pdev;

Probably best to take/drop a reference. Not needed if you can prove your
last use is before the end of the remove path though.


Does look like it needs a locking audit, some memory and error checks
reviewing and some further review of the ioctl security and
overflows/trusted values.

Rather a nice way of attacking the user space PCI problem.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-31 17:10                 ` Michael S. Tsirkin
@ 2010-06-01  8:10                   ` Avi Kivity
  2010-06-01  9:55                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-06-01  8:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, gregkh, aafabbri,
	scofeldm

On 05/31/2010 08:10 PM, Michael S. Tsirkin wrote:
> On Mon, May 31, 2010 at 02:50:29PM +0300, Avi Kivity wrote:
>    
>> On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote:
>>      
>>> So what I suggested is failing any kind of access until iommu
>>> is assigned.
>>>
>>>        
>> So, the kernel driver must be aware of the iommu.  In which case it may
>> as well program it.
>>      
> It's a kernel driver anyway. Point is that
> the *device* driver is better off not programming iommu,
> this way we do not need to reprogram it for each device.
>    

The device driver is in userspace.  It can't program the iommu.  What 
the patch proposes is that userspace tells vfio about the needed 
mappings, and vfio programs the iommu.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-01  8:10                   ` Avi Kivity
@ 2010-06-01  9:55                     ` Michael S. Tsirkin
  2010-06-01 10:28                       ` Avi Kivity
  2010-06-02  9:42                       ` Joerg Roedel
  0 siblings, 2 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-06-01  9:55 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, gregkh, aafabbri,
	scofeldm

On Tue, Jun 01, 2010 at 11:10:45AM +0300, Avi Kivity wrote:
> On 05/31/2010 08:10 PM, Michael S. Tsirkin wrote:
>> On Mon, May 31, 2010 at 02:50:29PM +0300, Avi Kivity wrote:
>>    
>>> On 05/30/2010 05:53 PM, Michael S. Tsirkin wrote:
>>>      
>>>> So what I suggested is failing any kind of access until iommu
>>>> is assigned.
>>>>
>>>>        
>>> So, the kernel driver must be aware of the iommu.  In which case it may
>>> as well program it.
>>>      
>> It's a kernel driver anyway. Point is that
>> the *device* driver is better off not programming iommu,
>> this way we do not need to reprogram it for each device.
>>    
>
> The device driver is in userspace.

I mean the kernel driver that grants userspace the access.

>  It can't program the iommu.
> What  
> the patch proposes is that userspace tells vfio about the needed  
> mappings, and vfio programs the iommu.

There seems to be some misunderstanding.  The userspace interface
proposed forces a separate domain per device and forces userspace to
repeat iommu programming for each device.  We are better off sharing a
domain between devices and programming the iommu once.

The natural way to do this is to have an iommu driver for programming
iommu.

This likely means we will have to pass the domain to 'vfio' or uio or
whatever the driver that gives userspace access to the device is called,
but this is only for security, there's no need to support programming
iommu there.

And using this design means the uio framework changes
required would be minor, so we won't have to duplicate code.

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-01  9:55                     ` Michael S. Tsirkin
@ 2010-06-01 10:28                       ` Avi Kivity
  2010-06-01 10:46                         ` Michael S. Tsirkin
  2010-06-02  4:29                         ` Alex Williamson
  2010-06-02  9:42                       ` Joerg Roedel
  1 sibling, 2 replies; 66+ messages in thread
From: Avi Kivity @ 2010-06-01 10:28 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, gregkh, aafabbri,
	scofeldm

On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
>
>>   It can't program the iommu.
>> What
>> the patch proposes is that userspace tells vfio about the needed
>> mappings, and vfio programs the iommu.
>>      
> There seems to be some misunderstanding.  The userspace interface
> proposed forces a separate domain per device and forces userspace to
> repeat iommu programming for each device.  We are better off sharing a
> domain between devices and programming the iommu once.
>    

   iommufd = open(/dev/iommu);
   ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
   ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)

?

If so, I agree.
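Expanded into plain C, the proposed flow might look like the helper below. The device node and both ioctl numbers are hypothetical (none of this exists yet), so the helper simply fails on a system without them.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical request numbers for the proposed /dev/iommu object. */
#define IOMMUFD_ASSIGN_RANGE 0x4901
#define VFIO_SET_IOMMU       0x4902

/* Proposed flow: create one iommu object, program it once, then bind
 * any number of vfio devices to it.  Returns 0 on success, -1 on any
 * failure (e.g. when /dev/iommu does not exist). */
static int bind_vfio_to_iommu(const char *vfio_path)
{
	int iommufd, vfiofd, ret = -1;

	iommufd = open("/dev/iommu", O_RDWR);
	if (iommufd < 0)
		return -1;
	vfiofd = open(vfio_path, O_RDWR);
	if (vfiofd < 0)
		goto out_iommu;
	if (ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, 0) < 0)
		goto out_vfio;
	if (ioctl(vfiofd, VFIO_SET_IOMMU, iommufd) < 0)
		goto out_vfio;
	ret = 0;
out_vfio:
	close(vfiofd);
out_iommu:
	close(iommufd);
	return ret;
}
```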

> The natural way to do this is to have an iommu driver for programming
> iommu.
>
> This likely means we will have to pass the domain to 'vfio' or uio or
> whatever the driver that gives userspace access to the device is called,
> but this is only for security, there's no need to support programming
> iommu there.
>
> And using this design means the uio framework changes
> required would be minor, so we won't have to duplicate code.
>    

Since vfio would be the only driver, there would be no duplication.  But 
a separate object for the iommu mapping is a good thing.  Perhaps we can 
even share it with vhost (without actually using the mmu, since vhost is 
software only).

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-01 10:28                       ` Avi Kivity
@ 2010-06-01 10:46                         ` Michael S. Tsirkin
  2010-06-01 12:41                           ` Avi Kivity
  2010-06-01 21:26                           ` Tom Lyon
  2010-06-02  4:29                         ` Alex Williamson
  1 sibling, 2 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-06-01 10:46 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, gregkh, aafabbri,
	scofeldm

On Tue, Jun 01, 2010 at 01:28:48PM +0300, Avi Kivity wrote:
> On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
>>
>>>   It can't program the iommu.
>>> What
>>> the patch proposes is that userspace tells vfio about the needed
>>> mappings, and vfio programs the iommu.
>>>      
>> There seems to be some misunderstanding.  The userspace interface
>> proposed forces a separate domain per device and forces userspace to
>> repeat iommu programming for each device.  We are better off sharing a
>> domain between devices and programming the iommu once.
>>    
>
>   iommufd = open(/dev/iommu);
>   ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
>   ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)
>
> ?

Yes.

> If so, I agree.

Good.

>> The natural way to do this is to have an iommu driver for programming
>> iommu.
>>
>> This likely means we will have to pass the domain to 'vfio' or uio or
>> whatever the driver that gives userspace access to the device is called,
>> but this is only for security, there's no need to support programming
>> iommu there.
>>
>> And using this design means the uio framework changes
>> required would be minor, so we won't have to duplicate code.
>>    
>
> Since vfio would be the only driver, there would be no duplication.  But  
> a separate object for the iommu mapping is a good thing.  Perhaps we can  
> even share it with vhost (without actually using the mmu, since vhost is  
> software only).

Main difference is that vhost works fine with unlocked
memory, paging it in on demand. iommu needs to unmap
memory when it is swapped out or relocated.

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-01 10:46                         ` Michael S. Tsirkin
@ 2010-06-01 12:41                           ` Avi Kivity
  2010-06-02  9:45                             ` Joerg Roedel
  2010-06-01 21:26                           ` Tom Lyon
  1 sibling, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-06-01 12:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Tom Lyon, linux-kernel, kvm, chrisw, joro, hjk, gregkh, aafabbri,
	scofeldm

On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:
>
>> Since vfio would be the only driver, there would be no duplication.  But
>> a separate object for the iommu mapping is a good thing.  Perhaps we can
>> even share it with vhost (without actually using the mmu, since vhost is
>> software only).
>>      
> Main difference is that vhost works fine with unlocked
> memory, paging it in on demand. iommu needs to unmap
> memory when it is swapped out or relocated.
>
>    

So you'd just take the memory map and not pin anything.  This way you 
can reuse the memory map.

But no, it doesn't handle the dirty bitmap, so no go.

-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-01 10:46                         ` Michael S. Tsirkin
  2010-06-01 12:41                           ` Avi Kivity
@ 2010-06-01 21:26                           ` Tom Lyon
  2010-06-02  2:59                             ` Avi Kivity
  1 sibling, 1 reply; 66+ messages in thread
From: Tom Lyon @ 2010-06-01 21:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, linux-kernel, kvm, chrisw, joro, hjk, gregkh,
	aafabbri, scofeldm

On Tuesday 01 June 2010 03:46:51 am Michael S. Tsirkin wrote:
> On Tue, Jun 01, 2010 at 01:28:48PM +0300, Avi Kivity wrote:
> > On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
> >>
> >>>   It can't program the iommu.
> >>> What
> >>> the patch proposes is that userspace tells vfio about the needed
> >>> mappings, and vfio programs the iommu.
> >>>      
> >> There seems to be some misunderstanding.  The userspace interface
> >> proposed forces a separate domain per device and forces userspace to
> >> repeat iommu programming for each device.  We are better off sharing a
> >> domain between devices and programming the iommu once.
> >>    
> >
> >   iommufd = open(/dev/iommu);
> >   ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
> >   ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)
> >
> > ?
> 
> Yes.
> 
> > If so, I agree.
> 
> Good.

I'm not really opposed to multiple devices per domain, but let me point out how I
ended up here.  First, the driver has two ways of mapping pages, one based on the
iommu api and one based on the dma_map_sg api.  With the latter, the system
already allocates a domain per device and there's no way to control it. This was
presumably done to help isolation between drivers.  If there are multiple drivers
in the user level, do we not want the same isolation to apply to them?
Also, domains are not a very scarce resource - my little core i5 has 256, 
and the intel architecture goes to 64K.
And then there's the fact that it is possible to have multiple disjoint iommus on a system,
so it may not even be possible to bring 2 devices under one domain. 

Given all that, I am inclined to leave it alone until someone has a real problem.
Note that not sharing iommu domains doesn't mean you can't share device memory,
just that you have to do multiple mappings.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-05-31 17:17 ` Alan Cox
@ 2010-06-01 21:29   ` Tom Lyon
  0 siblings, 0 replies; 66+ messages in thread
From: Tom Lyon @ 2010-06-01 21:29 UTC (permalink / raw)
  To: Alan Cox
  Cc: linux-kernel, kvm, chrisw, joro, hjk, mst, avi, gregkh, aafabbri,
	scofeldm

On Monday 31 May 2010 10:17:35 am Alan Cox wrote:
> 
> Does look like it needs a locking audit, some memory and error checks
> reviewing and some further review of the ioctl security and
> overflows/trusted values.
Yes. Thanks for the detailed look.
> 
> Rather a nice way of attacking the user space PCI problem.
And thanks for that!


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-01 21:26                           ` Tom Lyon
@ 2010-06-02  2:59                             ` Avi Kivity
  2010-06-02  5:29                               ` Chris Wright
  0 siblings, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-06-02  2:59 UTC (permalink / raw)
  To: Tom Lyon
  Cc: Michael S. Tsirkin, linux-kernel, kvm, chrisw, joro, hjk, gregkh,
	aafabbri, scofeldm, alex.williamson

On 06/02/2010 12:26 AM, Tom Lyon wrote:
>
> I'm not really opposed to multiple devices per domain, but let me point out how I
> ended up here.  First, the driver has two ways of mapping pages, one based on the
> iommu api and one based on the dma_map_sg api.  With the latter, the system
> already allocates a domain per device and there's no way to control it. This was
> presumably done to help isolation between drivers.  If there are multiple drivers
> in the user level, do we not want the same isolation to apply to them?
>    

In the case of kvm, we don't want isolation between devices, because 
that doesn't happen on real hardware.  So if the guest programs devices 
to dma to each other, we want that to succeed.

> Also, domains are not a very scarce resource - my little core i5 has 256,
> and the intel architecture goes to 64K.
>    

But there is a 0.2% of mapped memory per domain cost for the page 
tables.  For the kvm use case, that could be significant since a guest 
may have large amounts of memory and large numbers of assigned devices.

> And then there's the fact that it is possible to have multiple disjoint iommus on a system,
> so it may not even be possible to bring 2 devices under one domain.
>    

That's indeed a deficiency.

> Given all that, I am inclined to leave it alone until someone has a real problem.
> Note that not sharing iommu domains doesn't mean you can't share device memory,
> just that you have to do multiple mappings.
>    

I think we do have a real problem (though a mild one).

The only issue I see with deferring the solution is that the API becomes 
gnarly; both the kernel and userspace will have to support both APIs 
forever.  Perhaps we can implement the new API but defer the actual 
sharing until later, don't know how much work this saves.  Or Alex/Chris 
can pitch in and help.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-01 10:28                       ` Avi Kivity
  2010-06-01 10:46                         ` Michael S. Tsirkin
@ 2010-06-02  4:29                         ` Alex Williamson
  2010-06-02  4:59                           ` Tom Lyon
  1 sibling, 1 reply; 66+ messages in thread
From: Alex Williamson @ 2010-06-02  4:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Tom Lyon, linux-kernel, kvm, chrisw, joro,
	hjk, gregkh, aafabbri, scofeldm

On Tue, 2010-06-01 at 13:28 +0300, Avi Kivity wrote:
> On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
> >
> >>   It can't program the iommu.
> >> What
> >> the patch proposes is that userspace tells vfio about the needed
> >> mappings, and vfio programs the iommu.
> >>      
> > There seems to be some misunderstanding.  The userspace interface
> > proposed forces a separate domain per device and forces userspace to
> > repeat iommu programming for each device.  We are better off sharing a
> > domain between devices and programming the iommu once.
> >    
> 
>    iommufd = open(/dev/iommu);
>    ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
>    ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)

It seems part of the annoyance of the current KVM device assignment is
that we have multiple files open, we mmap here, read there, write over
there, maybe, if it's not emulated.  I quite like Tom's approach that we
have one-stop shopping with /dev/vfio<n>, including config space
emulation so each driver doesn't have to try to write their own.  So
continuing with that, shouldn't we be able to add a GET_IOMMU/SET_IOMMU
ioctl to vfio so that after we setup one device we can bind the next to
the same domain?

Alex


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02  4:29                         ` Alex Williamson
@ 2010-06-02  4:59                           ` Tom Lyon
  2010-06-02  5:08                             ` Avi Kivity
  2010-06-02  9:53                             ` Joerg Roedel
  0 siblings, 2 replies; 66+ messages in thread
From: Tom Lyon @ 2010-06-02  4:59 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Avi Kivity, Michael S. Tsirkin, linux-kernel, kvm, chrisw, joro,
	hjk, gregkh, aafabbri, scofeldm

On Tuesday 01 June 2010 09:29:47 pm Alex Williamson wrote:
> On Tue, 2010-06-01 at 13:28 +0300, Avi Kivity wrote:
> > On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
> > >
> > >>   It can't program the iommu.
> > >> What
> > >> the patch proposes is that userspace tells vfio about the needed
> > >> mappings, and vfio programs the iommu.
> > >>      
> > > There seems to be some misunderstanding.  The userspace interface
> > > proposed forces a separate domain per device and forces userspace to
> > > repeat iommu programming for each device.  We are better off sharing a
> > > domain between devices and programming the iommu once.
> > >    
> > 
> >    iommufd = open(/dev/iommu);
> >    ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
> >    ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)
> 
> It seems part of the annoyance of the current KVM device assignment is
> that we have multiple files open, we mmap here, read there, write over
> there, maybe, if it's not emulated.  I quite like Tom's approach that we
> > have one-stop shopping with /dev/vfio<n>, including config space
> emulation so each driver doesn't have to try to write their own.  So
> continuing with that, shouldn't we be able to add a GET_IOMMU/SET_IOMMU
> ioctl to vfio so that after we setup one device we can bind the next to
> the same domain?

This is just what I was thinking.  But rather than a get/set, just use two fds.

	ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);

This may fail if there are really 2 different IOMMUs, so user code must be
prepared for failure. In addition, this is strictly upwards compatible with
what is there now, so maybe we can add it later.
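A caller-side sketch of that proposal (the ioctl number is made up, since VFIO_SET_DOMAIN does not exist): the helper reports failure so callers can fall back to separate domains.

```c
#include <sys/ioctl.h>
#include <errno.h>

/* Hypothetical request number for the proposed two-fd ioctl. */
#define VFIO_SET_DOMAIN 0x4903

/* Try to pull vfio_fd2's device into vfio_fd1's IOMMU domain.  This
 * can legitimately fail (e.g. the devices sit behind disjoint
 * IOMMUs), so callers must be ready to keep separate domains. */
static int share_domain(int vfio_fd1, int vfio_fd2)
{
	if (ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2) < 0)
		return -errno;	/* fall back: separate domains */
	return 0;
}
```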



^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02  4:59                           ` Tom Lyon
@ 2010-06-02  5:08                             ` Avi Kivity
  2010-06-02  9:53                             ` Joerg Roedel
  1 sibling, 0 replies; 66+ messages in thread
From: Avi Kivity @ 2010-06-02  5:08 UTC (permalink / raw)
  To: Tom Lyon
  Cc: Alex Williamson, Michael S. Tsirkin, linux-kernel, kvm, chrisw,
	joro, hjk, gregkh, aafabbri, scofeldm

On 06/02/2010 07:59 AM, Tom Lyon wrote:
>
> This is just what I was thinking.  But rather than a get/set, just use two fds.
>
> 	ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);
>
> This may fail if there are really 2 different IOMMUs, so user code must be
> prepared for failure. In addition, this is strictly upwards compatible with
> what is there now, so maybe we can add it later.
>
>    

What happens if one of the fds is later closed?

I don't like this conceptually.  There is a 1:n relationship between the 
memory map and the devices.  Ignoring it will cause the API to have 
warts.  It's more straightforward to have an object to represent the 
memory mapping (and talk to the iommus), and have devices bind to this 
object.


-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02  2:59                             ` Avi Kivity
@ 2010-06-02  5:29                               ` Chris Wright
  2010-06-02  5:40                                 ` Avi Kivity
  0 siblings, 1 reply; 66+ messages in thread
From: Chris Wright @ 2010-06-02  5:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Tom Lyon, Michael S. Tsirkin, linux-kernel, kvm, chrisw, joro,
	hjk, gregkh, aafabbri, scofeldm, alex.williamson

* Avi Kivity (avi@redhat.com) wrote:
> On 06/02/2010 12:26 AM, Tom Lyon wrote:
> >
> >I'm not really opposed to multiple devices per domain, but let me point out how I
> >ended up here.  First, the driver has two ways of mapping pages, one based on the
> >iommu api and one based on the dma_map_sg api.  With the latter, the system
> >already allocates a domain per device and there's no way to control it. This was
> >presumably done to help isolation between drivers.  If there are multiple drivers
> >in the user level, do we not want the same isolation to apply to them?
> 
> In the case of kvm, we don't want isolation between devices, because
> that doesn't happen on real hardware.

Sure it does.  That's exactly what happens when there's an iommu
involved with bare metal.

> So if the guest programs
> devices to dma to each other, we want that to succeed.

And it will as long as ATS is enabled (this is a basic requirement
for PCIe peer-to-peer traffic to succeed with an iommu involved on
bare metal).

That's how things currently are, i.e. we put all devices belonging to a
single guest in the same domain.  However, it can be useful to put each
device belonging to a guest in a unique domain.  Especially as qemu
grows support for iommu emulation, and guest OSes begin to understand
how to use a hw iommu.

> >Also, domains are not a very scarce resource - my little core i5 has 256,
> >and the intel architecture goes to 64K.
> 
> But there is a 0.2% of mapped memory per domain cost for the page
> tables.  For the kvm use case, that could be significant since a
> guest may have large amounts of memory and large numbers of assigned
> devices.
> 
> >And then there's the fact that it is possible to have multiple disjoint iommus on a system,
> >so it may not even be possible to bring 2 devices under one domain.
> 
> That's indeed a deficiency.

Not sure it's a deficiency.  Typically to share page table mappings
across multiple iommu's you just have to do update/invalidate to each
hw iommu that is sharing the mapping.  Alternatively, you can use more
memory and build/maintain identical mappings (as Tom alludes to below).

> >Given all that, I am inclined to leave it alone until someone has a real problem.
> >Note that not sharing iommu domains doesn't mean you can't share device memory,
> >just that you have to do multiple mappings
> 
> I think we do have a real problem (though a mild one).
> 
> The only issue I see with deferring the solution is that the API
> becomes gnarly; both the kernel and userspace will have to support
> both APIs forever.  Perhaps we can implement the new API but defer
> the actual sharing until later, don't know how much work this saves.
> Or Alex/Chris can pitch in and help.

It really shouldn't be that complicated to create the API to allow for
flexible device <-> domain mappings, so I agree, makes sense to do it
right up front.

thanks,
-chris


* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02  5:29                               ` Chris Wright
@ 2010-06-02  5:40                                 ` Avi Kivity
  0 siblings, 0 replies; 66+ messages in thread
From: Avi Kivity @ 2010-06-02  5:40 UTC (permalink / raw)
  To: Chris Wright
  Cc: Tom Lyon, Michael S. Tsirkin, linux-kernel, kvm, joro, hjk,
	gregkh, aafabbri, scofeldm, alex.williamson

On 06/02/2010 08:29 AM, Chris Wright wrote:
> * Avi Kivity (avi@redhat.com) wrote:
>    
>> On 06/02/2010 12:26 AM, Tom Lyon wrote:
>>      
>>> I'm not really opposed to multiple devices per domain, but let me point out how I
>>> ended up here.  First, the driver has two ways of mapping pages, one based on the
>>> iommu api and one based on the dma_map_sg api.  With the latter, the system
>>> already allocates a domain per device and there's no way to control it. This was
>>> presumably done to help isolation between drivers.  If there are multiple drivers
>>> in the user level, do we not want the same isolation to apply to them?
>>>        
>> In the case of kvm, we don't want isolation between devices, because
>> that doesn't happen on real hardware.
>>      
> Sure it does.  That's exactly what happens when there's an iommu
> involved with bare metal.
>    

But we are emulating a machine without an iommu.

When we emulate a machine with an iommu, then yes, we'll want to use as 
many domains as the guest does.

>> So if the guest programs
>> devices to dma to each other, we want that to succeed.
>>      
> And it will as long as ATS is enabled (this is a basic requirement
> for PCIe peer-to-peer traffic to succeed with an iommu involved on
> bare metal).
>
> That's how things currently are, i.e. we put all devices belonging to a
> single guest in the same domain.  However, it can be useful to put each
> device belonging to a guest in a unique domain.  Especially as qemu
> grows support for iommu emulation, and guest OSes begin to understand
> how to use a hw iommu.
>    

Right, we need to keep flexibility.

>>> And then there's the fact that it is possible to have multiple disjoint iommus on a system,
>>> so it may not even be possible to bring 2 devices under one domain.
>>>        
>> That's indeed a deficiency.
>>      
> Not sure it's a deficiency.  Typically to share page table mappings
> across multiple iommu's you just have to do update/invalidate to each
> hw iommu that is sharing the mapping.  Alternatively, you can use more
> memory and build/maintain identical mappings (as Tom alludes to below).
>    

Sharing the page tables is just an optimization; I was worried about 
devices in separate domains not talking to each other.  If ATS fixes 
that, great.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-01  9:55                     ` Michael S. Tsirkin
  2010-06-01 10:28                       ` Avi Kivity
@ 2010-06-02  9:42                       ` Joerg Roedel
  2010-06-02  9:50                         ` Avi Kivity
  2010-06-02  9:53                         ` Michael S. Tsirkin
  1 sibling, 2 replies; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02  9:42 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote:

> There seems to be some misunderstanding.  The userspace interface
> proposed forces a separate domain per device and forces userspace to
> repeat iommu programming for each device.  We are better off sharing a
> domain between devices and programming the iommu once.
> 
> The natural way to do this is to have an iommu driver for programming
> iommu.

IMO a separate iommu-userspace driver is a nightmare for a userspace
interface. It is just too complicated to use. We can solve the problem
of multiple devices-per-domain with an ioctl which allows binding one
uio-device to the address space of another. That's much simpler.

	Joerg



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-01 12:41                           ` Avi Kivity
@ 2010-06-02  9:45                             ` Joerg Roedel
  2010-06-02  9:49                               ` Avi Kivity
  2010-06-02 10:15                               ` Michael S. Tsirkin
  0 siblings, 2 replies; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02  9:45 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Tom Lyon, linux-kernel, kvm, chrisw, hjk,
	gregkh, aafabbri, scofeldm

On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote:
> On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:

>> Main difference is that vhost works fine with unlocked
>> memory, paging it in on demand. iommu needs to unmap
>> memory when it is swapped out or relocated.
>>
> So you'd just take the memory map and not pin anything.  This way you  
> can reuse the memory map.
>
> But no, it doesn't handle the dirty bitmap, so no go.

IOMMU mapped memory can not be swapped out because we can't do demand
paging on io-page-faults with current devices. We have to pin _all_
userspace memory that is mapped into an IOMMU domain.

	Joerg



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02  9:45                             ` Joerg Roedel
@ 2010-06-02  9:49                               ` Avi Kivity
  2010-06-02 10:04                                 ` Joerg Roedel
  2010-06-02 10:15                               ` Michael S. Tsirkin
  1 sibling, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-06-02  9:49 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Michael S. Tsirkin, Tom Lyon, linux-kernel, kvm, chrisw, hjk,
	gregkh, aafabbri, scofeldm

On 06/02/2010 12:45 PM, Joerg Roedel wrote:
> On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote:
>    
>> On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:
>>      
>    
>>> Main difference is that vhost works fine with unlocked
>>> memory, paging it in on demand. iommu needs to unmap
>>> memory when it is swapped out or relocated.
>>>
>>>        
>> So you'd just take the memory map and not pin anything.  This way you
>> can reuse the memory map.
>>
>> But no, it doesn't handle the dirty bitmap, so no go.
>>      
> IOMMU mapped memory can not be swapped out because we can't do demand
> paging on io-page-faults with current devices. We have to pin _all_
> userspace memory that is mapped into an IOMMU domain.
>    

vhost doesn't pin memory.

What I proposed is to describe the memory map using an object (fd), and 
pass it around to clients that use it: kvm, vhost, vfio.  That way you 
maintain the memory map in a central location and broadcast changes to 
clients.  Only a vfio client would result in memory being pinned.

It can still work, but the interface needs to be extended to include 
dirty bitmap logging.

-- 
error compiling committee.c: too many arguments to function
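The "memory map as an object with clients" idea can be sketched as a small broadcast registry: only a pinning (vfio-style) client turns a map update into pinned pages, while kvm/vhost-style clients just observe. All names below are illustrative, not an existing kernel API.

```c
#include <assert.h>

#define MAX_CLIENTS 4

/* A client of the shared memory map: kvm, vhost, or vfio.  Only a
 * pinning (vfio-style) client has to pin pages when the map changes. */
struct memmap_client {
    int pins_memory;              /* nonzero for a vfio-style client */
    unsigned long pinned_pages;   /* pages this client has pinned */
};

/* The central memory-map object; changes are broadcast to clients. */
struct memmap {
    struct memmap_client *clients[MAX_CLIENTS];
    int nclients;
};

void memmap_register(struct memmap *m, struct memmap_client *c)
{
    m->clients[m->nclients++] = c;
}

/* One central update; every registered client hears about it, but
 * only pinning clients pay the pinned-memory cost. */
void memmap_add_region(struct memmap *m, unsigned long npages)
{
    for (int i = 0; i < m->nclients; i++)
        if (m->clients[i]->pins_memory)
            m->clients[i]->pinned_pages += npages;
}
```

The design point is the 1:n relationship: one map object, many clients, and the map is programmed once rather than per device.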



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02  9:42                       ` Joerg Roedel
@ 2010-06-02  9:50                         ` Avi Kivity
  2010-06-02  9:53                         ` Michael S. Tsirkin
  1 sibling, 0 replies; 66+ messages in thread
From: Avi Kivity @ 2010-06-02  9:50 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Michael S. Tsirkin, Tom Lyon, linux-kernel, kvm, chrisw, hjk,
	gregkh, aafabbri, scofeldm

On 06/02/2010 12:42 PM, Joerg Roedel wrote:
> On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote:
>
>    
>> There seems to be some misunderstanding.  The userspace interface
>> proposed forces a separate domain per device and forces userspace to
>> repeat iommu programming for each device.  We are better off sharing a
>> domain between devices and programming the iommu once.
>>
>> The natural way to do this is to have an iommu driver for programming
>> iommu.
>>      
> IMO a separate iommu-userspace driver is a nightmare for a userspace
> interface. It is just too complicated to use. We can solve the problem
> of multiple devices-per-domain with an ioctl which allows binding one
> uio-device to the address space of another. That's much simpler.
>    

This is non trivial with hotplug.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02  9:42                       ` Joerg Roedel
  2010-06-02  9:50                         ` Avi Kivity
@ 2010-06-02  9:53                         ` Michael S. Tsirkin
  2010-06-02 10:19                           ` Joerg Roedel
  1 sibling, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-06-02  9:53 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:
> On Tue, Jun 01, 2010 at 12:55:32PM +0300, Michael S. Tsirkin wrote:
> 
> > There seems to be some misunderstanding.  The userspace interface
> > proposed forces a separate domain per device and forces userspace to
> > repeat iommu programming for each device.  We are better off sharing a
> > domain between devices and programming the iommu once.
> > 
> > The natural way to do this is to have an iommu driver for programming
> > iommu.
> 
> IMO a separate iommu-userspace driver is a nightmare for a userspace
> interface. It is just too complicated to use.

One advantage would be that we can reuse the uio framework
for the devices themselves. So an existing app can just program
an iommu for DMA and keep using uio for interrupts and access.

> We can solve the problem
> of multiple devices-per-domain with an ioctl which allows binding one
> uio-device to the address space of another.

This would imply switching an iommu domain for a device while
it could potentially be doing DMA. No idea whether this can be done
in a safe manner.
Forcing iommu assignment to be done as a first step seems much saner.


> Thats much simpler.
> 
> 	Joerg


So instead of
dev = open();
ioctl(dev, ASSIGN, iommu)
mmap

and if we forget the ioctl, mmap will fail,
we have

dev = open();
if (ndevices > 0)
	ioctl(devices[0], ASSIGN, dev)
mmap

And if we forget the ioctl, we get errors from the device.
Seems more complicated to me.


There will also always be the question: which device's address space are
we modifying? With a separate driver for the iommu, we can safely check
that the binding is done correctly.

-- 
MST


* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02  4:59                           ` Tom Lyon
  2010-06-02  5:08                             ` Avi Kivity
@ 2010-06-02  9:53                             ` Joerg Roedel
  1 sibling, 0 replies; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02  9:53 UTC (permalink / raw)
  To: Tom Lyon
  Cc: Alex Williamson, Avi Kivity, Michael S. Tsirkin, linux-kernel,
	kvm, chrisw, hjk, gregkh, aafabbri, scofeldm

On Tue, Jun 01, 2010 at 09:59:40PM -0700, Tom Lyon wrote:
> This is just what I was thinking.  But rather than a get/set, just use two fds.
> 
> 	ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);
> 
> This may fail if there are really 2 different IOMMUs, so user code must be
> prepared for failure.  In addition, this is strictly upwards compatible with 
> what is there now, so maybe we can add it later.

How can this fail with multiple IOMMUs? This should be handled
transparently by the IOMMU driver.

	Joerg



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02  9:49                               ` Avi Kivity
@ 2010-06-02 10:04                                 ` Joerg Roedel
  2010-06-02 10:09                                   ` Michael S. Tsirkin
  2010-06-02 11:21                                   ` Avi Kivity
  0 siblings, 2 replies; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02 10:04 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Tom Lyon, linux-kernel, kvm, chrisw, hjk,
	gregkh, aafabbri, scofeldm

On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote:
> On 06/02/2010 12:45 PM, Joerg Roedel wrote:
>> IOMMU mapped memory can not be swapped out because we can't do demand
>> paging on io-page-faults with current devices. We have to pin _all_
>> userspace memory that is mapped into an IOMMU domain.
>
> vhost doesn't pin memory.
>
> What I proposed is to describe the memory map using an object (fd), and  
> pass it around to clients that use it: kvm, vhost, vfio.  That way you  
> maintain the memory map in a central location and broadcast changes to  
> clients.  Only a vfio client would result in memory being pinned.

Ah ok, so it's only about the database which keeps the mapping
information.

> It can still work, but the interface needs to be extended to include  
> dirty bitmap logging.

That's hard to do. I am not sure about VT-d, but the AMD IOMMU has no
dirty bits in the page table. And without demand paging we can't really
tell which pages a device has written to. The only choice is to mark all
IOMMU-mapped pages dirty as long as they are mapped.

	Joerg
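Without hardware dirty bits, the conservative logging described above reduces to: report every currently IOMMU-mapped page as dirty on every pass. A toy model (64 pages tracked in one bitmap word; all names are illustrative, not kernel API):

```c
#include <assert.h>

/* Toy model: 64 guest pages tracked in a single bitmap word. */
static unsigned long long iommu_mapped;   /* pages currently IOMMU-mapped */

void iommu_map_range(unsigned first, unsigned n)
{
    for (unsigned i = first; i < first + n; i++)
        iommu_mapped |= 1ULL << i;
}

void iommu_unmap_range(unsigned first, unsigned n)
{
    for (unsigned i = first; i < first + n; i++)
        iommu_mapped &= ~(1ULL << i);
}

/* Dirty log as seen by a migration client: with no dirty bits in the
 * io-page-table, every mapped page must be assumed written. */
unsigned long long get_dirty_log(void)
{
    return iommu_mapped;
}
```

The unmap variant suggested in the follow-up would instead set the page's dirty bit once, at iommu_unmap_range() time, rather than reporting it on every pass while mapped.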



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 10:04                                 ` Joerg Roedel
@ 2010-06-02 10:09                                   ` Michael S. Tsirkin
  2010-06-02 11:21                                   ` Avi Kivity
  1 sibling, 0 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-06-02 10:09 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 12:04:04PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote:
> > On 06/02/2010 12:45 PM, Joerg Roedel wrote:
> >> IOMMU mapped memory can not be swapped out because we can't do demand
> >> paging on io-page-faults with current devices. We have to pin _all_
> >> userspace memory that is mapped into an IOMMU domain.
> >
> > vhost doesn't pin memory.
> >
> > What I proposed is to describe the memory map using an object (fd), and  
> > pass it around to clients that use it: kvm, vhost, vfio.  That way you  
> > maintain the memory map in a central location and broadcast changes to  
> > clients.  Only a vfio client would result in memory being pinned.
> 
> Ah ok, so it's only about the database which keeps the mapping
> information.
> 
> > It can still work, but the interface needs to be extended to include  
> > dirty bitmap logging.
> 
> That's hard to do. I am not sure about VT-d, but the AMD IOMMU has no
> dirty bits in the page table. And without demand paging we can't really
> tell which pages a device has written to. The only choice is to mark all
> IOMMU-mapped pages dirty as long as they are mapped.
> 
> 	Joerg

Or mark them dirty when they are unmapped.

-- 
MST


* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02  9:45                             ` Joerg Roedel
  2010-06-02  9:49                               ` Avi Kivity
@ 2010-06-02 10:15                               ` Michael S. Tsirkin
  2010-06-02 10:26                                 ` Joerg Roedel
  1 sibling, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-06-02 10:15 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 11:45:27AM +0200, Joerg Roedel wrote:
> On Tue, Jun 01, 2010 at 03:41:55PM +0300, Avi Kivity wrote:
> > On 06/01/2010 01:46 PM, Michael S. Tsirkin wrote:
> 
> >> Main difference is that vhost works fine with unlocked
> >> memory, paging it in on demand. iommu needs to unmap
> >> memory when it is swapped out or relocated.
> >>
> > So you'd just take the memory map and not pin anything.  This way you  
> > can reuse the memory map.
> >
> > But no, it doesn't handle the dirty bitmap, so no go.
> 
> IOMMU mapped memory can not be swapped out because we can't do demand
> paging on io-page-faults with current devices. We have to pin _all_
> userspace memory that is mapped into an IOMMU domain.
> 
> 	Joerg


One of the issues I see with the current patch is that
it uses the mlock rlimit to do this pinning. So this wastes the rlimit
for an app that did mlockall already, and also consumes
this resource transparently, so an app might call mlock
on a small buffer and be surprised that it fails.

Using mmu notifiers might help?


-- 
MST
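The clash is visible in a toy model of RLIMIT_MEMLOCK bookkeeping: the driver's pinning and the application's own mlock() charge the same per-process counter, so a pin-heavy device silently eats the app's mlock headroom. This mirrors the usual locked_vm accounting pattern, simplified here to bare page counts (not the actual patch's code).

```c
#include <assert.h>

static unsigned long locked_vm;      /* pages already charged to the process */
static unsigned long memlock_limit;  /* RLIMIT_MEMLOCK, in pages */

/* Charge npages against the limit; both the driver's page pinning and
 * a later mlock() would go through bookkeeping like this. */
int charge_locked(unsigned long npages)
{
    if (locked_vm + npages > memlock_limit)
        return -1;   /* the real code would return -ENOMEM or -EPERM */
    locked_vm += npages;
    return 0;
}
```

The surprise MST describes is the second call: the application never asked the driver to consume its rlimit, yet its own small mlock now fails.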


* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02  9:53                         ` Michael S. Tsirkin
@ 2010-06-02 10:19                           ` Joerg Roedel
  2010-06-02 10:21                             ` Michael S. Tsirkin
  2010-06-02 10:44                             ` Michael S. Tsirkin
  0 siblings, 2 replies; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02 10:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:

> > IMO a separate iommu-userspace driver is a nightmare for a userspace
> > interface. It is just too complicated to use.
> 
> One advantage would be that we can reuse the uio framework
> for the devices themselves. So an existing app can just program
> an iommu for DMA and keep using uio for interrupts and access.

The driver is called UIO and not U-INTR-MMIO ;-) So I think handling
IOMMU mappings belongs there.

> > We can solve the problem
> > of multiple devices-per-domain with an ioctl which allows binding one
> > uio-device to the address space of another.
> 
> This would imply switching an iommu domain for a device while
> it could potentially be doing DMA. No idea whether this can be done
> in a safe manner.

It can. The worst thing that can happen is an io-page-fault.

> Forcing iommu assignment to be done as a first step seems much saner.

If we force it, there is no reason not to do it implicitly.

We can do something like this then:

dev1 = open();
ioctl(dev1, IOMMU_MAP, ...); /* creates IOMMU domain and assigns dev1 to
			        it*/

dev2 = open();
ioctl(dev2, IOMMU_MAP, ...);

/* Now dev1 and dev2 are in separate domains */

ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
				   assigns it to the same domain as
				   dev1. Domain has a refcount of two
				   now */

close(dev1); /* domain refcount goes down to one */
close(dev2); /* domain refcount is zero and domain gets destroyed */


	Joerg
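The lifecycle above is essentially per-domain reference counting. A toy model of the proposed semantics (IOMMU_MAP gives each device its own domain; IOMMU_SHARE moves a device into a peer's domain and drops its old one; the ioctl names are this thread's proposal, not an implemented API):

```c
#include <assert.h>
#include <stdlib.h>

struct domain { int refs; };
struct vdev   { struct domain *dom; };

static int live_domains;   /* how many domains currently exist */

static struct domain *domain_new(void)
{
    struct domain *d = malloc(sizeof *d);
    d->refs = 1;
    live_domains++;
    return d;
}

static void domain_put(struct domain *d)
{
    if (--d->refs == 0) {
        free(d);
        live_domains--;
    }
}

/* open() + first IOMMU_MAP: the device gets its own domain. */
void vdev_open(struct vdev *v) { v->dom = domain_new(); }

/* IOMMU_SHARE: v joins peer's domain; v's old mappings go away. */
void vdev_share(struct vdev *v, struct vdev *peer)
{
    peer->dom->refs++;
    domain_put(v->dom);
    v->dom = peer->dom;
}

/* close(): drop the reference; the last close destroys the domain. */
void vdev_close(struct vdev *v) { domain_put(v->dom); }
```

Note the asymmetry the model makes explicit: in vdev_share(v, peer) it is always v's old domain (and v's mappings) that are destroyed, never peer's.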



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 10:19                           ` Joerg Roedel
@ 2010-06-02 10:21                             ` Michael S. Tsirkin
  2010-06-02 10:35                               ` Joerg Roedel
  2010-06-02 10:44                             ` Michael S. Tsirkin
  1 sibling, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-06-02 10:21 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:
> 
> > > IMO a separate iommu-userspace driver is a nightmare for a userspace
> > > interface. It is just too complicated to use.
> > 
> > One advantage would be that we can reuse the uio framework
> > for the devices themselves. So an existing app can just program
> > an iommu for DMA and keep using uio for interrupts and access.
> 
> The driver is called UIO and not U-INTR-MMIO ;-) So I think handling
> IOMMU mappings belongs there.
> 
> > > We can solve the problem
> > > of multiple devices-per-domain with an ioctl which allows binding one
> > > uio-device to the address space of another.
> > 
> > This would imply switching an iommu domain for a device while
> > it could potentially be doing DMA. No idea whether this can be done
> > in a safe manner.
> 
> It can. The worst thing that can happen is an io-page-fault.

devices might not be able to recover from this.

> > Forcing iommu assignment to be done as a first step seems much saner.
> 
> If we force it, there is no reason not to do it implicitly.

What you describe below does 3 ioctls for what can be done with 1.

> We can do something like this then:
> 
> dev1 = open();
> ioctl(dev1, IOMMU_MAP, ...); /* creates IOMMU domain and assigns dev1 to
> 			        it*/
> 
> dev2 = open();
> ioctl(dev2, IOMMU_MAP, ...);
> 
> /* Now dev1 and dev2 are in separate domains */
> 
> ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
> 				   assigns it to the same domain as
> 				   dev1. Domain has a refcount of two
> 				   now */

Or maybe it destroys mapping for dev1?
How do you remember?

> close(dev1); /* domain refcount goes down to one */
> close(dev2); /* domain refcount is zero and domain gets destroyed */
> 
> 
> 	Joerg

Also, no way to unshare? That seems limiting.

-- 
MST


* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 10:15                               ` Michael S. Tsirkin
@ 2010-06-02 10:26                                 ` Joerg Roedel
  0 siblings, 0 replies; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02 10:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 01:15:34PM +0300, Michael S. Tsirkin wrote:
> One of the issues I see with the current patch is that
> it uses the mlock rlimit to do this pinning. So this wastes the rlimit
> for an app that did mlockall already, and also consumes
> this resource transparently, so an app might call mlock
> on a small buffer and be surprised that it fails.
> 
> Using mmu notifiers might help?

MMU notifiers are problematic because they are designed for situations
where we can do demand paging. The invalidate_range_start and
invalidate_range_end functions are not only called on munmap, they also
run when mprotect is called (in which case we don't want to tear down
iommu mappings). So what may happen with mmu notifiers is that we
accidentally tear down iommu mappings. With demand paging this is no
problem because the io-ptes could be re-faulted. But that does not work
here.

	Joerg



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 10:21                             ` Michael S. Tsirkin
@ 2010-06-02 10:35                               ` Joerg Roedel
  2010-06-02 10:38                                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02 10:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 01:21:44PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:

> > It can. The worst thing that can happen is an io-page-fault.
> 
> devices might not be able to recover from this.

With the userspace interface a process can create io-page-faults
anyway if it wants. We can't protect against this. And the process is
also responsible for not tearing down iommu mappings that are currently
in use.

> What you describe below does 3 ioctls for what can be done with 1.

The second IOMMU_MAP ioctl is just to show that existing mappings would
be destroyed if the device is assigned to another address space. Not
strictly necessary. So we have two ioctls but save one call to create
the iommu-domain.

> > ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
> > 				   assigns it to the same domain as
> > 				   dev1. Domain has a refcount of two
> > 				   now */
> 
> Or maybe it destroys mapping for dev1?
> How do you remember?

Because we express here that "dev2 shares the iommu mappings of dev1".
That's easy to remember.

> Also, no way to unshare? That seems limiting.

Just left out for simplicity reasons. An IOMMU_UNBIND (no IOMMU_UNSHARE
because that would require a second parameter) ioctl is certainly also
required.

	Joerg



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 10:35                               ` Joerg Roedel
@ 2010-06-02 10:38                                 ` Michael S. Tsirkin
  2010-06-02 11:12                                   ` Joerg Roedel
  0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-06-02 10:38 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 01:21:44PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:
> 
> > > It can. The worst thing that can happen is an io-page-fault.
> > 
> > devices might not be able to recover from this.
> 
> With the userspace interface a process can create io-page-faults
> anyway if it wants. We can't protect against this.

We could fail all operations until an iommu is bound.
This will help catch bugs with access before setup. We can not do this
if a domain is bound by default.

> And the process is
> also responsible for not tearing down iommu mappings that are currently
> in use.
> 
> > What you describe below does 3 ioctls for what can be done with 1.
> 
> The second IOMMU_MAP ioctl is just to show that existing mappings would
> be destroyed if the device is assigned to another address space. Not
> strictly necessary. So we have two ioctls but save one call to create
> the iommu-domain.

With 10 devices you have 10 extra ioctls.

> > > ioctl(dev2, IOMMU_SHARE, dev1); /* destroys all mapping for dev2 and
> > > 				   assigns it to the same domain as
> > > 				   dev1. Domain has a refcount of two
> > > 				   now */
> > 
> > Or maybe it destroys mapping for dev1?
> > How do you remember?
> 
> Because we express here that "dev2 shares the iommu mappings of dev1".
> That's easy to remember.

They both share the mappings. Which one gets its iommu domain
destroyed (breaking the device if it is now doing DMA)?

> > Also, no way to unshare? That seems limiting.
> 
> Just left out for simplicity reasons. An IOMMU_UNBIND (no IOMMU_UNSHARE
> because that would require a second parameter) ioctl is certainly also
> required.
> 
> 	Joerg



-- 
MST


* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 10:19                           ` Joerg Roedel
  2010-06-02 10:21                             ` Michael S. Tsirkin
@ 2010-06-02 10:44                             ` Michael S. Tsirkin
  1 sibling, 0 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-06-02 10:44 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 12:19:40PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 12:53:12PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 02, 2010 at 11:42:01AM +0200, Joerg Roedel wrote:
> 
> > > IMO a separate iommu-userspace driver is a nightmare for a userspace
> > > interface. It is just too complicated to use.
> > 
> > One advantage would be that we can reuse the uio framework
> > for the devices themselves. So an existing app can just program
> > an iommu for DMA and keep using uio for interrupts and access.
> 
> The driver is called UIO and not U-INTR-MMIO ;-) So I think handling
> IOMMU mappings belongs there.

Maybe it could be put there, but the patch posted did not use uio.
One of the reasons is that the uio framework provides for
device access and interrupts, but not for programming memory mappings.

Solutions (besides giving up on uio completely)
could include extending the framework in some way
(which was tried, but the result was not pretty) or adding
a separate driver for iommu and binding to that.

-- 
MST


* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 10:38                                 ` Michael S. Tsirkin
@ 2010-06-02 11:12                                   ` Joerg Roedel
  2010-06-02 11:21                                     ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02 11:12 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 01:38:28PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote:

> > With the userspace interface a process can create io-page-faults
> > anyway if it wants. We can't protect us from this.
> 
> We could fail all operations until an iommu is bound.  This will help
> catch bugs with access before setup. We can not do this if a domain is
> bound by default.

Even if it is bound to a domain, the userspace driver could program the
device to do dma to unmapped regions, causing io-page-faults. The kernel
can't do anything about it.

> > The second IOMMU_MAP ioctl is just to show that existing mappings would
> > be destroyed if the device is assigned to another address space. Not
> > strictly necessary. So we have two ioctls but save one call to create
> > the iommu-domain.
> 
> With 10 devices you have 10 extra ioctls.

And this works implicitly with your proposal? Remember that we still
need to be able to provide separate mappings for each device to support
IOMMU emulation for the guest. I think my proposal does not have any
extra costs.

> > Because we express here that "dev2 shares the iommu mappings of dev1".
> > Thats easy to remember.
> 
> they both share the mappings. which one gets the iommu
> destroyed (breaking the device if it is now doing DMA)?

As I wrote the domain has a reference count and is destroyed only when
it goes down to zero. This does not happen as long as a device is bound
to it.

	Joerg



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 11:12                                   ` Joerg Roedel
@ 2010-06-02 11:21                                     ` Michael S. Tsirkin
  2010-06-02 12:19                                       ` Joerg Roedel
  0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-06-02 11:21 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 01:38:28PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 02, 2010 at 12:35:16PM +0200, Joerg Roedel wrote:
> 
> > > With the userspace interface a process can create io-page-faults
> > > anyway if it wants. We can't protect us from this.
> > 
> > We could fail all operations until an iommu is bound.  This will help
> > catch bugs with access before setup. We can not do this if a domain is
> > bound by default.
> 
> Even if it is bound to a domain the userspace driver could program the
> device to do dma to unmapped regions causing io-page-faults. The kernel
> can't do anything about it.

It can always corrupt its own memory directly as well :)
But that is not a reason not to detect errors if we can,
and not to make APIs hard to misuse.

> > > The second IOMMU_MAP ioctl is just to show that existing mappings would
> > > be destroyed if the device is assigned to another address space. Not
> > > strictly necessary. So we have two ioctls but save one call to create
> > > the iommu-domain.
> > 
> > With 10 devices you have 10 extra ioctls.
> 
> And this works implicitly with your proposal?

Yes.  so you do:
iommu = open
ioctl(dev1, BIND, iommu)
ioctl(dev2, BIND, iommu)
ioctl(dev3, BIND, iommu)
ioctl(dev4, BIND, iommu)

No need to add a SHARE ioctl.


> Remember that we still
> need to be able to provide seperate mappings for each device to support
> IOMMU emulation for the guest.

Generally not true. E.g. the guest can enable iommu passthrough
or have a domain per group of devices.

> I think my proposal does not have any
> extra costs.

With my proposal we have 1 ioctl per device + 1 per domain.
With yours we have 2 ioctls per device if the iommu is shared
and 1 if it is not shared.

As current apps share the iommu, it seems to make sense
to optimize for that.

> > > Because we express here that "dev2 shares the iommu mappings of dev1".
> > > Thats easy to remember.
> > 
> > they both share the mappings. which one gets the iommu
> > destroyed (breaking the device if it is now doing DMA)?
> 
> As I wrote the domain has a reference count and is destroyed only when
> it goes down to zero. This does not happen as long as a device is bound
> to it.
> 
> 	Joerg

We were talking about UNSHARE ioctl:
ioctl(dev1, UNSHARE, dev2)
Does it change the domain for dev1 or dev2?
If you make a mistake you get a hard to debug bug.

-- 
MST


* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 10:04                                 ` Joerg Roedel
  2010-06-02 10:09                                   ` Michael S. Tsirkin
@ 2010-06-02 11:21                                   ` Avi Kivity
  2010-06-02 16:53                                     ` Chris Wright
  1 sibling, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-06-02 11:21 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Michael S. Tsirkin, Tom Lyon, linux-kernel, kvm, chrisw, hjk,
	gregkh, aafabbri, scofeldm

On 06/02/2010 01:04 PM, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 12:49:28PM +0300, Avi Kivity wrote:
>    
>> On 06/02/2010 12:45 PM, Joerg Roedel wrote:
>>      
>>> IOMMU mapped memory can not be swapped out because we can't do demand
>>> paging on io-page-faults with current devices. We have to pin _all_
>>> userspace memory that is mapped into an IOMMU domain.
>>>        
>> vhost doesn't pin memory.
>>
>> What I proposed is to describe the memory map using an object (fd), and
>> pass it around to clients that use it: kvm, vhost, vfio.  That way you
>> maintain the memory map in a central location and broadcast changes to
>> clients.  Only a vfio client would result in memory being pinned.
>>      
> Ah ok, so its only about the database which keeps the mapping
> information.
>    

Yes.

>    
>> It can still work, but the interface needs to be extended to include
>> dirty bitmap logging.
>>      
> Thats hard to do. I am not sure about VT-d but the AMD IOMMU has no
> dirty-bits in the page-table. And without demand-paging we can't really
> tell what pages a device has written to. The only choice is to mark all
> IOMMU-mapped pages dirty as long as they are mapped.
>
>    

The interface would only work for clients which support it: kvm, vhost, 
and iommu/devices with restartable dma.

Note dirty logging is not very interesting for vfio anyway, since you 
can't live migrate with assigned devices.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 11:21                                     ` Michael S. Tsirkin
@ 2010-06-02 12:19                                       ` Joerg Roedel
  2010-06-02 12:25                                         ` Avi Kivity
                                                           ` (2 more replies)
  0 siblings, 3 replies; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02 12:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:

> > Even if it is bound to a domain the userspace driver could program the
> > device to do dma to unmapped regions causing io-page-faults. The kernel
> > can't do anything about it.
> 
> It can always corrupt its own memory directly as well :)
> But that is not a reason not to detect errors if we can,
> and not to make APIs hard to misuse.

Changing the domain of a device while dma can happen is the same type of
bug as unmapping potential dma target addresses. We can't catch this
kind of misuse.

> > > With 10 devices you have 10 extra ioctls.
> > 
> > And this works implicitly with your proposal?
> 
> Yes.  so you do:
> iommu = open
> ioctl(dev1, BIND, iommu)
> ioctl(dev2, BIND, iommu)
> ioctl(dev3, BIND, iommu)
> ioctl(dev4, BIND, iommu)
> 
> No need to add a SHARE ioctl.

In my proposal this looks like:


dev1 = open();
ioctl(dev2, SHARE, dev1);
ioctl(dev3, SHARE, dev1);
ioctl(dev4, SHARE, dev1);

So we actually save an ioctl.

> > Remember that we still need to be able to provide seperate mappings
> > for each device to support IOMMU emulation for the guest.
> 
> Generally not true. E.g. guest can enable iommu passthrough
> or have domain per a group of devices.

What I meant was that there may be multiple io-address spaces
necessary for one process. I didn't want to say that every device
_needs_ to have its own address space.

> > As I wrote the domain has a reference count and is destroyed only when
> > it goes down to zero. This does not happen as long as a device is bound
> > to it.
> > 
> > 	Joerg
> 
> We were talking about UNSHARE ioctl:
> ioctl(dev1, UNSHARE, dev2)
> Does it change the domain for dev1 or dev2?
> If you make a mistake you get a hard to debug bug.

As I already wrote we would have an UNBIND ioctl which just removes a
device from its current domain. UNBIND is better than UNSHARE for
exactly the reason you pointed out above. I thought I stated that
already.

	Joerg



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 12:19                                       ` Joerg Roedel
@ 2010-06-02 12:25                                         ` Avi Kivity
  2010-06-02 12:50                                           ` Joerg Roedel
  2010-06-02 12:34                                         ` Michael S. Tsirkin
  2010-06-02 17:46                                         ` Chris Wright
  2 siblings, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-06-02 12:25 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Michael S. Tsirkin, Tom Lyon, linux-kernel, kvm, chrisw, hjk,
	gregkh, aafabbri, scofeldm

On 06/02/2010 03:19 PM, Joerg Roedel wrote:
>
>> Yes.  so you do:
>> iommu = open
>> ioctl(dev1, BIND, iommu)
>> ioctl(dev2, BIND, iommu)
>> ioctl(dev3, BIND, iommu)
>> ioctl(dev4, BIND, iommu)
>>
>> No need to add a SHARE ioctl.
>>      
> In my proposal this looks like:
>
>
> dev1 = open();
> ioctl(dev2, SHARE, dev1);
> ioctl(dev3, SHARE, dev1);
> ioctl(dev4, SHARE, dev1);
>
> So we actually save an ioctl.
>    

The problem with this is that it is asymmetric: dev1 is treated
differently from dev[234].  It's an unintuitive API.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 12:19                                       ` Joerg Roedel
  2010-06-02 12:25                                         ` Avi Kivity
@ 2010-06-02 12:34                                         ` Michael S. Tsirkin
  2010-06-02 13:02                                           ` Joerg Roedel
  2010-06-02 17:46                                         ` Chris Wright
  2 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-06-02 12:34 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 02:19:28PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
> 
> > > Even if it is bound to a domain the userspace driver could program the
> > > device to do dma to unmapped regions causing io-page-faults. The kernel
> > > can't do anything about it.
> > 
> > It can always corrupt its own memory directly as well :)
> > But that is not a reason not to detect errors if we can,
> > and not to make APIs hard to misuse.
> 
> Changing the domain of a device while dma can happen is the same type of
> bug as unmapping potential dma target addresses. We can't catch this
> kind of misuse.

You normally need the device mapped to start DMA.
SHARE makes this bug more likely as you allow
switching domains: the mmap could be done before switching.

> > > > With 10 devices you have 10 extra ioctls.
> > > 
> > > And this works implicitly with your proposal?
> > 
> > Yes.  so you do:
> > iommu = open
> > ioctl(dev1, BIND, iommu)
> > ioctl(dev2, BIND, iommu)
> > ioctl(dev3, BIND, iommu)
> > ioctl(dev4, BIND, iommu)
> > 
> > No need to add a SHARE ioctl.
> 
> In my proposal this looks like:
> 
> 
> dev1 = open();
> ioctl(dev2, SHARE, dev1);
> ioctl(dev3, SHARE, dev1);
> ioctl(dev4, SHARE, dev1);
> 
> So we actually save an ioctl.

I thought we had a BIND ioctl?

> > > Remember that we still need to be able to provide seperate mappings
> > > for each device to support IOMMU emulation for the guest.
> > 
> > Generally not true. E.g. guest can enable iommu passthrough
> > or have domain per a group of devices.
> 
> What I meant was that there may me multiple io-addresses spaces
> necessary for one process. I didn't want to say that every device
> _needs_ to have its own address space.
> 
> > > As I wrote the domain has a reference count and is destroyed only when
> > > it goes down to zero. This does not happen as long as a device is bound
> > > to it.
> > > 
> > > 	Joerg
> > 
> > We were talking about UNSHARE ioctl:
> > ioctl(dev1, UNSHARE, dev2)
> > Does it change the domain for dev1 or dev2?
> > If you make a mistake you get a hard to debug bug.
> 
> As I already wrote we would have an UNBIND ioctl which just removes a
> device from its current domain. UNBIND is better than UNSHARE for
> exactly the reason you pointed out above. I thought I stated that
> already.
> 
> 	Joerg

You undo SHARE with UNBIND?



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 12:25                                         ` Avi Kivity
@ 2010-06-02 12:50                                           ` Joerg Roedel
  2010-06-02 13:06                                             ` Avi Kivity
  2010-06-02 13:17                                             ` Michael S. Tsirkin
  0 siblings, 2 replies; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02 12:50 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Tom Lyon, linux-kernel, kvm, chrisw, hjk,
	gregkh, aafabbri, scofeldm

On Wed, Jun 02, 2010 at 03:25:11PM +0300, Avi Kivity wrote:
> On 06/02/2010 03:19 PM, Joerg Roedel wrote:
>>
>>> Yes.  so you do:
>>> iommu = open
>>> ioctl(dev1, BIND, iommu)
>>> ioctl(dev2, BIND, iommu)
>>> ioctl(dev3, BIND, iommu)
>>> ioctl(dev4, BIND, iommu)
>>>
>>> No need to add a SHARE ioctl.
>>>      
>> In my proposal this looks like:
>>
>>
>> dev1 = open();
>> ioctl(dev2, SHARE, dev1);
>> ioctl(dev3, SHARE, dev1);
>> ioctl(dev4, SHARE, dev1);
>>
>> So we actually save an ioctl.
>>    
>
> The problem with this is that it is assymetric, dev1 is treated  
> differently from dev[234].  It's an unintuitive API.

It's by far more unintuitive that a process needs to explicitly bind a
device to an iommu domain before it can do anything with it. If it's
required anyway, the binding can happen implicitly. We could allow a
no-op 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.

Note that this way of handling userspace iommu mappings is also a lot
simpler for most use-cases outside of KVM. If a developer wants to write
a userspace driver all it needs to do is:

dev = open();
ioctl(dev, MAP, ...);
/* use device with mappings */
close(dev);

Which is much easier than the need to create a domain explicitly.

	Joerg



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 12:34                                         ` Michael S. Tsirkin
@ 2010-06-02 13:02                                           ` Joerg Roedel
  0 siblings, 0 replies; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02 13:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 03:34:17PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 02, 2010 at 02:19:28PM +0200, Joerg Roedel wrote:

> you normally need device mapped to start DMA.
> SHARE makes this bug more likely as you allow
> switching domains: mmap could be done before switching.

We need to support domain switching anyway for iommu emulation in a
guest. So if you consider this to be a problem (I don't) it will not go
away with your proposal.

> > dev1 = open();
> > ioctl(dev2, SHARE, dev1);
> > ioctl(dev3, SHARE, dev1);
> > ioctl(dev4, SHARE, dev1);
> > 
> > So we actually save an ioctl.
> 
> I thought we had a BIND ioctl?

I can't remember a BIND ioctl in my proposal. I remember an UNBIND, but
that's bad naming, as you pointed out below. See my statement on this
below too.

> You undo SHARE with UNBIND?

That's bad naming, agreed. Let's keep UNSHARE. The point is that we only
need one parameter to do this, which removes any ambiguity:

ioctl(dev1, UNSHARE);

	Joerg



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 12:50                                           ` Joerg Roedel
@ 2010-06-02 13:06                                             ` Avi Kivity
  2010-06-02 13:53                                               ` Joerg Roedel
  2010-06-02 13:17                                             ` Michael S. Tsirkin
  1 sibling, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-06-02 13:06 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Michael S. Tsirkin, Tom Lyon, linux-kernel, kvm, chrisw, hjk,
	gregkh, aafabbri, scofeldm

On 06/02/2010 03:50 PM, Joerg Roedel wrote:
>
>> The problem with this is that it is assymetric, dev1 is treated
>> differently from dev[234].  It's an unintuitive API.
>>      
> Its by far more unintuitive that a process needs to explicitly bind a
> device to an iommu domain before it can do anything with it.

I don't really care about the iommu domain.  It's a side effect.  The 
kernel takes care of it.  I'm only worried about the API.

We have a memory map that is (often) the same for a set of devices.  If 
you were coding a non-kernel interface, how would you code it?

   struct memory_map;
   void memory_map_init(struct memory_map *mm, ...);
   struct device;
   void device_set_memory_map(struct device *device, struct memory_map *mm);

or

   struct device;
   void device_init_memory_map(struct device *dev, ...);
   void device_clone_memory_map(struct device *dev, struct device *other);

I wouldn't even think of the second one personally.

> If its
> required anyway the binding can happen implicitly. We could allow to do
> a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.
>    

It's still special.  You define the memory map only for the first 
device.  You have to make sure dev1 doesn't go away while sharing it.

> Note that this way of handling userspace iommu mappings is also a lot
> simpler for most use-cases outside of KVM. If a developer wants to write
> a userspace driver all it needs to do is:
>
> dev = open();
> ioctl(dev, MAP, ...);
> /* use device with mappings */
> close(dev);
>
> Which is much easier than the need to create a domain explicitly.
>    

mm = open()
ioctl(mm, MAP, ...)
dev = open();
ioctl(dev, BIND, mm);
...
close(mm);
close(dev);

So yes, it's more work, but once you have multiple devices which come and go
dynamically, things become simpler.  The map object has global lifetime
(you can even construct it if you don't assign any devices), the devices
attach to it, and memory hotplug updates the memory map but doesn't touch
the devices.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 12:50                                           ` Joerg Roedel
  2010-06-02 13:06                                             ` Avi Kivity
@ 2010-06-02 13:17                                             ` Michael S. Tsirkin
  2010-06-02 14:01                                               ` Joerg Roedel
  1 sibling, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-06-02 13:17 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 02:50:50PM +0200, Joerg Roedel wrote:
> On Wed, Jun 02, 2010 at 03:25:11PM +0300, Avi Kivity wrote:
> > On 06/02/2010 03:19 PM, Joerg Roedel wrote:
> >>
> >>> Yes.  so you do:
> >>> iommu = open
> >>> ioctl(dev1, BIND, iommu)
> >>> ioctl(dev2, BIND, iommu)
> >>> ioctl(dev3, BIND, iommu)
> >>> ioctl(dev4, BIND, iommu)
> >>>
> >>> No need to add a SHARE ioctl.
> >>>      
> >> In my proposal this looks like:
> >>
> >>
> >> dev1 = open();
> >> ioctl(dev2, SHARE, dev1);
> >> ioctl(dev3, SHARE, dev1);
> >> ioctl(dev4, SHARE, dev1);
> >>
> >> So we actually save an ioctl.
> >>    
> >
> > The problem with this is that it is assymetric, dev1 is treated  
> > differently from dev[234].  It's an unintuitive API.
> 
> Its by far more unintuitive that a process needs to explicitly bind a
> device to an iommu domain before it can do anything with it.

The reason it is more intuitive is that it is harder to get wrong.
If you swap the iommu and the device in the call, you get EBADF,
so you know you made a mistake. We can even make it work
both ways if we wanted to. With ioctl(dev1, BIND, dev2)
it breaks silently.


> If its
> required anyway the binding can happen implicitly. We could allow to do
> a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.

And then when we assign meaning to it we find that half the apps
are broken because they did not call this ioctl.

> Note that this way of handling userspace iommu mappings is also a lot
> simpler for most use-cases outside of KVM. If a developer wants to write
> a userspace driver all it needs to do is:
> 
> dev = open();
> ioctl(dev, MAP, ...);
> /* use device with mappings */
> close(dev);
> 
> Which is much easier than the need to create a domain explicitly.
> 
> 	Joerg

This simple scenario ignores all the real-life corner cases.
For example, with an explicit iommu open and bind, an application
can naturally detect that:
- we have run out of iommu domains
- iommu is unsupported
- iommu is in use by another, incompatible device
- device is in bad state
because each is a separate operation, so it is easy to produce meaningful
errors.

Another interesting thing that a separate iommu device supports is when
application A controls the iommu and application B
controls the device. This might be good to e.g. improve security
(B is run by root, A is unprivileged and passes commands to/from B
 over a pipe).

This is not possible when same fd is used for iommu and device.

-- 
MST


* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 13:06                                             ` Avi Kivity
@ 2010-06-02 13:53                                               ` Joerg Roedel
  0 siblings, 0 replies; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02 13:53 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, Tom Lyon, linux-kernel, kvm, chrisw, hjk,
	gregkh, aafabbri, scofeldm

On Wed, Jun 02, 2010 at 04:06:21PM +0300, Avi Kivity wrote:
> On 06/02/2010 03:50 PM, Joerg Roedel wrote:

>> Its by far more unintuitive that a process needs to explicitly bind a
>> device to an iommu domain before it can do anything with it.
>
> I don't really care about the iommu domain.  It's a side effect.  The  
> kernel takes care of it.  I'm only worried about the API.

The proposed memory-map object is nothing else than a userspace
abstraction of an iommu-domain.

> We have a memory map that is (often) the same for a set of devices.  If  
> you were coding a non-kernel interface, how would you code it?
>
>   struct memory_map;
>   void memory_map_init(struct memory_map *mm, ...);
>   struct device;
>   void device_set_memory_map(struct device *device, struct memory_map *mm);
>
> or
>
>   struct device;
>   void device_init_memory_map(struct device *dev, ...);
>   void device_clone_memory_map(struct device *dev, struct device *other);
>
> I wouldn't even think of the second one personally.

Right, a kernel interface would be designed the first way. The IOMMU-API
is actually designed in this manner. But I still think we should keep it
simpler for userspace.

>> If its required anyway the binding can happen implicitly. We could
>> allow to do a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.
>
> It's still special.  You define the memory map only for the first  
> device.  You have to make sure dev1 doesn't go away while sharing it.

Must be a misunderstanding. In my proposal the domain is not owned by
one device. It is owned by all devices that share it and will only
vanish if all devices that use it are unbound (which happens when the file
descriptor is closed, for example).

> so yes, more work, but once you have multiple devices which come and go  
> dynamically things become simpler.  The map object has global lifetime  
> (you can even construct it if you don't assign any devices), the devices  
> attach to it, memory hotplug updates the memory map but doesn't touch  
> devices.

I still think a userspace interface should be as simple as possible. But
since both ways will work, I am not really opposed to Michael's proposal.
I just think it's overkill for the common non-kvm usecase (a userspace
device driver).

	Joerg



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 13:17                                             ` Michael S. Tsirkin
@ 2010-06-02 14:01                                               ` Joerg Roedel
  0 siblings, 0 replies; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02 14:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Tom Lyon, linux-kernel, kvm, chrisw, hjk, gregkh,
	aafabbri, scofeldm

On Wed, Jun 02, 2010 at 04:17:19PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 02, 2010 at 02:50:50PM +0200, Joerg Roedel wrote:
> > On Wed, Jun 02, 2010 at 03:25:11PM +0300, Avi Kivity wrote:
> > > On 06/02/2010 03:19 PM, Joerg Roedel wrote:

> 
> > If its
> > required anyway the binding can happen implicitly. We could allow to do
> > a nop 'ioctl(dev1, SHARE, dev1)' to remove the asymmetry.
> 
> And then when we assign meaning to it we find that half the apps
> are broken because they did not call this ioctl.

The meaning is already assigned, and changing it means changing the
userspace ABI, which is a no-go.

> This simple scenario ignores all the real-life corner cases.
> For example, with an explicit iommu open and bind application
> can naturally detect that:
> - we have run out of iommu domains

ioctl(dev, MAP, ...)  will fail in this case.

> - iommu is unsupported

This is best checked by open() anyway, because userspace can't do anything
with the device before it is bound to a domain.

> - iommu is in use by another, incompatible device

How should this happen?

> - device is in bad state

How is this checked with your proposal, and why can this not be detected
with mine?

> because each is a separate operation, so it is easy to produce meaningful
> errors.

Ok, this is true.

> Another interesting thing that a separate iommu device supports is when
> application A controls the iommu and application B
> controls the device.

Until Linux becomes a micro-kernel the IOMMU itself will _never_ be
controlled by an application.

> This might be good to e.g. improve security (B is run by root, A is
> unpriveledged and passes commands to/from B over a pipe).

Micro-kernel arguments. I hope a userspace controlled IOMMU in Linux
will never happen ;-)

	Joerg



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 11:21                                   ` Avi Kivity
@ 2010-06-02 16:53                                     ` Chris Wright
  2010-06-06 13:44                                       ` Avi Kivity
  0 siblings, 1 reply; 66+ messages in thread
From: Chris Wright @ 2010-06-02 16:53 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Joerg Roedel, Michael S. Tsirkin, Tom Lyon, linux-kernel, kvm,
	chrisw, hjk, gregkh, aafabbri, scofeldm

* Avi Kivity (avi@redhat.com) wrote:
> The interface would only work for clients which support it: kvm,
> vhost, and iommu/devices with restartable dma.

BTW, there is no such thing as restartable dma.  There is a provision in
new specs (read: no real hardware) that allows a device to request pages
before using them.  So it's akin to demand paging, but the demand is an
explicit request rather than a page fault.

thanks,
-chris


* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 12:19                                       ` Joerg Roedel
  2010-06-02 12:25                                         ` Avi Kivity
  2010-06-02 12:34                                         ` Michael S. Tsirkin
@ 2010-06-02 17:46                                         ` Chris Wright
  2010-06-02 18:09                                           ` Tom Lyon
  2010-06-03  6:23                                           ` Avi Kivity
  2 siblings, 2 replies; 66+ messages in thread
From: Chris Wright @ 2010-06-02 17:46 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: Michael S. Tsirkin, Avi Kivity, Tom Lyon, linux-kernel, kvm,
	chrisw, hjk, gregkh, aafabbri, scofeldm

* Joerg Roedel (joro@8bytes.org) wrote:
> On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
> 
> > > Even if it is bound to a domain the userspace driver could program the
> > > device to do dma to unmapped regions causing io-page-faults. The kernel
> > > can't do anything about it.
> > 
> > It can always corrupt its own memory directly as well :)
> > But that is not a reason not to detect errors if we can,
> > and not to make APIs hard to misuse.
> 
> Changing the domain of a device while dma can happen is the same type of
> bug as unmapping potential dma target addresses. We can't catch this
> kind of misuse.
> 
> > > > With 10 devices you have 10 extra ioctls.
> > > 
> > > And this works implicitly with your proposal?
> > 
> > Yes.  so you do:
> > iommu = open
> > ioctl(dev1, BIND, iommu)
> > ioctl(dev2, BIND, iommu)
> > ioctl(dev3, BIND, iommu)
> > ioctl(dev4, BIND, iommu)
> > 
> > No need to add a SHARE ioctl.
> 
> In my proposal this looks like:
> 
> 
> dev1 = open();
> ioctl(dev2, SHARE, dev1);
> ioctl(dev3, SHARE, dev1);
> ioctl(dev4, SHARE, dev1);
> 
> So we actually save an ioctl.

This is not any hot path, so saving an ioctl shouldn't be a consideration.
Only important consideration is a good API.  I may have lost context here,
but the SHARE API is limited to the vfio fd.  The BIND API expects a new
iommu object.  Are there other uses for this object?  Tom's current vfio
driver exposes a dma mapping interface, would the iommu object expose
one as well?  Current interface is device specific DMA interface for
host device drivers typically mapping in-flight dma buffers, and IOMMU
specific interface for assigned devices typically mapping entire virtual
address space.

thanks,
-chris

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 17:46                                         ` Chris Wright
@ 2010-06-02 18:09                                           ` Tom Lyon
  2010-06-02 19:46                                             ` Joerg Roedel
  2010-06-03  6:23                                           ` Avi Kivity
  1 sibling, 1 reply; 66+ messages in thread
From: Tom Lyon @ 2010-06-02 18:09 UTC (permalink / raw)
  To: Chris Wright
  Cc: Joerg Roedel, Michael S. Tsirkin, Avi Kivity, linux-kernel, kvm,
	hjk, gregkh, aafabbri, scofeldm

On Wednesday 02 June 2010 10:46:15 am Chris Wright wrote:
> * Joerg Roedel (joro@8bytes.org) wrote:
> > On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
> > 
> > > > Even if it is bound to a domain the userspace driver could program the
> > > > device to do dma to unmapped regions causing io-page-faults. The kernel
> > > > can't do anything about it.
> > > 
> > > It can always corrupt its own memory directly as well :)
> > > But that is not a reason not to detect errors if we can,
> > > and not to make APIs hard to misuse.
> > 
> > Changing the domain of a device while dma can happen is the same type of
> > bug as unmapping potential dma target addresses. We can't catch this
> > kind of misuse.
> > 
> > > > > With 10 devices you have 10 extra ioctls.
> > > > 
> > > > And this works implicitly with your proposal?
> > > 
> > > Yes.  so you do:
> > > iommu = open
> > > ioctl(dev1, BIND, iommu)
> > > ioctl(dev2, BIND, iommu)
> > > ioctl(dev3, BIND, iommu)
> > > ioctl(dev4, BIND, iommu)
> > > 
> > > No need to add a SHARE ioctl.
> > 
> > In my proposal this looks like:
> > 
> > 
> > dev1 = open();
> > ioctl(dev2, SHARE, dev1);
> > ioctl(dev3, SHARE, dev1);
> > ioctl(dev4, SHARE, dev1);
> > 
> > So we actually save an ioctl.
> 
> This is not any hot path, so saving an ioctl shouldn't be a consideration.
> Only important consideration is a good API.  I may have lost context here,
> but the SHARE API is limited to the vfio fd.  The BIND API expects a new
> iommu object.  Are there other uses for this object?  Tom's current vfio
> driver exposes a dma mapping interface, would the iommu object expose
> one as well?  Current interface is device specific DMA interface for
> host device drivers typically mapping in-flight dma buffers, and IOMMU
> specific interface for assigned devices typically mapping entire virtual
> address space.

Actually, it's a domain object - which may be usable among iommus (Joerg?).
However, you can't really do the dma mapping with just the domain because
every device supports a different size address space as a master, i.e.,
the dma_mask.

And I don't know how kvm would deal with devices with varying dma mask support,
or why they'd be in the same domain.



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 18:09                                           ` Tom Lyon
@ 2010-06-02 19:46                                             ` Joerg Roedel
  0 siblings, 0 replies; 66+ messages in thread
From: Joerg Roedel @ 2010-06-02 19:46 UTC (permalink / raw)
  To: Tom Lyon
  Cc: Chris Wright, Michael S. Tsirkin, Avi Kivity, linux-kernel, kvm,
	hjk, gregkh, aafabbri, scofeldm

On Wed, Jun 02, 2010 at 11:09:17AM -0700, Tom Lyon wrote:
> On Wednesday 02 June 2010 10:46:15 am Chris Wright wrote:

> > This is not any hot path, so saving an ioctl shouldn't be a consideration.
> > Only important consideration is a good API.  I may have lost context here,
> > but the SHARE API is limited to the vfio fd.  The BIND API expects a new
> > iommu object.  Are there other uses for this object?  Tom's current vfio
> > driver exposes a dma mapping interface, would the iommu object expose
> > one as well?  Current interface is device specific DMA interface for
> > host device drivers typically mapping in-flight dma buffers, and IOMMU
> > specific interface for assigned devices typically mapping entire virtual
> > address space.
> 
> Actually, it's a domain object - which may be usable among iommus (Joerg?).

Yes, this 'iommu' thing would be a domain object. But sharing among
iommus is not necessary because the fact that there are multiple iommus
in the system is hidden by the iommu drivers. This fact is not even
exposed by the iommu-api. This makes protection domains system global.

> However, you can't really do the dma mapping with just the domain because
> every device supports a different size address space as a master, i.e.,
> the dma_mask.

The dma_mask has to be handled by the device driver. With the
iommu-mapping interface the driver can specify the target io-address and
has to consider the dma_mask for that too.

> And I don't know how kvm would deal with devices with varying dma mask support,
> or why they'd be in the same domain.

KVM does not care about these masks. This is the business of the guest
device drivers.

	Joerg



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 17:46                                         ` Chris Wright
  2010-06-02 18:09                                           ` Tom Lyon
@ 2010-06-03  6:23                                           ` Avi Kivity
  2010-06-03 21:41                                             ` Tom Lyon
  1 sibling, 1 reply; 66+ messages in thread
From: Avi Kivity @ 2010-06-03  6:23 UTC (permalink / raw)
  To: Chris Wright
  Cc: Joerg Roedel, Michael S. Tsirkin, Tom Lyon, linux-kernel, kvm,
	hjk, gregkh, aafabbri, scofeldm

On 06/02/2010 08:46 PM, Chris Wright wrote:
> The BIND API expects a new
> iommu object.  Are there other uses for this object?

Both kvm and vhost use similar memory maps, so they could use the new 
object (without invoking the iommu unless they want dma).

> Tom's current vfio
> driver exposes a dma mapping interface, would the iommu object expose
> one as well?  Current interface is device specific DMA interface for
> host device drivers typically mapping in-flight dma buffers, and IOMMU
> specific interface for assigned devices typically mapping entire virtual
> address space.
>    

A per-request mapping sounds like a device API since it would only 
affect that device (whereas the address space API affects multiple devices).

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-03  6:23                                           ` Avi Kivity
@ 2010-06-03 21:41                                             ` Tom Lyon
  2010-06-06  9:54                                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Tom Lyon @ 2010-06-03 21:41 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Chris Wright, Joerg Roedel, Michael S. Tsirkin, linux-kernel,
	kvm, hjk, gregkh, aafabbri, scofeldm

OK, in the interest of making progress, I am about to embark on the following:

1. Create a user-iommu-domain driver - opening it will give a new empty domain.
    Ultimately this can also populate sysfs with the state of its world, which would
    also be a good addition to the base iommu stuff.
    If someone closes the fd while in use, the domain stays valid anyway until users
    drop off.

2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver.  Require that
   a domain be set before using the VFIO_DMA_MAP_IOVA ioctl (this is the one
   that KVM wants).  However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
   which uses the dma_sg interface which has no explicit control of domains. I
   intend to keep it the way it is, but expect only non-hypervisor programs would
   want to use it.

3. Clean up the docs and other nits that folks have found.

Comments? 


* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-03 21:41                                             ` Tom Lyon
@ 2010-06-06  9:54                                               ` Michael S. Tsirkin
  2010-06-07 19:01                                                 ` Tom Lyon
  0 siblings, 1 reply; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-06-06  9:54 UTC (permalink / raw)
  To: Tom Lyon
  Cc: Avi Kivity, Chris Wright, Joerg Roedel, linux-kernel, kvm, hjk,
	gregkh, aafabbri, scofeldm

On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote:
> OK, in the interest of making progress, I am about to embark on the following:
> 
> 1. Create a user-iommu-domain driver - opening it will give a new empty domain.
>     Ultimately this can also populate sysfs with the state of its world, which would
>     also be a good addition to the base iommu stuff.
>     If someone closes the fd while in use, the domain stays valid anyway until users
>     drop off.
> 
> 2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver.  Require that
>    a domain be set before using the VFIO_DMA_MAP_IOVA ioctl

Require domain to be set before you allow any access to the device:
mmap, write, read.  IMO this is the only safe way to make sure userspace
does not corrupt memory, and this removes the need to special-case
MSI memory, play with bus master enable and hope it can be cleared without
reset, etc.

> (this is the one
>    that KVM wants).

Not sure I understand. I think that MAP should be done on the domain,
not the device, this handles pinning pages correctly and
this way you don't need any special checks.

>    However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
> >    which uses the dma_sg interface which has no explicit control of domains. I
>    intend to keep it the way it is, but expect only non-hypervisor programs would
>    want to use it.

If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't
non-hypervisors just pick an address?

> 3. Clean up the docs and other nits that folks have found.
> 
> Comments? 


* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-02 16:53                                     ` Chris Wright
@ 2010-06-06 13:44                                       ` Avi Kivity
  0 siblings, 0 replies; 66+ messages in thread
From: Avi Kivity @ 2010-06-06 13:44 UTC (permalink / raw)
  To: Chris Wright
  Cc: Joerg Roedel, Michael S. Tsirkin, Tom Lyon, linux-kernel, kvm,
	hjk, gregkh, aafabbri, scofeldm

On 06/02/2010 07:53 PM, Chris Wright wrote:
> * Avi Kivity (avi@redhat.com) wrote:
>    
>> The interface would only work for clients which support it: kvm,
>> vhost, and iommu/devices with restartable dma.
>>      
> BTW, there is no such thing as restartable dma.  There is a provision in
> new specs (read: no real hardware) that allows a device to request pages
> before using them.  So it's akin to demand paging, but the demand is an
> explicit request rather than a page fault.
>    

Thanks for the correction.

-- 
error compiling committee.c: too many arguments to function



* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-06  9:54                                               ` Michael S. Tsirkin
@ 2010-06-07 19:01                                                 ` Tom Lyon
  2010-06-08 21:22                                                   ` Michael S. Tsirkin
  0 siblings, 1 reply; 66+ messages in thread
From: Tom Lyon @ 2010-06-07 19:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Avi Kivity, Chris Wright, Joerg Roedel, linux-kernel, kvm, hjk,
	gregkh, aafabbri, scofeldm

On Sunday 06 June 2010 02:54:51 am Michael S. Tsirkin wrote:
> On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote:
> > OK, in the interest of making progress, I am about to embark on the following:
> > 
> > 1. Create a user-iommu-domain driver - opening it will give a new empty domain.
> >     Ultimately this can also populate sysfs with the state of its world, which would
> >     also be a good addition to the base iommu stuff.
> >     If someone closes the fd while in use, the domain stays valid anyway until users
> >     drop off.
> > 
> > 2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver.  Require that
> >    a domain be set before using the VFIO_DMA_MAP_IOVA ioctl
> 
> Require domain to be set before you allow any access to the device:
> mmap, write, read.  IMO this is the only safe way to make sure userspace
> does not corrupt memory, and this removes the need to special-case
> MSI memory, play with bus master enable and hope it can be cleared without
> reset, etc.

Michael - the light bulb finally lit for me and I now understand what you've been
saying the past few weeks.  Of course you're right - we need iommu set before any
register access.  I had thought that was done by default but now I see that the 
dma_map_sg routine only attaches to the iommu on demand.

So I will torpedo the MAP_ANYWHERE stuff. I'd like to keep the MAP_IOVA ioctl
with the vfio fd so that the user can still do everything with one fd. I'm thinking the
fd opens and iommu bindings could be done in a program before spinning out the
program with the user driver.
> 
> > (this is the one
> >    that KVM wants).
> 
> Not sure I understand. I think that MAP should be done on the domain,
> not the device, this handles pinning pages correctly and
> this way you don't need any special checks.
> 
> >    However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
> >    which uses the dma_sg interface which has no explicit control of domains. I
> >    intend to keep it the way it is, but expect only non-hypervisor programs would
> >    want to use it.
> 
> If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't
> non-hypervisors just pick an address?
> 
> > 3. Clean up the docs and other nits that folks have found.
> > 
> > Comments? 
> 




* Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers
  2010-06-07 19:01                                                 ` Tom Lyon
@ 2010-06-08 21:22                                                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 66+ messages in thread
From: Michael S. Tsirkin @ 2010-06-08 21:22 UTC (permalink / raw)
  To: Tom Lyon
  Cc: Avi Kivity, Chris Wright, Joerg Roedel, linux-kernel, kvm, hjk,
	gregkh, aafabbri, scofeldm

On Mon, Jun 07, 2010 at 12:01:04PM -0700, Tom Lyon wrote:
> On Sunday 06 June 2010 02:54:51 am Michael S. Tsirkin wrote:
> > On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote:
> > > OK, in the interest of making progress, I am about to embark on the following:
> > > 
> > > 1. Create a user-iommu-domain driver - opening it will give a new empty domain.
> > >     Ultimately this can also populate sysfs with the state of its world, which would
> > >     also be a good addition to the base iommu stuff.
> > >     If someone closes the fd while in use, the domain stays valid anyway until users
> > >     drop off.
> > > 
> > > 2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver.  Require that
> > >    a domain be set before using the VFIO_DMA_MAP_IOVA ioctl
> > 
> > Require domain to be set before you allow any access to the device:
> > mmap, write, read.  IMO this is the only safe way to make sure userspace
> > does not corrupt memory, and this removes the need to special-case
> > MSI memory, play with bus master enable and hope it can be cleared without
> > reset, etc.
> 
> Michael - the light bulb finally lit for me and I now understand what you've been
> saying the past few weeks.  Of course you're right - we need iommu set before any
> register access.  I had thought that was done by default but now I see that the 
> dma_map_sg routine only attaches to the iommu on demand.
> 
> So I will torpedo the MAP_ANYWHERE stuff. I'd like to keep the MAP_IOVA ioctl
> with the vfio fd so that the user can still do everything with one fd. I'm thinking the
> fd opens and iommu bindings could be done in a program before spinning out the
> program with the user driver.

This would kind of break Avi's idea that mappings are programmed
at the domain and shared by multiple devices, wouldn't it?

> > 
> > > (this is the one
> > >    that KVM wants).
> > 
> > Not sure I understand. I think that MAP should be done on the domain,
> > not the device, this handles pinning pages correctly and
> > this way you don't need any special checks.
> > 
> > >    However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
> > >    which uses the dma_sg interface which has no explicit control of domains. I
> > >    intend to keep it the way it is, but expect only non-hypervisor programs would
> > >    want to use it.
> > 
> > If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't
> > non-hypervisors just pick an address?
> > 
> > > 3. Clean up the docs and other nits that folks have found.
> > > 
> > > Comments? 
> > 
> 


end of thread, other threads:[~2010-06-08 21:27 UTC | newest]

Thread overview: 66+ messages
2010-05-28 23:07 [PATCH] VFIO driver: Non-privileged user level PCI drivers Tom Lyon
2010-05-28 23:36 ` Randy Dunlap
2010-05-28 23:56 ` Randy Dunlap
2010-05-29 11:55 ` Arnd Bergmann
2010-05-29 12:16   ` Avi Kivity
2010-05-30 12:19 ` Michael S. Tsirkin
2010-05-30 12:27   ` Avi Kivity
2010-05-30 12:49     ` Michael S. Tsirkin
2010-05-30 13:01       ` Avi Kivity
2010-05-30 13:03         ` Michael S. Tsirkin
2010-05-30 13:13           ` Avi Kivity
2010-05-30 14:53             ` Michael S. Tsirkin
2010-05-31 11:50               ` Avi Kivity
2010-05-31 17:10                 ` Michael S. Tsirkin
2010-06-01  8:10                   ` Avi Kivity
2010-06-01  9:55                     ` Michael S. Tsirkin
2010-06-01 10:28                       ` Avi Kivity
2010-06-01 10:46                         ` Michael S. Tsirkin
2010-06-01 12:41                           ` Avi Kivity
2010-06-02  9:45                             ` Joerg Roedel
2010-06-02  9:49                               ` Avi Kivity
2010-06-02 10:04                                 ` Joerg Roedel
2010-06-02 10:09                                   ` Michael S. Tsirkin
2010-06-02 11:21                                   ` Avi Kivity
2010-06-02 16:53                                     ` Chris Wright
2010-06-06 13:44                                       ` Avi Kivity
2010-06-02 10:15                               ` Michael S. Tsirkin
2010-06-02 10:26                                 ` Joerg Roedel
2010-06-01 21:26                           ` Tom Lyon
2010-06-02  2:59                             ` Avi Kivity
2010-06-02  5:29                               ` Chris Wright
2010-06-02  5:40                                 ` Avi Kivity
2010-06-02  4:29                         ` Alex Williamson
2010-06-02  4:59                           ` Tom Lyon
2010-06-02  5:08                             ` Avi Kivity
2010-06-02  9:53                             ` Joerg Roedel
2010-06-02  9:42                       ` Joerg Roedel
2010-06-02  9:50                         ` Avi Kivity
2010-06-02  9:53                         ` Michael S. Tsirkin
2010-06-02 10:19                           ` Joerg Roedel
2010-06-02 10:21                             ` Michael S. Tsirkin
2010-06-02 10:35                               ` Joerg Roedel
2010-06-02 10:38                                 ` Michael S. Tsirkin
2010-06-02 11:12                                   ` Joerg Roedel
2010-06-02 11:21                                     ` Michael S. Tsirkin
2010-06-02 12:19                                       ` Joerg Roedel
2010-06-02 12:25                                         ` Avi Kivity
2010-06-02 12:50                                           ` Joerg Roedel
2010-06-02 13:06                                             ` Avi Kivity
2010-06-02 13:53                                               ` Joerg Roedel
2010-06-02 13:17                                             ` Michael S. Tsirkin
2010-06-02 14:01                                               ` Joerg Roedel
2010-06-02 12:34                                         ` Michael S. Tsirkin
2010-06-02 13:02                                           ` Joerg Roedel
2010-06-02 17:46                                         ` Chris Wright
2010-06-02 18:09                                           ` Tom Lyon
2010-06-02 19:46                                             ` Joerg Roedel
2010-06-03  6:23                                           ` Avi Kivity
2010-06-03 21:41                                             ` Tom Lyon
2010-06-06  9:54                                               ` Michael S. Tsirkin
2010-06-07 19:01                                                 ` Tom Lyon
2010-06-08 21:22                                                   ` Michael S. Tsirkin
2010-06-02 10:44                             ` Michael S. Tsirkin
2010-05-30 12:59 ` Avi Kivity
2010-05-31 17:17 ` Alan Cox
2010-06-01 21:29   ` Tom Lyon
