* [PATCH 0/4] vfio: type1 multi-domain support & kvm-vfio coherency checking
@ 2014-02-17 20:23 Alex Williamson
  2014-02-17 20:23 ` [PATCH 1/4] vfio/iommu_type1: Multi-IOMMU domain support Alex Williamson
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Alex Williamson @ 2014-02-17 20:23 UTC (permalink / raw)
  To: alex.williamson, kvm; +Cc: linux-kernel

This series switches the vfio type1 IOMMU backend to support multiple
IOMMU domains per container.  As outlined in 1/4, this provides several
advantages, including support for features like IOMMU_CACHE and
bus_type independence.  With that in place, we can provide an
interface indicating whether the container is fully cache coherent
and export it through the external user interface.  This allows the
kvm-vfio pseudo device to properly test for non-coherent DMA rather
than assuming it whenever a kvm-vfio device is registered.

Patch 1/4 also introduces a new v2 type1 IOMMU backend, which differs
subtly in unmap behavior as described in the patch itself.  For QEMU's
use of VFIO there is no change, and switching to v2 is transparent.
I'd appreciate feedback on whether we should simply declare the
previous behavior broken or, as implemented here, keep supporting a
compatible mode.  Comments welcome.  Thanks,

Alex

---

Alex Williamson (4):
      vfio/iommu_type1: Multi-IOMMU domain support
      vfio/type1: Add extension to test DMA cache coherence of IOMMU
      vfio: Add external user check extension interface
      kvm/vfio: Support for DMA coherent IOMMUs


 drivers/vfio/vfio.c             |    6 
 drivers/vfio/vfio_iommu_type1.c |  656 +++++++++++++++++++++------------------
 include/linux/vfio.h            |    2 
 include/uapi/linux/vfio.h       |    6 
 virt/kvm/vfio.c                 |   27 +-
 5 files changed, 389 insertions(+), 308 deletions(-)
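
For illustration, a minimal userspace sketch of setting up a container
that prefers the new v2 backend and falls back to v1 where it isn't
offered.  The group number, the helper name and the lack of error
handling are placeholders, and it assumes a <linux/vfio.h> updated by
this series:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Attach one group (e.g. "/dev/vfio/26") and prefer the v2 backend */
static int setup_container(const char *group_path)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open(group_path, O_RDWR);
	int type;

	/* At least one group must be attached before VFIO_SET_IOMMU */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

	if (ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU) > 0)
		type = VFIO_TYPE1v2_IOMMU;
	else
		type = VFIO_TYPE1_IOMMU;

	ioctl(container, VFIO_SET_IOMMU, type);

	return container;
}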


* [PATCH 1/4] vfio/iommu_type1: Multi-IOMMU domain support
  2014-02-17 20:23 [PATCH 0/4] vfio: type1 multi-domain support & kvm-vfio coherency checking Alex Williamson
@ 2014-02-17 20:23 ` Alex Williamson
  2014-03-18 10:24     ` Varun Sethi
       [not found]   ` <CAKKYfmFrRF8U3NKhkSgepDfKNeLQ8+2yiz8VSFBzMc58dNAPvg@mail.gmail.com>
  2014-02-17 20:24 ` [PATCH 2/4] vfio/type1: Add extension to test DMA cache coherence of IOMMU Alex Williamson
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 10+ messages in thread
From: Alex Williamson @ 2014-02-17 20:23 UTC (permalink / raw)
  To: alex.williamson, kvm; +Cc: Varun Sethi, linux-kernel

We currently have a problem that we cannot support advanced features
of an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee
that those features will be supported by all of the hardware units
involved with the domain over its lifetime.  For instance, the Intel
VT-d architecture does not require that all DRHDs support snoop
control.  If we create a domain based on a device behind a DRHD that
does support snoop control and enable SNP support via the IOMMU_CACHE
mapping option, we cannot then add a device behind a DRHD which does
not support snoop control or we'll get reserved bit faults from the
SNP bit in the pagetables.  To add to the complexity, we can't know
the properties of a domain until a device is attached.

We could pass this problem off to userspace and require that a
separate vfio container be used, but we don't know how to handle page
accounting in that case.  How would we know that a page pinned in one
container is the same page as one pinned in a different container, so
that we avoid double-billing the user for it?

The solution is therefore to support multiple IOMMU domains per
container.  In the majority of cases, only one domain will be required
since hardware is typically consistent within a system.  However, this
provides us the ability to validate compatibility of domains and
support mixed environments where page table flags can be different
between domains.

To do this, our DMA tracking needs to change.  We currently try to
coalesce user mappings into as few tracking entries as possible.  The
problem then becomes that we lose granularity of user mappings.  We've
never guaranteed that a user is able to unmap at a finer granularity
than the original mapping, but we must honor the granularity of the
original mapping.  This coalescing code is therefore removed, allowing
only unmaps that cover complete maps.  The change in accounting is
fairly small here; a typical QEMU VM will start out with roughly a
dozen entries, so it's arguable whether this coalescing was ever
needed.

We also move IOMMU domain creation to the point where a group is
attached to the container.  An interesting side-effect of this is that
we now have access to the device at the time of domain creation and
can probe the devices within the group to determine the bus_type.
This finally makes vfio_iommu_type1 completely device/bus agnostic.
In fact, each IOMMU domain can host devices on different buses managed
by different physical IOMMUs, and present a single DMA mapping
interface to the user.  When a new domain is created, mappings are
replayed to bring the IOMMU pagetables up to the state of the current
container.  And of course, DMA mapping and unmapping automatically
traverse all of the configured IOMMU domains.
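
To make the v2 unmap semantics concrete, a minimal userspace sketch;
it assumes a container already configured with VFIO_TYPE1v2_IOMMU and
a page-aligned 2MB buffer, and omits error handling:

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Map a 2MB buffer at an arbitrary example IOVA, then unmap it */
static int v2_map_unmap(int container, void *buf)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (unsigned long)buf,
		.iova  = 0x100000,
		.size  = 2 * 1024 * 1024,
	};
	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova  = map.iova,
		.size  = map.size,
	};

	if (ioctl(container, VFIO_IOMMU_MAP_DMA, &map))
		return -1;

	/*
	 * v2: an unmap exactly covering the original mapping succeeds
	 * and reports the size actually unmapped.  A request bisecting
	 * the mapping (e.g. only the first 1MB of it) instead fails
	 * with EINVAL rather than silently unmapping more or less.
	 */
	return ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
}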

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Varun Sethi <Varun.Sethi@freescale.com>
---
 drivers/vfio/vfio_iommu_type1.c |  637 +++++++++++++++++++++------------------
 include/uapi/linux/vfio.h       |    1 
 2 files changed, 336 insertions(+), 302 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 4fb7a8f..8c7bb9b 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -30,7 +30,6 @@
 #include <linux/iommu.h>
 #include <linux/module.h>
 #include <linux/mm.h>
-#include <linux/pci.h>		/* pci_bus_type */
 #include <linux/rbtree.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
@@ -55,11 +54,17 @@ MODULE_PARM_DESC(disable_hugepages,
 		 "Disable VFIO IOMMU support for IOMMU hugepages.");
 
 struct vfio_iommu {
-	struct iommu_domain	*domain;
+	struct list_head	domain_list;
 	struct mutex		lock;
 	struct rb_root		dma_list;
+	bool v2;
+};
+
+struct vfio_domain {
+	struct iommu_domain	*domain;
+	struct list_head	next;
 	struct list_head	group_list;
-	bool			cache;
+	int			prot;		/* IOMMU_CACHE */
 };
 
 struct vfio_dma {
@@ -99,7 +104,7 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
 	return NULL;
 }
 
-static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
+static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
 {
 	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
 	struct vfio_dma *dma;
@@ -118,7 +123,7 @@ static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
 	rb_insert_color(&new->node, &iommu->dma_list);
 }
 
-static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
+static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
 {
 	rb_erase(&old->node, &iommu->dma_list);
 }
@@ -322,32 +327,39 @@ static long vfio_unpin_pages(unsigned long pfn, long npage,
 	return unlocked;
 }
 
-static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
-			    dma_addr_t iova, size_t *size)
+static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
-	dma_addr_t start = iova, end = iova + *size;
+	dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
+	struct vfio_domain *domain, *d;
 	long unlocked = 0;
 
+	if (!dma->size)
+		return;
+	/*
+	 * We use the IOMMU to track the physical addresses, otherwise we'd
+	 * need a much more complicated tracking system.  Unfortunately that
+	 * means we need to use one of the iommu domains to figure out the
+	 * pfns to unpin.  The rest need to be unmapped in advance so we have
+	 * no iommu translations remaining when the pages are unpinned.
+	 */
+	domain = d = list_first_entry(&iommu->domain_list,
+				      struct vfio_domain, next);
+
+	list_for_each_entry_continue(d, &iommu->domain_list, next)
+		iommu_unmap(d->domain, dma->iova, dma->size);
+
 	while (iova < end) {
 		size_t unmapped;
 		phys_addr_t phys;
 
-		/*
-		 * We use the IOMMU to track the physical address.  This
-		 * saves us from having a lot more entries in our mapping
-		 * tree.  The downside is that we don't track the size
-		 * used to do the mapping.  We request unmap of a single
-		 * page, but expect IOMMUs that support large pages to
-		 * unmap a larger chunk.
-		 */
-		phys = iommu_iova_to_phys(iommu->domain, iova);
+		phys = iommu_iova_to_phys(domain->domain, iova);
 		if (WARN_ON(!phys)) {
 			iova += PAGE_SIZE;
 			continue;
 		}
 
-		unmapped = iommu_unmap(iommu->domain, iova, PAGE_SIZE);
-		if (!unmapped)
+		unmapped = iommu_unmap(domain->domain, iova, PAGE_SIZE);
+		if (WARN_ON(!unmapped))
 			break;
 
 		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
@@ -357,119 +369,26 @@ static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 	}
 
 	vfio_lock_acct(-unlocked);
-
-	*size = iova - start;
-
-	return 0;
 }
 
-static int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
-				   size_t *size, struct vfio_dma *dma)
+static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
-	size_t offset, overlap, tmp;
-	struct vfio_dma *split;
-	int ret;
-
-	if (!*size)
-		return 0;
-
-	/*
-	 * Existing dma region is completely covered, unmap all.  This is
-	 * the likely case since userspace tends to map and unmap buffers
-	 * in one shot rather than multiple mappings within a buffer.
-	 */
-	if (likely(start <= dma->iova &&
-		   start + *size >= dma->iova + dma->size)) {
-		*size = dma->size;
-		ret = vfio_unmap_unpin(iommu, dma, dma->iova, size);
-		if (ret)
-			return ret;
-
-		/*
-		 * Did we remove more than we have?  Should never happen
-		 * since a vfio_dma is contiguous in iova and vaddr.
-		 */
-		WARN_ON(*size != dma->size);
-
-		vfio_remove_dma(iommu, dma);
-		kfree(dma);
-		return 0;
-	}
-
-	/* Overlap low address of existing range */
-	if (start <= dma->iova) {
-		overlap = start + *size - dma->iova;
-		ret = vfio_unmap_unpin(iommu, dma, dma->iova, &overlap);
-		if (ret)
-			return ret;
-
-		vfio_remove_dma(iommu, dma);
-
-		/*
-		 * Check, we may have removed to whole vfio_dma.  If not
-		 * fixup and re-insert.
-		 */
-		if (overlap < dma->size) {
-			dma->iova += overlap;
-			dma->vaddr += overlap;
-			dma->size -= overlap;
-			vfio_insert_dma(iommu, dma);
-		} else
-			kfree(dma);
-
-		*size = overlap;
-		return 0;
-	}
-
-	/* Overlap high address of existing range */
-	if (start + *size >= dma->iova + dma->size) {
-		offset = start - dma->iova;
-		overlap = dma->size - offset;
-
-		ret = vfio_unmap_unpin(iommu, dma, start, &overlap);
-		if (ret)
-			return ret;
-
-		dma->size -= overlap;
-		*size = overlap;
-		return 0;
-	}
-
-	/* Split existing */
-
-	/*
-	 * Allocate our tracking structure early even though it may not
-	 * be used.  An Allocation failure later loses track of pages and
-	 * is more difficult to unwind.
-	 */
-	split = kzalloc(sizeof(*split), GFP_KERNEL);
-	if (!split)
-		return -ENOMEM;
-
-	offset = start - dma->iova;
-
-	ret = vfio_unmap_unpin(iommu, dma, start, size);
-	if (ret || !*size) {
-		kfree(split);
-		return ret;
-	}
-
-	tmp = dma->size;
+	vfio_unmap_unpin(iommu, dma);
+	vfio_unlink_dma(iommu, dma);
+	kfree(dma);
+}
 
-	/* Resize the lower vfio_dma in place, before the below insert */
-	dma->size = offset;
+static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *domain;
+	unsigned long bitmap = PAGE_MASK;
 
-	/* Insert new for remainder, assuming it didn't all get unmapped */
-	if (likely(offset + *size < tmp)) {
-		split->size = tmp - offset - *size;
-		split->iova = dma->iova + offset + *size;
-		split->vaddr = dma->vaddr + offset + *size;
-		split->prot = dma->prot;
-		vfio_insert_dma(iommu, split);
-	} else
-		kfree(split);
+	mutex_lock(&iommu->lock);
+	list_for_each_entry(domain, &iommu->domain_list, next)
+		bitmap &= domain->domain->ops->pgsize_bitmap;
+	mutex_unlock(&iommu->lock);
 
-	return 0;
+	return bitmap;
 }
 
 static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
@@ -477,10 +396,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 {
 	uint64_t mask;
 	struct vfio_dma *dma;
-	size_t unmapped = 0, size;
+	size_t unmapped = 0;
 	int ret = 0;
 
-	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+	mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
 
 	if (unmap->iova & mask)
 		return -EINVAL;
@@ -491,20 +410,61 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 
 	mutex_lock(&iommu->lock);
 
+	/*
+	 * vfio-iommu-type1 (v1) - User mappings were coalesced together to
+	 * avoid tracking individual mappings.  This means that the granularity
+	 * of the original mapping was lost and the user was allowed to attempt
+	 * to unmap any range.  Depending on the contiguousness of physical
+	 * memory and page sizes supported by the IOMMU, arbitrary unmaps may
+	 * or may not have worked.  We only guaranteed unmap granularity
+	 * matching the original mapping; even though it was untracked here,
+	 * the original mappings are reflected in IOMMU mappings.  This
+	 * resulted in a couple unusual behaviors.  First, if a range is not
+	 * able to be unmapped, ex. a set of 4k pages that was mapped as a
+	 * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
+	 * a zero sized unmap.  Also, if an unmap request overlaps the first
+	 * address of a hugepage, the IOMMU will unmap the entire hugepage.
+	 * This also returns success and the returned unmap size reflects the
+	 * actual size unmapped.
+	 *
+	 * We attempt to maintain compatibility with this "v1" interface, but
+	 * we take control out of the hands of the IOMMU.  Therefore, an unmap
+	 * request offset from the beginning of the original mapping will
+	 * return success with zero sized unmap.  And an unmap request covering
+	 * the first iova of mapping will unmap the entire range.
+	 *
+	 * The v2 version of this interface intends to be more deterministic.
+	 * Unmap requests must fully cover previous mappings.  Multiple
+	 * mappings may still be unmapped by specifying large ranges, but there
+	 * must not be any previous mappings bisected by the range.  An error
+	 * will be returned if these conditions are not met.  The v2 interface
+	 * will only return success and a size of zero if there were no
+	 * mappings within the range.
+	 */
+	if (iommu->v2) {
+		dma = vfio_find_dma(iommu, unmap->iova, 0);
+		if (dma && dma->iova != unmap->iova) {
+			ret = -EINVAL;
+			goto unlock;
+		}
+		dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1, 0);
+		if (dma && dma->iova + dma->size != unmap->iova + unmap->size) {
+			ret = -EINVAL;
+			goto unlock;
+		}
+	}
+
 	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
-		size = unmap->size;
-		ret = vfio_remove_dma_overlap(iommu, unmap->iova, &size, dma);
-		if (ret || !size)
+		if (!iommu->v2 && unmap->iova > dma->iova)
 			break;
-		unmapped += size;
+		unmapped += dma->size;
+		vfio_remove_dma(iommu, dma);
 	}
 
+unlock:
 	mutex_unlock(&iommu->lock);
 
-	/*
-	 * We may unmap more than requested, update the unmap struct so
-	 * userspace can know.
-	 */
+	/* Report how much was unmapped */
 	unmap->size = unmapped;
 
 	return ret;
@@ -516,22 +476,47 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
  * soon, so this is just a temporary workaround to break mappings down into
  * PAGE_SIZE.  Better to map smaller pages than nothing.
  */
-static int map_try_harder(struct vfio_iommu *iommu, dma_addr_t iova,
+static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
 			  unsigned long pfn, long npage, int prot)
 {
 	long i;
 	int ret;
 
 	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
-		ret = iommu_map(iommu->domain, iova,
+		ret = iommu_map(domain->domain, iova,
 				(phys_addr_t)pfn << PAGE_SHIFT,
-				PAGE_SIZE, prot);
+				PAGE_SIZE, prot | domain->prot);
 		if (ret)
 			break;
 	}
 
 	for (; i < npage && i > 0; i--, iova -= PAGE_SIZE)
-		iommu_unmap(iommu->domain, iova, PAGE_SIZE);
+		iommu_unmap(domain->domain, iova, PAGE_SIZE);
+
+	return ret;
+}
+
+static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
+			  unsigned long pfn, long npage, int prot)
+{
+	struct vfio_domain *d;
+	int ret;
+
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
+				npage << PAGE_SHIFT, prot | d->prot);
+		if (ret) {
+			if (ret != -EBUSY ||
+			    map_try_harder(d, iova, pfn, npage, prot))
+				goto unwind;
+		}
+	}
+
+	return 0;
+
+unwind:
+	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
+		iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);
 
 	return ret;
 }
@@ -545,12 +530,12 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	long npage;
 	int ret = 0, prot = 0;
 	uint64_t mask;
-	struct vfio_dma *dma = NULL;
+	struct vfio_dma *dma;
 	unsigned long pfn;
 
 	end = map->iova + map->size;
 
-	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
+	mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
 
 	/* READ/WRITE from device perspective */
 	if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
@@ -561,9 +546,6 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (!prot)
 		return -EINVAL; /* No READ/WRITE? */
 
-	if (iommu->cache)
-		prot |= IOMMU_CACHE;
-
 	if (vaddr & mask)
 		return -EINVAL;
 	if (map->iova & mask)
@@ -588,180 +570,257 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		return -EEXIST;
 	}
 
-	for (iova = map->iova; iova < end; iova += size, vaddr += size) {
-		long i;
+	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+	if (!dma) {
+		mutex_unlock(&iommu->lock);
+		return -ENOMEM;
+	}
+
+	dma->iova = map->iova;
+	dma->vaddr = map->vaddr;
+	dma->prot = prot;
 
+	/* Insert zero-sized and grow as we map chunks of it */
+	vfio_link_dma(iommu, dma);
+
+	for (iova = map->iova; iova < end; iova += size, vaddr += size) {
 		/* Pin a contiguous chunk of memory */
 		npage = vfio_pin_pages(vaddr, (end - iova) >> PAGE_SHIFT,
 				       prot, &pfn);
 		if (npage <= 0) {
 			WARN_ON(!npage);
 			ret = (int)npage;
-			goto out;
-		}
-
-		/* Verify pages are not already mapped */
-		for (i = 0; i < npage; i++) {
-			if (iommu_iova_to_phys(iommu->domain,
-					       iova + (i << PAGE_SHIFT))) {
-				ret = -EBUSY;
-				goto out_unpin;
-			}
+			break;
 		}
 
-		ret = iommu_map(iommu->domain, iova,
-				(phys_addr_t)pfn << PAGE_SHIFT,
-				npage << PAGE_SHIFT, prot);
+		/* Map it! */
+		ret = vfio_iommu_map(iommu, iova, pfn, npage, prot);
 		if (ret) {
-			if (ret != -EBUSY ||
-			    map_try_harder(iommu, iova, pfn, npage, prot)) {
-				goto out_unpin;
-			}
+			vfio_unpin_pages(pfn, npage, prot, true);
+			break;
 		}
 
 		size = npage << PAGE_SHIFT;
+		dma->size += size;
+	}
 
-		/*
-		 * Check if we abut a region below - nothing below 0.
-		 * This is the most likely case when mapping chunks of
-		 * physically contiguous regions within a virtual address
-		 * range.  Update the abutting entry in place since iova
-		 * doesn't change.
-		 */
-		if (likely(iova)) {
-			struct vfio_dma *tmp;
-			tmp = vfio_find_dma(iommu, iova - 1, 1);
-			if (tmp && tmp->prot == prot &&
-			    tmp->vaddr + tmp->size == vaddr) {
-				tmp->size += size;
-				iova = tmp->iova;
-				size = tmp->size;
-				vaddr = tmp->vaddr;
-				dma = tmp;
-			}
-		}
+	if (ret)
+		vfio_remove_dma(iommu, dma);
 
-		/*
-		 * Check if we abut a region above - nothing above ~0 + 1.
-		 * If we abut above and below, remove and free.  If only
-		 * abut above, remove, modify, reinsert.
-		 */
-		if (likely(iova + size)) {
-			struct vfio_dma *tmp;
-			tmp = vfio_find_dma(iommu, iova + size, 1);
-			if (tmp && tmp->prot == prot &&
-			    tmp->vaddr == vaddr + size) {
-				vfio_remove_dma(iommu, tmp);
-				if (dma) {
-					dma->size += tmp->size;
-					kfree(tmp);
-				} else {
-					size += tmp->size;
-					tmp->size = size;
-					tmp->iova = iova;
-					tmp->vaddr = vaddr;
-					vfio_insert_dma(iommu, tmp);
-					dma = tmp;
-				}
-			}
-		}
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static int vfio_bus_type(struct device *dev, void *data)
+{
+	struct bus_type **bus = data;
+
+	if (*bus && *bus != dev->bus)
+		return -EINVAL;
+
+	*bus = dev->bus;
+
+	return 0;
+}
+
+static int vfio_iommu_replay(struct vfio_iommu *iommu,
+			     struct vfio_domain *domain)
+{
+	struct vfio_domain *d;
+	struct rb_node *n;
+	int ret;
+
+	/* Arbitrarily pick the first domain in the list for lookups */
+	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
+	n = rb_first(&iommu->dma_list);
+
+	/* If there's not a domain, there better not be any mappings */
+	if (WARN_ON(n && !d))
+		return -EINVAL;
+
+	for (; n; n = rb_next(n)) {
+		struct vfio_dma *dma;
+		dma_addr_t iova;
+
+		dma = rb_entry(n, struct vfio_dma, node);
+		iova = dma->iova;
+
+		while (iova < dma->iova + dma->size) {
+			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
+			size_t size;
 
-		if (!dma) {
-			dma = kzalloc(sizeof(*dma), GFP_KERNEL);
-			if (!dma) {
-				iommu_unmap(iommu->domain, iova, size);
-				ret = -ENOMEM;
-				goto out_unpin;
+			if (WARN_ON(!phys)) {
+				iova += PAGE_SIZE;
+				continue;
 			}
 
-			dma->size = size;
-			dma->iova = iova;
-			dma->vaddr = vaddr;
-			dma->prot = prot;
-			vfio_insert_dma(iommu, dma);
-		}
-	}
+			size = PAGE_SIZE;
 
-	WARN_ON(ret);
-	mutex_unlock(&iommu->lock);
-	return ret;
+			while (iova + size < dma->iova + dma->size &&
+			       phys + size == iommu_iova_to_phys(d->domain,
+								 iova + size))
+				size += PAGE_SIZE;
 
-out_unpin:
-	vfio_unpin_pages(pfn, npage, prot, true);
+			ret = iommu_map(domain->domain, iova, phys,
+					size, dma->prot | domain->prot);
+			if (ret)
+				return ret;
 
-out:
-	iova = map->iova;
-	size = map->size;
-	while ((dma = vfio_find_dma(iommu, iova, size))) {
-		int r = vfio_remove_dma_overlap(iommu, iova,
-						&size, dma);
-		if (WARN_ON(r || !size))
-			break;
+			iova += size;
+		}
 	}
 
-	mutex_unlock(&iommu->lock);
-	return ret;
+	return 0;
 }
 
 static int vfio_iommu_type1_attach_group(void *iommu_data,
 					 struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
-	struct vfio_group *group, *tmp;
+	struct vfio_group *group, *g;
+	struct vfio_domain *domain, *d;
+	struct bus_type *bus = NULL;
 	int ret;
 
-	group = kzalloc(sizeof(*group), GFP_KERNEL);
-	if (!group)
-		return -ENOMEM;
-
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(tmp, &iommu->group_list, next) {
-		if (tmp->iommu_group == iommu_group) {
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		list_for_each_entry(g, &d->group_list, next) {
+			if (g->iommu_group != iommu_group)
+				continue;
+
 			mutex_unlock(&iommu->lock);
-			kfree(group);
 			return -EINVAL;
 		}
 	}
 
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
+	if (!group || !domain) {
+		ret = -ENOMEM;
+		goto out_free;
+	}
+
+	group->iommu_group = iommu_group;
+
+	/* Determine bus_type in order to allocate a domain */
+	ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
+	if (ret)
+		goto out_free;
+
+	domain->domain = iommu_domain_alloc(bus);
+	if (!domain->domain) {
+		ret = -EIO;
+		goto out_free;
+	}
+
+	ret = iommu_attach_group(domain->domain, iommu_group);
+	if (ret)
+		goto out_domain;
+
+	INIT_LIST_HEAD(&domain->group_list);
+	list_add(&group->next, &domain->group_list);
+
+	if (!allow_unsafe_interrupts &&
+	    !iommu_domain_has_cap(domain->domain, IOMMU_CAP_INTR_REMAP)) {
+		pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
+		       __func__);
+		ret = -EPERM;
+		goto out_detach;
+	}
+
+	if (iommu_domain_has_cap(domain->domain, IOMMU_CAP_CACHE_COHERENCY))
+		domain->prot |= IOMMU_CACHE;
+
 	/*
-	 * TODO: Domain have capabilities that might change as we add
-	 * groups (see iommu->cache, currently never set).  Check for
-	 * them and potentially disallow groups to be attached when it
-	 * would change capabilities (ugh).
+	 * Try to match an existing compatible domain.  We don't want to
+	 * preclude an IOMMU driver supporting multiple bus_types and being
+	 * able to include different bus_types in the same IOMMU domain, so
+	 * we test whether the domains use the same iommu_ops rather than
+	 * testing if they're on the same bus_type.
 	 */
-	ret = iommu_attach_group(iommu->domain, iommu_group);
-	if (ret) {
-		mutex_unlock(&iommu->lock);
-		kfree(group);
-		return ret;
+	list_for_each_entry(d, &iommu->domain_list, next) {
+		if (d->domain->ops == domain->domain->ops &&
+		    d->prot == domain->prot) {
+			iommu_detach_group(domain->domain, iommu_group);
+			if (!iommu_attach_group(d->domain, iommu_group)) {
+				list_add(&group->next, &d->group_list);
+				iommu_domain_free(domain->domain);
+				kfree(domain);
+				mutex_unlock(&iommu->lock);
+				return 0;
+			}
+
+			ret = iommu_attach_group(domain->domain, iommu_group);
+			if (ret)
+				goto out_domain;
+		}
 	}
 
-	group->iommu_group = iommu_group;
-	list_add(&group->next, &iommu->group_list);
+	/* replay mappings on new domains */
+	ret = vfio_iommu_replay(iommu, domain);
+	if (ret)
+		goto out_detach;
+
+	list_add(&domain->next, &iommu->domain_list);
 
 	mutex_unlock(&iommu->lock);
 
 	return 0;
+
+out_detach:
+	iommu_detach_group(domain->domain, iommu_group);
+out_domain:
+	iommu_domain_free(domain->domain);
+out_free:
+	kfree(domain);
+	kfree(group);
+	mutex_unlock(&iommu->lock);
+	return ret;
+}
+
+static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
+{
+	struct rb_node *node;
+
+	while ((node = rb_first(&iommu->dma_list)))
+		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
 }
 
 static void vfio_iommu_type1_detach_group(void *iommu_data,
 					  struct iommu_group *iommu_group)
 {
 	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain;
 	struct vfio_group *group;
 
 	mutex_lock(&iommu->lock);
 
-	list_for_each_entry(group, &iommu->group_list, next) {
-		if (group->iommu_group == iommu_group) {
-			iommu_detach_group(iommu->domain, iommu_group);
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		list_for_each_entry(group, &domain->group_list, next) {
+			if (group->iommu_group != iommu_group)
+				continue;
+
+			iommu_detach_group(domain->domain, iommu_group);
 			list_del(&group->next);
 			kfree(group);
-			break;
+			/*
+			 * Group ownership provides privilege, if the group
+			 * list is empty, the domain goes away.  If it's the
+			 * last domain, then all the mappings go away too.
+			 */
+			if (list_empty(&domain->group_list)) {
+				if (list_is_singular(&iommu->domain_list))
+					vfio_iommu_unmap_unpin_all(iommu);
+				iommu_domain_free(domain->domain);
+				list_del(&domain->next);
+				kfree(domain);
+			}
+			goto done;
 		}
 	}
 
+done:
 	mutex_unlock(&iommu->lock);
 }
 
@@ -769,40 +828,17 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 {
 	struct vfio_iommu *iommu;
 
-	if (arg != VFIO_TYPE1_IOMMU)
+	if (arg != VFIO_TYPE1_IOMMU && arg != VFIO_TYPE1v2_IOMMU)
 		return ERR_PTR(-EINVAL);
 
 	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
 	if (!iommu)
 		return ERR_PTR(-ENOMEM);
 
-	INIT_LIST_HEAD(&iommu->group_list);
+	INIT_LIST_HEAD(&iommu->domain_list);
 	iommu->dma_list = RB_ROOT;
 	mutex_init(&iommu->lock);
-
-	/*
-	 * Wish we didn't have to know about bus_type here.
-	 */
-	iommu->domain = iommu_domain_alloc(&pci_bus_type);
-	if (!iommu->domain) {
-		kfree(iommu);
-		return ERR_PTR(-EIO);
-	}
-
-	/*
-	 * Wish we could specify required capabilities rather than create
-	 * a domain, see what comes out and hope it doesn't change along
-	 * the way.  Fortunately we know interrupt remapping is global for
-	 * our iommus.
-	 */
-	if (!allow_unsafe_interrupts &&
-	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
-		pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
-		       __func__);
-		iommu_domain_free(iommu->domain);
-		kfree(iommu);
-		return ERR_PTR(-EPERM);
-	}
+	iommu->v2 = (arg == VFIO_TYPE1v2_IOMMU);
 
 	return iommu;
 }
@@ -810,25 +846,24 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 static void vfio_iommu_type1_release(void *iommu_data)
 {
 	struct vfio_iommu *iommu = iommu_data;
+	struct vfio_domain *domain, *domain_tmp;
 	struct vfio_group *group, *group_tmp;
-	struct rb_node *node;
 
-	list_for_each_entry_safe(group, group_tmp, &iommu->group_list, next) {
-		iommu_detach_group(iommu->domain, group->iommu_group);
-		list_del(&group->next);
-		kfree(group);
-	}
+	vfio_iommu_unmap_unpin_all(iommu);
 
-	while ((node = rb_first(&iommu->dma_list))) {
-		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
-		size_t size = dma->size;
-		vfio_remove_dma_overlap(iommu, dma->iova, &size, dma);
-		if (WARN_ON(!size))
-			break;
+	list_for_each_entry_safe(domain, domain_tmp,
+				 &iommu->domain_list, next) {
+		list_for_each_entry_safe(group, group_tmp,
+					 &domain->group_list, next) {
+			iommu_detach_group(domain->domain, group->iommu_group);
+			list_del(&group->next);
+			kfree(group);
+		}
+		iommu_domain_free(domain->domain);
+		list_del(&domain->next);
+		kfree(domain);
 	}
 
-	iommu_domain_free(iommu->domain);
-	iommu->domain = NULL;
 	kfree(iommu);
 }
 
@@ -841,6 +876,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 	if (cmd == VFIO_CHECK_EXTENSION) {
 		switch (arg) {
 		case VFIO_TYPE1_IOMMU:
+		case VFIO_TYPE1v2_IOMMU:
 			return 1;
 		default:
 			return 0;
@@ -858,7 +894,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 
 		info.flags = 0;
 
-		info.iova_pgsizes = iommu->domain->ops->pgsize_bitmap;
+		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
 
 		return copy_to_user((void __user *)arg, &info, minsz);
 
@@ -911,9 +947,6 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 
 static int __init vfio_iommu_type1_init(void)
 {
-	if (!iommu_present(&pci_bus_type))
-		return -ENODEV;
-
 	return vfio_register_iommu_driver(&vfio_iommu_driver_ops_type1);
 }
 
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0fd47f5..460fdf2 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -23,6 +23,7 @@
 
 #define VFIO_TYPE1_IOMMU		1
 #define VFIO_SPAPR_TCE_IOMMU		2
+#define VFIO_TYPE1v2_IOMMU		3
 
 /*
  * The IOCTL interface is designed for extensibility by embedding the



* [PATCH 2/4] vfio/type1: Add extension to test DMA cache coherence of IOMMU
  2014-02-17 20:23 [PATCH 0/4] vfio: type1 multi-domain support & kvm-vfio coherency checking Alex Williamson
  2014-02-17 20:23 ` [PATCH 1/4] vfio/iommu_type1: Multi-IOMMU domain support Alex Williamson
@ 2014-02-17 20:24 ` Alex Williamson
  2014-02-17 20:24 ` [PATCH 3/4] vfio: Add external user check extension interface Alex Williamson
  2014-02-17 20:24 ` [PATCH 4/4] kvm/vfio: Support for DMA coherent IOMMUs Alex Williamson
  3 siblings, 0 replies; 10+ messages in thread
From: Alex Williamson @ 2014-02-17 20:24 UTC (permalink / raw)
  To: alex.williamson, kvm; +Cc: linux-kernel

Now that the type1 IOMMU backend can support IOMMU_CACHE, we need to
be able to test whether coherency is currently enforced.  Add an
extension for this.
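
A minimal userspace sketch of the new check; the container fd is
assumed to already have an IOMMU backend set with VFIO_SET_IOMMU
(with no backend configured the extension simply reports 0):

	/*
	 * Nonzero only if every IOMMU domain backing the container
	 * enforces DMA cache coherency (IOMMU_CACHE).
	 */
	int coherent = ioctl(container, VFIO_CHECK_EXTENSION,
			     VFIO_DMA_CC_IOMMU) > 0;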

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c |   21 +++++++++++++++++++++
 include/uapi/linux/vfio.h       |    5 +++++
 2 files changed, 26 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 8c7bb9b..1f90344 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -867,6 +867,23 @@ static void vfio_iommu_type1_release(void *iommu_data)
 	kfree(iommu);
 }
 
+static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *domain;
+	int ret = 1;
+
+	mutex_lock(&iommu->lock);
+	list_for_each_entry(domain, &iommu->domain_list, next) {
+		if (!(domain->prot & IOMMU_CACHE)) {
+			ret = 0;
+			break;
+		}
+	}
+	mutex_unlock(&iommu->lock);
+
+	return ret;
+}
+
 static long vfio_iommu_type1_ioctl(void *iommu_data,
 				   unsigned int cmd, unsigned long arg)
 {
@@ -878,6 +895,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
 		case VFIO_TYPE1_IOMMU:
 		case VFIO_TYPE1v2_IOMMU:
 			return 1;
+		case VFIO_DMA_CC_IOMMU:
+			if (!iommu)
+				return 0;
+			return vfio_domains_have_iommu_cache(iommu);
 		default:
 			return 0;
 		}
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 460fdf2..cb9023d 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -24,6 +24,11 @@
 #define VFIO_TYPE1_IOMMU		1
 #define VFIO_SPAPR_TCE_IOMMU		2
 #define VFIO_TYPE1v2_IOMMU		3
+/*
+ * IOMMU enforces DMA cache coherence (ex. PCIe NoSnoop stripping).  This
+ * capability is subject to change as groups are added or removed.
+ */
+#define VFIO_DMA_CC_IOMMU		4
 
 /*
  * The IOCTL interface is designed for extensibility by embedding the



* [PATCH 3/4] vfio: Add external user check extension interface
  2014-02-17 20:23 [PATCH 0/4] vfio: type1 multi-domain support & kvm-vfio coherency checking Alex Williamson
  2014-02-17 20:23 ` [PATCH 1/4] vfio/iommu_type1: Multi-IOMMU domain support Alex Williamson
  2014-02-17 20:24 ` [PATCH 2/4] vfio/type1: Add extension to test DMA cache coherence of IOMMU Alex Williamson
@ 2014-02-17 20:24 ` Alex Williamson
  2014-02-17 20:24 ` [PATCH 4/4] kvm/vfio: Support for DMA coherent IOMMUs Alex Williamson
  3 siblings, 0 replies; 10+ messages in thread
From: Alex Williamson @ 2014-02-17 20:24 UTC (permalink / raw)
  To: alex.williamson, kvm; +Cc: linux-kernel

This lets us check extensions, particularly VFIO_DMA_CC_IOMMU, through
the external user interface, allowing KVM to probe IOMMU coherency.
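
For reference, a minimal sketch of the expected in-kernel calling
pattern, mirroring how the kvm-vfio device uses it in patch 4/4; the
caller is assumed to hold a vfio_group reference obtained from
vfio_group_get_external_user():

	long (*fn)(struct vfio_group *, unsigned long);
	bool coherent = false;

	fn = symbol_get(vfio_external_check_extension);
	if (fn) {
		coherent = fn(vfio_group, VFIO_DMA_CC_IOMMU) > 0;
		symbol_put(vfio_external_check_extension);
	}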

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio.c  |    6 ++++++
 include/linux/vfio.h |    2 ++
 2 files changed, 8 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 21271d8..512f479 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1413,6 +1413,12 @@ int vfio_external_user_iommu_id(struct vfio_group *group)
 }
 EXPORT_SYMBOL_GPL(vfio_external_user_iommu_id);
 
+long vfio_external_check_extension(struct vfio_group *group, unsigned long arg)
+{
+	return vfio_ioctl_check_extension(group->container, arg);
+}
+EXPORT_SYMBOL_GPL(vfio_external_check_extension);
+
 /**
  * Module/class support
  */
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 24579a0..81022a52 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -96,5 +96,7 @@ extern void vfio_unregister_iommu_driver(
 extern struct vfio_group *vfio_group_get_external_user(struct file *filep);
 extern void vfio_group_put_external_user(struct vfio_group *group);
 extern int vfio_external_user_iommu_id(struct vfio_group *group);
+extern long vfio_external_check_extension(struct vfio_group *group,
+					  unsigned long arg);
 
 #endif /* VFIO_H */



* [PATCH 4/4] kvm/vfio: Support for DMA coherent IOMMUs
  2014-02-17 20:23 [PATCH 0/4] vfio: type1 multi-domain support & kvm-vfio coherency checking Alex Williamson
                   ` (2 preceding siblings ...)
  2014-02-17 20:24 ` [PATCH 3/4] vfio: Add external user check extension interface Alex Williamson
@ 2014-02-17 20:24 ` Alex Williamson
  2014-02-18 10:38   ` Paolo Bonzini
  3 siblings, 1 reply; 10+ messages in thread
From: Alex Williamson @ 2014-02-17 20:24 UTC (permalink / raw)
  To: alex.williamson, kvm; +Cc: Gleb Natapov, Paolo Bonzini, linux-kernel

VFIO now has support for using the IOMMU_CACHE flag and a mechanism
for an external user to test the current operating mode of the IOMMU.
Add support for this to the kvm-vfio pseudo device so that we only
register noncoherent DMA when necessary.
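
For context, a minimal sketch of the VMM side: the VFIO group is
handed to the kvm-vfio pseudo device so KVM can evaluate coherency
through the new extension.  It assumes <linux/kvm.h>, an existing
vm_fd and group_fd, and omits error handling:

	struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };
	__s32 fd = group_fd;	/* VFIO group file descriptor (assumed) */
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr  = KVM_DEV_VFIO_GROUP_ADD,
		.addr  = (unsigned long)&fd,
	};

	/* cd.fd is filled in with the new kvm-vfio device fd */
	ioctl(vm_fd, KVM_CREATE_DEVICE, &cd);
	ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr);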

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
---
 virt/kvm/vfio.c |   27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index b4f9507..ba1a93f 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -59,6 +59,22 @@ static void kvm_vfio_group_put_external_user(struct vfio_group *vfio_group)
 	symbol_put(vfio_group_put_external_user);
 }
 
+static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
+{
+	long (*fn)(struct vfio_group *, unsigned long);
+	long ret;
+
+	fn = symbol_get(vfio_external_check_extension);
+	if (!fn)
+		return false;
+
+	ret = fn(vfio_group, VFIO_DMA_CC_IOMMU);
+
+	symbol_put(vfio_external_check_extension);
+
+	return ret > 0;
+}
+
 /*
  * Groups can use the same or different IOMMU domains.  If the same then
  * adding a new group may change the coherency of groups we've previously
@@ -75,13 +91,10 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
 	mutex_lock(&kv->lock);
 
 	list_for_each_entry(kvg, &kv->group_list, node) {
-		/*
-		 * TODO: We need an interface to check the coherency of
-		 * the IOMMU domain this group is using.  For now, assume
-		 * it's always noncoherent.
-		 */
-		noncoherent = true;
-		break;
+		if (!kvm_vfio_group_is_coherent(kvg->vfio_group)) {
+			noncoherent = true;
+			break;
+		}
 	}
 
 	if (noncoherent != kv->noncoherent) {



* Re: [PATCH 4/4] kvm/vfio: Support for DMA coherent IOMMUs
  2014-02-17 20:24 ` [PATCH 4/4] kvm/vfio: Support for DMA coherent IOMMUs Alex Williamson
@ 2014-02-18 10:38   ` Paolo Bonzini
  0 siblings, 0 replies; 10+ messages in thread
From: Paolo Bonzini @ 2014-02-18 10:38 UTC (permalink / raw)
  To: Alex Williamson, kvm; +Cc: Gleb Natapov, linux-kernel

On 17/02/2014 21:24, Alex Williamson wrote:
> VFIO now has support for using the IOMMU_CACHE flag and a mechanism
> for an external user to test the current operating mode of the IOMMU.
> Add support for this to the kvm-vfio pseudo device so that we only
> register noncoherent DMA when necessary.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> Cc: Gleb Natapov <gleb@kernel.org>
> Cc: Paolo Bonzini <pbonzini@redhat.com>

Acked-by: Paolo Bonzini <pbonzini@redhat.com>


* RE: [PATCH 1/4] vfio/iommu_type1: Multi-IOMMU domain support
  2014-02-17 20:23 ` [PATCH 1/4] vfio/iommu_type1: Multi-IOMMU domain support Alex Williamson
@ 2014-03-18 10:24     ` Varun Sethi
       [not found]   ` <CAKKYfmFrRF8U3NKhkSgepDfKNeLQ8+2yiz8VSFBzMc58dNAPvg@mail.gmail.com>
  1 sibling, 0 replies; 10+ messages in thread
From: Varun Sethi @ 2014-03-18 10:24 UTC (permalink / raw)
  To: Alex Williamson, kvm; +Cc: linux-kernel


Hi Alex,
Would it make sense to link the iommu group to the capabilities of its
corresponding hardware iommu block?  This could be done if we can
determine the iommu device corresponding to the iommu group during bus
probe.  With this we wouldn't have to wait until device attach to
determine the domain capabilities (or the underlying iommu
capabilities).  In vfio we could simply attach iommu groups with
similar iommu capabilities to the same domain.

Regards
Varun

> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Tuesday, February 18, 2014 1:54 AM
> To: alex.williamson@redhat.com; kvm@vger.kernel.org
> Cc: Sethi Varun-B16395; linux-kernel@vger.kernel.org
> Subject: [PATCH 1/4] vfio/iommu_type1: Multi-IOMMU domain support
> 
> We currently have a problem that we cannot support advanced features of
> an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee that
> those features will be supported by all of the hardware units involved
> with the domain over its lifetime.  For instance, the Intel VT-d
> architecture does not require that all DRHDs support snoop control.  If
> we create a domain based on a device behind a DRHD that does support
> snoop control and enable SNP support via the IOMMU_CACHE mapping option,
> we cannot then add a device behind a DRHD which does not support snoop
> control or we'll get reserved bit faults from the SNP bit in the
> pagetables.  To add to the complexity, we can't know the properties of a
> domain until a device is attached.
> 
> We could pass this problem off to userspace and require that a separate
> vfio container be used, but we don't know how to handle page accounting
> in that case.  How do we know that a page pinned in one container is the
> same page as a different container and avoid double billing the user for
> the page.
> 
> The solution is therefore to support multiple IOMMU domains per
> container.  In the majority of cases, only one domain will be required
> since hardware is typically consistent within a system.  However, this
> provides us the ability to validate compatibility of domains and support
> mixed environments where page table flags can be different between
> domains.
> 
> To do this, our DMA tracking needs to change.  We currently try to
> coalesce user mappings into as few tracking entries as possible.  The
> problem then becomes that we lose granularity of user mappings.  We've
> never guaranteed that a user is able to unmap at a finer granularity than
> the original mapping, but we must honor the granularity of the original
> mapping.  This coalescing code is therefore removed, allowing only unmaps
> covering complete maps.  The change in accounting is fairly small here, a
> typical QEMU VM will start out with roughly a dozen entries, so it's
> arguable if this coalescing was ever needed.
> 
> We also move IOMMU domain creation to the point where a group is attached
> to the container.  An interesting side-effect of this is that we now have
> access to the device at the time of domain creation and can probe the
> devices within the group to determine the bus_type.
> This finally makes vfio_iommu_type1 completely device/bus agnostic.
> In fact, each IOMMU domain can host devices on different buses managed by
> different physical IOMMUs, and present a single DMA mapping interface to
> the user.  When a new domain is created, mappings are replayed to bring
> the IOMMU pagetables up to the state of the current container.  And of
> course, DMA mapping and unmapping automatically traverse all of the
> configured IOMMU domains.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> Cc: Varun Sethi <Varun.Sethi@freescale.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c |  637 +++++++++++++++++++++------------
> ------
>  include/uapi/linux/vfio.h       |    1
>  2 files changed, 336 insertions(+), 302 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c index 4fb7a8f..8c7bb9b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -30,7 +30,6 @@
>  #include <linux/iommu.h>
>  #include <linux/module.h>
>  #include <linux/mm.h>
> -#include <linux/pci.h>		/* pci_bus_type */
>  #include <linux/rbtree.h>
>  #include <linux/sched.h>
>  #include <linux/slab.h>
> @@ -55,11 +54,17 @@ MODULE_PARM_DESC(disable_hugepages,
>  		 "Disable VFIO IOMMU support for IOMMU hugepages.");
> 
>  struct vfio_iommu {
> -	struct iommu_domain	*domain;
> +	struct list_head	domain_list;
>  	struct mutex		lock;
>  	struct rb_root		dma_list;
> +	bool v2;
> +};
> +
> +struct vfio_domain {
> +	struct iommu_domain	*domain;
> +	struct list_head	next;
>  	struct list_head	group_list;
> -	bool			cache;
> +	int			prot;		/* IOMMU_CACHE */
>  };
> 
>  struct vfio_dma {
> @@ -99,7 +104,7 @@ static struct vfio_dma *vfio_find_dma(struct
> vfio_iommu *iommu,
>  	return NULL;
>  }
> 
> -static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma
> *new)
> +static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma
> +*new)
>  {
>  	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
>  	struct vfio_dma *dma;
> @@ -118,7 +123,7 @@ static void vfio_insert_dma(struct vfio_iommu *iommu,
> struct vfio_dma *new)
>  	rb_insert_color(&new->node, &iommu->dma_list);  }
> 
> -static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma
> *old)
> +static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma
> +*old)
>  {
>  	rb_erase(&old->node, &iommu->dma_list);  } @@ -322,32 +327,39 @@
> static long vfio_unpin_pages(unsigned long pfn, long npage,
>  	return unlocked;
>  }
> 
> -static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma
> *dma,
> -			    dma_addr_t iova, size_t *size)
> +static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma
> +*dma)
>  {
> -	dma_addr_t start = iova, end = iova + *size;
> +	dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
> +	struct vfio_domain *domain, *d;
>  	long unlocked = 0;
> 
> +	if (!dma->size)
> +		return;
> +	/*
> +	 * We use the IOMMU to track the physical addresses, otherwise we'd
> +	 * need a much more complicated tracking system.  Unfortunately
> that
> +	 * means we need to use one of the iommu domains to figure out the
> +	 * pfns to unpin.  The rest need to be unmapped in advance so we
> have
> +	 * no iommu translations remaining when the pages are unpinned.
> +	 */
> +	domain = d = list_first_entry(&iommu->domain_list,
> +				      struct vfio_domain, next);
> +
> +	list_for_each_entry_continue(d, &iommu->domain_list, next)
> +		iommu_unmap(d->domain, dma->iova, dma->size);
> +
>  	while (iova < end) {
>  		size_t unmapped;
>  		phys_addr_t phys;
> 
> -		/*
> -		 * We use the IOMMU to track the physical address.  This
> -		 * saves us from having a lot more entries in our mapping
> -		 * tree.  The downside is that we don't track the size
> -		 * used to do the mapping.  We request unmap of a single
> -		 * page, but expect IOMMUs that support large pages to
> -		 * unmap a larger chunk.
> -		 */
> -		phys = iommu_iova_to_phys(iommu->domain, iova);
> +		phys = iommu_iova_to_phys(domain->domain, iova);
>  		if (WARN_ON(!phys)) {
>  			iova += PAGE_SIZE;
>  			continue;
>  		}
> 
> -		unmapped = iommu_unmap(iommu->domain, iova, PAGE_SIZE);
> -		if (!unmapped)
> +		unmapped = iommu_unmap(domain->domain, iova, PAGE_SIZE);
> +		if (WARN_ON(!unmapped))
>  			break;
> 
>  		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT, @@ -357,119
> +369,26 @@ static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct
> vfio_dma *dma,
>  	}
> 
>  	vfio_lock_acct(-unlocked);
> -
> -	*size = iova - start;
> -
> -	return 0;
>  }
> 
> -static int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> start,
> -				   size_t *size, struct vfio_dma *dma)
> +static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma
> +*dma)
>  {
> -	size_t offset, overlap, tmp;
> -	struct vfio_dma *split;
> -	int ret;
> -
> -	if (!*size)
> -		return 0;
> -
> -	/*
> -	 * Existing dma region is completely covered, unmap all.  This is
> -	 * the likely case since userspace tends to map and unmap buffers
> -	 * in one shot rather than multiple mappings within a buffer.
> -	 */
> -	if (likely(start <= dma->iova &&
> -		   start + *size >= dma->iova + dma->size)) {
> -		*size = dma->size;
> -		ret = vfio_unmap_unpin(iommu, dma, dma->iova, size);
> -		if (ret)
> -			return ret;
> -
> -		/*
> -		 * Did we remove more than we have?  Should never happen
> -		 * since a vfio_dma is contiguous in iova and vaddr.
> -		 */
> -		WARN_ON(*size != dma->size);
> -
> -		vfio_remove_dma(iommu, dma);
> -		kfree(dma);
> -		return 0;
> -	}
> -
> -	/* Overlap low address of existing range */
> -	if (start <= dma->iova) {
> -		overlap = start + *size - dma->iova;
> -		ret = vfio_unmap_unpin(iommu, dma, dma->iova, &overlap);
> -		if (ret)
> -			return ret;
> -
> -		vfio_remove_dma(iommu, dma);
> -
> -		/*
> -		 * Check, we may have removed to whole vfio_dma.  If not
> -		 * fixup and re-insert.
> -		 */
> -		if (overlap < dma->size) {
> -			dma->iova += overlap;
> -			dma->vaddr += overlap;
> -			dma->size -= overlap;
> -			vfio_insert_dma(iommu, dma);
> -		} else
> -			kfree(dma);
> -
> -		*size = overlap;
> -		return 0;
> -	}
> -
> -	/* Overlap high address of existing range */
> -	if (start + *size >= dma->iova + dma->size) {
> -		offset = start - dma->iova;
> -		overlap = dma->size - offset;
> -
> -		ret = vfio_unmap_unpin(iommu, dma, start, &overlap);
> -		if (ret)
> -			return ret;
> -
> -		dma->size -= overlap;
> -		*size = overlap;
> -		return 0;
> -	}
> -
> -	/* Split existing */
> -
> -	/*
> -	 * Allocate our tracking structure early even though it may not
> -	 * be used.  An Allocation failure later loses track of pages and
> -	 * is more difficult to unwind.
> -	 */
> -	split = kzalloc(sizeof(*split), GFP_KERNEL);
> -	if (!split)
> -		return -ENOMEM;
> -
> -	offset = start - dma->iova;
> -
> -	ret = vfio_unmap_unpin(iommu, dma, start, size);
> -	if (ret || !*size) {
> -		kfree(split);
> -		return ret;
> -	}
> -
> -	tmp = dma->size;
> +	vfio_unmap_unpin(iommu, dma);
> +	vfio_unlink_dma(iommu, dma);
> +	kfree(dma);
> +}
> 
> -	/* Resize the lower vfio_dma in place, before the below insert */
> -	dma->size = offset;
> +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu) {
> +	struct vfio_domain *domain;
> +	unsigned long bitmap = PAGE_MASK;
> 
> -	/* Insert new for remainder, assuming it didn't all get unmapped */
> -	if (likely(offset + *size < tmp)) {
> -		split->size = tmp - offset - *size;
> -		split->iova = dma->iova + offset + *size;
> -		split->vaddr = dma->vaddr + offset + *size;
> -		split->prot = dma->prot;
> -		vfio_insert_dma(iommu, split);
> -	} else
> -		kfree(split);
> +	mutex_lock(&iommu->lock);
> +	list_for_each_entry(domain, &iommu->domain_list, next)
> +		bitmap &= domain->domain->ops->pgsize_bitmap;
> +	mutex_unlock(&iommu->lock);
> 
> -	return 0;
> +	return bitmap;
>  }
> 
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu, @@ -477,10
> +396,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,  {
>  	uint64_t mask;
>  	struct vfio_dma *dma;
> -	size_t unmapped = 0, size;
> +	size_t unmapped = 0;
>  	int ret = 0;
> 
> -	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) -
> 1;
> +	mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> 
>  	if (unmap->iova & mask)
>  		return -EINVAL;
> @@ -491,20 +410,61 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> *iommu,
> 
>  	mutex_lock(&iommu->lock);
> 
> +	/*
> +	 * vfio-iommu-type1 (v1) - User mappings were coalesced together to
> +	 * avoid tracking individual mappings.  This means that the
> granularity
> +	 * of the original mapping was lost and the user was allowed to
> attempt
> +	 * to unmap any range.  Depending on the contiguousness of physical
> +	 * memory and page sizes supported by the IOMMU, arbitrary unmaps
> may
> +	 * or may not have worked.  We only guaranteed unmap granularity
> +	 * matching the original mapping; even though it was untracked
> here,
> +	 * the original mappings are reflected in IOMMU mappings.  This
> +	 * resulted in a couple unusual behaviors.  First, if a range is
> not
> +	 * able to be unmapped, ex. a set of 4k pages that was mapped as a
> +	 * 2M hugepage into the IOMMU, the unmap ioctl returns success but
> with
> +	 * a zero sized unmap.  Also, if an unmap request overlaps the
> first
> +	 * address of a hugepage, the IOMMU will unmap the entire hugepage.
> +	 * This also returns success and the returned unmap size reflects
> the
> +	 * actual size unmapped.
> +	 *
> +	 * We attempt to maintain compatibility with this "v1" interface,
> but
> +	 * we take control out of the hands of the IOMMU.  Therefore, an
> unmap
> +	 * request offset from the beginning of the original mapping will
> +	 * return success with zero sized unmap.  And an unmap request
> covering
> +	 * the first iova of mapping will unmap the entire range.
> +	 *
> +	 * The v2 version of this interface intends to be more
> deterministic.
> +	 * Unmap requests must fully cover previous mappings.  Multiple
> +	 * mappings may still be unmaped by specifying large ranges, but
> there
> +	 * must not be any previous mappings bisected by the range.  An
> error
> +	 * will be returned if these conditions are not met.  The v2
> interface
> +	 * will only return success and a size of zero if there were no
> +	 * mappings within the range.
> +	 */
> +	if (iommu->v2) {
> +		dma = vfio_find_dma(iommu, unmap->iova, 0);
> +		if (dma && dma->iova != unmap->iova) {
> +			ret = -EINVAL;
> +			goto unlock;
> +		}
> +		dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1, 0);
> +		if (dma && dma->iova + dma->size != unmap->iova + unmap-
> >size) {
> +			ret = -EINVAL;
> +			goto unlock;
> +		}
> +	}
> +
>  	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
> -		size = unmap->size;
> -		ret = vfio_remove_dma_overlap(iommu, unmap->iova, &size,
> dma);
> -		if (ret || !size)
> +		if (!iommu->v2 && unmap->iova > dma->iova)
>  			break;
> -		unmapped += size;
> +		unmapped += dma->size;
> +		vfio_remove_dma(iommu, dma);
>  	}
> 
> +unlock:
>  	mutex_unlock(&iommu->lock);
> 
> -	/*
> -	 * We may unmap more than requested, update the unmap struct so
> -	 * userspace can know.
> -	 */
> +	/* Report how much was unmapped */
>  	unmap->size = unmapped;
> 
>  	return ret;
> @@ -516,22 +476,47 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> *iommu,
>   * soon, so this is just a temporary workaround to break mappings down
> into
>   * PAGE_SIZE.  Better to map smaller pages than nothing.
>   */
> -static int map_try_harder(struct vfio_iommu *iommu, dma_addr_t iova,
> +static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
>  			  unsigned long pfn, long npage, int prot)  {
>  	long i;
>  	int ret;
> 
>  	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
> -		ret = iommu_map(iommu->domain, iova,
> +		ret = iommu_map(domain->domain, iova,
>  				(phys_addr_t)pfn << PAGE_SHIFT,
> -				PAGE_SIZE, prot);
> +				PAGE_SIZE, prot | domain->prot);
>  		if (ret)
>  			break;
>  	}
> 
>  	for (; i < npage && i > 0; i--, iova -= PAGE_SIZE)
> -		iommu_unmap(iommu->domain, iova, PAGE_SIZE);
> +		iommu_unmap(domain->domain, iova, PAGE_SIZE);
> +
> +	return ret;
> +}
> +
> +static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
> +			  unsigned long pfn, long npage, int prot) {
> +	struct vfio_domain *d;
> +	int ret;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn <<
> PAGE_SHIFT,
> +				npage << PAGE_SHIFT, prot | d->prot);
> +		if (ret) {
> +			if (ret != -EBUSY ||
> +			    map_try_harder(d, iova, pfn, npage, prot))
> +				goto unwind;
> +		}
> +	}
> +
> +	return 0;
> +
> +unwind:
> +	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
> +		iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);
> 
>  	return ret;
>  }
> @@ -545,12 +530,12 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	long npage;
>  	int ret = 0, prot = 0;
>  	uint64_t mask;
> -	struct vfio_dma *dma = NULL;
> +	struct vfio_dma *dma;
>  	unsigned long pfn;
> 
>  	end = map->iova + map->size;
> 
> -	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) - 1;
> +	mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> 
>  	/* READ/WRITE from device perspective */
>  	if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
> @@ -561,9 +546,6 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	if (!prot)
>  		return -EINVAL; /* No READ/WRITE? */
> 
> -	if (iommu->cache)
> -		prot |= IOMMU_CACHE;
> -
>  	if (vaddr & mask)
>  		return -EINVAL;
>  	if (map->iova & mask)
> @@ -588,180 +570,257 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  		return -EEXIST;
>  	}
> 
> -	for (iova = map->iova; iova < end; iova += size, vaddr += size) {
> -		long i;
> +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> +	if (!dma) {
> +		mutex_unlock(&iommu->lock);
> +		return -ENOMEM;
> +	}
> +
> +	dma->iova = map->iova;
> +	dma->vaddr = map->vaddr;
> +	dma->prot = prot;
> 
> +	/* Insert zero-sized and grow as we map chunks of it */
> +	vfio_link_dma(iommu, dma);
> +
> +	for (iova = map->iova; iova < end; iova += size, vaddr += size) {
>  		/* Pin a contiguous chunk of memory */
>  		npage = vfio_pin_pages(vaddr, (end - iova) >> PAGE_SHIFT,
>  				       prot, &pfn);
>  		if (npage <= 0) {
>  			WARN_ON(!npage);
>  			ret = (int)npage;
> -			goto out;
> -		}
> -
> -		/* Verify pages are not already mapped */
> -		for (i = 0; i < npage; i++) {
> -			if (iommu_iova_to_phys(iommu->domain,
> -					       iova + (i << PAGE_SHIFT))) {
> -				ret = -EBUSY;
> -				goto out_unpin;
> -			}
> +			break;
>  		}
> 
> -		ret = iommu_map(iommu->domain, iova,
> -				(phys_addr_t)pfn << PAGE_SHIFT,
> -				npage << PAGE_SHIFT, prot);
> +		/* Map it! */
> +		ret = vfio_iommu_map(iommu, iova, pfn, npage, prot);
>  		if (ret) {
> -			if (ret != -EBUSY ||
> -			    map_try_harder(iommu, iova, pfn, npage, prot)) {
> -				goto out_unpin;
> -			}
> +			vfio_unpin_pages(pfn, npage, prot, true);
> +			break;
>  		}
> 
>  		size = npage << PAGE_SHIFT;
> +		dma->size += size;
> +	}
> 
> -		/*
> -		 * Check if we abut a region below - nothing below 0.
> -		 * This is the most likely case when mapping chunks of
> -		 * physically contiguous regions within a virtual address
> -		 * range.  Update the abutting entry in place since iova
> -		 * doesn't change.
> -		 */
> -		if (likely(iova)) {
> -			struct vfio_dma *tmp;
> -			tmp = vfio_find_dma(iommu, iova - 1, 1);
> -			if (tmp && tmp->prot == prot &&
> -			    tmp->vaddr + tmp->size == vaddr) {
> -				tmp->size += size;
> -				iova = tmp->iova;
> -				size = tmp->size;
> -				vaddr = tmp->vaddr;
> -				dma = tmp;
> -			}
> -		}
> +	if (ret)
> +		vfio_remove_dma(iommu, dma);
> 
> -		/*
> -		 * Check if we abut a region above - nothing above ~0 + 1.
> -		 * If we abut above and below, remove and free.  If only
> -		 * abut above, remove, modify, reinsert.
> -		 */
> -		if (likely(iova + size)) {
> -			struct vfio_dma *tmp;
> -			tmp = vfio_find_dma(iommu, iova + size, 1);
> -			if (tmp && tmp->prot == prot &&
> -			    tmp->vaddr == vaddr + size) {
> -				vfio_remove_dma(iommu, tmp);
> -				if (dma) {
> -					dma->size += tmp->size;
> -					kfree(tmp);
> -				} else {
> -					size += tmp->size;
> -					tmp->size = size;
> -					tmp->iova = iova;
> -					tmp->vaddr = vaddr;
> -					vfio_insert_dma(iommu, tmp);
> -					dma = tmp;
> -				}
> -			}
> -		}
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static int vfio_bus_type(struct device *dev, void *data) {
> +	struct bus_type **bus = data;
> +
> +	if (*bus && *bus != dev->bus)
> +		return -EINVAL;
> +
> +	*bus = dev->bus;
> +
> +	return 0;
> +}
> +
> +static int vfio_iommu_replay(struct vfio_iommu *iommu,
> +			     struct vfio_domain *domain)
> +{
> +	struct vfio_domain *d;
> +	struct rb_node *n;
> +	int ret;
> +
> +	/* Arbitrarily pick the first domain in the list for lookups */
> +	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
> +	n = rb_first(&iommu->dma_list);
> +
> +	/* If there's not a domain, there better not be any mappings */
> +	if (WARN_ON(n && !d))
> +		return -EINVAL;
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma;
> +		dma_addr_t iova;
> +
> +		dma = rb_entry(n, struct vfio_dma, node);
> +		iova = dma->iova;
> +
> +		while (iova < dma->iova + dma->size) {
> +			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
> +			size_t size;
> 
> -		if (!dma) {
> -			dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> -			if (!dma) {
> -				iommu_unmap(iommu->domain, iova, size);
> -				ret = -ENOMEM;
> -				goto out_unpin;
> +			if (WARN_ON(!phys)) {
> +				iova += PAGE_SIZE;
> +				continue;
>  			}
> 
> -			dma->size = size;
> -			dma->iova = iova;
> -			dma->vaddr = vaddr;
> -			dma->prot = prot;
> -			vfio_insert_dma(iommu, dma);
> -		}
> -	}
> +			size = PAGE_SIZE;
> 
> -	WARN_ON(ret);
> -	mutex_unlock(&iommu->lock);
> -	return ret;
> +			while (iova + size < dma->iova + dma->size &&
> +			       phys + size == iommu_iova_to_phys(d->domain,
> +								 iova + size))
> +				size += PAGE_SIZE;
> 
> -out_unpin:
> -	vfio_unpin_pages(pfn, npage, prot, true);
> +			ret = iommu_map(domain->domain, iova, phys,
> +					size, dma->prot | domain->prot);
> +			if (ret)
> +				return ret;
> 
> -out:
> -	iova = map->iova;
> -	size = map->size;
> -	while ((dma = vfio_find_dma(iommu, iova, size))) {
> -		int r = vfio_remove_dma_overlap(iommu, iova,
> -						&size, dma);
> -		if (WARN_ON(r || !size))
> -			break;
> +			iova += size;
> +		}
>  	}
> 
> -	mutex_unlock(&iommu->lock);
> -	return ret;
> +	return 0;
>  }
> 
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					 struct iommu_group *iommu_group)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
> -	struct vfio_group *group, *tmp;
> +	struct vfio_group *group, *g;
> +	struct vfio_domain *domain, *d;
> +	struct bus_type *bus = NULL;
>  	int ret;
> 
> -	group = kzalloc(sizeof(*group), GFP_KERNEL);
> -	if (!group)
> -		return -ENOMEM;
> -
>  	mutex_lock(&iommu->lock);
> 
> -	list_for_each_entry(tmp, &iommu->group_list, next) {
> -		if (tmp->iommu_group == iommu_group) {
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		list_for_each_entry(g, &d->group_list, next) {
> +			if (g->iommu_group != iommu_group)
> +				continue;
> +
>  			mutex_unlock(&iommu->lock);
> -			kfree(group);
>  			return -EINVAL;
>  		}
>  	}
> 
> +	group = kzalloc(sizeof(*group), GFP_KERNEL);
> +	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> +	if (!group || !domain) {
> +		ret = -ENOMEM;
> +		goto out_free;
> +	}
> +
> +	group->iommu_group = iommu_group;
> +
> +	/* Determine bus_type in order to allocate a domain */
> +	ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
> +	if (ret)
> +		goto out_free;
> +
> +	domain->domain = iommu_domain_alloc(bus);
> +	if (!domain->domain) {
> +		ret = -EIO;
> +		goto out_free;
> +	}
> +
> +	ret = iommu_attach_group(domain->domain, iommu_group);
> +	if (ret)
> +		goto out_domain;
> +
> +	INIT_LIST_HEAD(&domain->group_list);
> +	list_add(&group->next, &domain->group_list);
> +
> +	if (!allow_unsafe_interrupts &&
> +	    !iommu_domain_has_cap(domain->domain, IOMMU_CAP_INTR_REMAP)) {
> +		pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
> +		       __func__);
> +		ret = -EPERM;
> +		goto out_detach;
> +	}
> +
> +	if (iommu_domain_has_cap(domain->domain, IOMMU_CAP_CACHE_COHERENCY))
> +		domain->prot |= IOMMU_CACHE;
> +
>  	/*
> -	 * TODO: Domain have capabilities that might change as we add
> -	 * groups (see iommu->cache, currently never set).  Check for
> -	 * them and potentially disallow groups to be attached when it
> -	 * would change capabilities (ugh).
> +	 * Try to match an existing compatible domain.  We don't want to
> +	 * preclude an IOMMU driver supporting multiple bus_types and being
> +	 * able to include different bus_types in the same IOMMU domain, so
> +	 * we test whether the domains use the same iommu_ops rather than
> +	 * testing if they're on the same bus_type.
>  	 */
> -	ret = iommu_attach_group(iommu->domain, iommu_group);
> -	if (ret) {
> -		mutex_unlock(&iommu->lock);
> -		kfree(group);
> -		return ret;
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		if (d->domain->ops == domain->domain->ops &&
> +		    d->prot == domain->prot) {
> +			iommu_detach_group(domain->domain, iommu_group);
> +			if (!iommu_attach_group(d->domain, iommu_group)) {
> +				list_add(&group->next, &d->group_list);
> +				iommu_domain_free(domain->domain);
> +				kfree(domain);
> +				mutex_unlock(&iommu->lock);
> +				return 0;
> +			}
> +
> +			ret = iommu_attach_group(domain->domain, iommu_group);
> +			if (ret)
> +				goto out_domain;
> +		}
>  	}
> 
> -	group->iommu_group = iommu_group;
> -	list_add(&group->next, &iommu->group_list);
> +	/* replay mappings on new domains */
> +	ret = vfio_iommu_replay(iommu, domain);
> +	if (ret)
> +		goto out_detach;
> +
> +	list_add(&domain->next, &iommu->domain_list);
> 
>  	mutex_unlock(&iommu->lock);
> 
>  	return 0;
> +
> +out_detach:
> +	iommu_detach_group(domain->domain, iommu_group);
> +out_domain:
> +	iommu_domain_free(domain->domain);
> +out_free:
> +	kfree(domain);
> +	kfree(group);
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu) {
> +	struct rb_node *node;
> +
> +	while ((node = rb_first(&iommu->dma_list)))
> +		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
>  }
> 
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
>  					  struct iommu_group *iommu_group)  {
>  	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain;
>  	struct vfio_group *group;
> 
>  	mutex_lock(&iommu->lock);
> 
> -	list_for_each_entry(group, &iommu->group_list, next) {
> -		if (group->iommu_group == iommu_group) {
> -			iommu_detach_group(iommu->domain, iommu_group);
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		list_for_each_entry(group, &domain->group_list, next) {
> +			if (group->iommu_group != iommu_group)
> +				continue;
> +
> +			iommu_detach_group(domain->domain, iommu_group);
>  			list_del(&group->next);
>  			kfree(group);
> -			break;
> +			/*
> +			 * Group ownership provides privilege, if the group
> +			 * list is empty, the domain goes away.  If it's the
> +			 * last domain, then all the mappings go away too.
> +			 */
> +			if (list_empty(&domain->group_list)) {
> +				if (list_is_singular(&iommu->domain_list))
> +					vfio_iommu_unmap_unpin_all(iommu);
> +				iommu_domain_free(domain->domain);
> +				list_del(&domain->next);
> +				kfree(domain);
> +			}
> +			goto done;
>  		}
>  	}
> 
> +done:
>  	mutex_unlock(&iommu->lock);
>  }
> 
> @@ -769,40 +828,17 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>  {
>  	struct vfio_iommu *iommu;
> 
> -	if (arg != VFIO_TYPE1_IOMMU)
> +	if (arg != VFIO_TYPE1_IOMMU && arg != VFIO_TYPE1v2_IOMMU)
>  		return ERR_PTR(-EINVAL);
> 
>  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
>  	if (!iommu)
>  		return ERR_PTR(-ENOMEM);
> 
> -	INIT_LIST_HEAD(&iommu->group_list);
> +	INIT_LIST_HEAD(&iommu->domain_list);
>  	iommu->dma_list = RB_ROOT;
>  	mutex_init(&iommu->lock);
> -
> -	/*
> -	 * Wish we didn't have to know about bus_type here.
> -	 */
> -	iommu->domain = iommu_domain_alloc(&pci_bus_type);
> -	if (!iommu->domain) {
> -		kfree(iommu);
> -		return ERR_PTR(-EIO);
> -	}
> -
> -	/*
> -	 * Wish we could specify required capabilities rather than create
> -	 * a domain, see what comes out and hope it doesn't change along
> -	 * the way.  Fortunately we know interrupt remapping is global for
> -	 * our iommus.
> -	 */
> -	if (!allow_unsafe_interrupts &&
> -	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> -		pr_warn("%s: No interrupt remapping support.  Use the module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this platform\n",
> -		       __func__);
> -		iommu_domain_free(iommu->domain);
> -		kfree(iommu);
> -		return ERR_PTR(-EPERM);
> -	}
> +	iommu->v2 = (arg == VFIO_TYPE1v2_IOMMU);
> 
>  	return iommu;
>  }
> @@ -810,25 +846,24 @@ static void *vfio_iommu_type1_open(unsigned long
> arg)  static void vfio_iommu_type1_release(void *iommu_data)  {
>  	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain, *domain_tmp;
>  	struct vfio_group *group, *group_tmp;
> -	struct rb_node *node;
> 
> -	list_for_each_entry_safe(group, group_tmp, &iommu->group_list,
> next) {
> -		iommu_detach_group(iommu->domain, group->iommu_group);
> -		list_del(&group->next);
> -		kfree(group);
> -	}
> +	vfio_iommu_unmap_unpin_all(iommu);
> 
> -	while ((node = rb_first(&iommu->dma_list))) {
> -		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> -		size_t size = dma->size;
> -		vfio_remove_dma_overlap(iommu, dma->iova, &size, dma);
> -		if (WARN_ON(!size))
> -			break;
> +	list_for_each_entry_safe(domain, domain_tmp,
> +				 &iommu->domain_list, next) {
> +		list_for_each_entry_safe(group, group_tmp,
> +					 &domain->group_list, next) {
> +			iommu_detach_group(domain->domain, group->iommu_group);
> +			list_del(&group->next);
> +			kfree(group);
> +		}
> +		iommu_domain_free(domain->domain);
> +		list_del(&domain->next);
> +		kfree(domain);
>  	}
> 
> -	iommu_domain_free(iommu->domain);
> -	iommu->domain = NULL;
>  	kfree(iommu);
>  }
> 
> @@ -841,6 +876,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  	if (cmd == VFIO_CHECK_EXTENSION) {
>  		switch (arg) {
>  		case VFIO_TYPE1_IOMMU:
> +		case VFIO_TYPE1v2_IOMMU:
>  			return 1;
>  		default:
>  			return 0;
> @@ -858,7 +894,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> 
>  		info.flags = 0;
> 
> -		info.iova_pgsizes = iommu->domain->ops->pgsize_bitmap;
> +		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
> 
>  		return copy_to_user((void __user *)arg, &info, minsz);
> 
> @@ -911,9 +947,6 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
> 
>  static int __init vfio_iommu_type1_init(void)  {
> -	if (!iommu_present(&pci_bus_type))
> -		return -ENODEV;
> -
>  	return vfio_register_iommu_driver(&vfio_iommu_driver_ops_type1);
>  }
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 0fd47f5..460fdf2 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -23,6 +23,7 @@
> 
>  #define VFIO_TYPE1_IOMMU		1
>  #define VFIO_SPAPR_TCE_IOMMU		2
> +#define VFIO_TYPE1v2_IOMMU		3
> 
>  /*
>   * The IOCTL interface is designed for extensibility by embedding the
> 
> 
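
For my own understanding, from userspace the new type would be selected
and exercised roughly like the sketch below (container and group setup
via VFIO_GROUP_SET_CONTAINER assumed, iova and length arbitrary) -- is
that the expected flow?

#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int try_type1v2(int container, void *buf, size_t len)
{
	struct vfio_iommu_type1_dma_map map = { .argsz = sizeof(map) };
	struct vfio_iommu_type1_dma_unmap unmap = { .argsz = sizeof(unmap) };

	if (ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU) <= 0)
		return -1;	/* fall back to plain VFIO_TYPE1_IOMMU */

	if (ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU))
		return -1;

	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = (uintptr_t)buf;	/* page-aligned user buffer */
	map.iova  = 0x100000;		/* arbitrary example iova */
	map.size  = len;
	if (ioctl(container, VFIO_IOMMU_MAP_DMA, &map))
		return -1;

	/*
	 * v2: the unmap must cover whole mappings; a range that bisects
	 * the mapping above fails with EINVAL instead of silently
	 * unmapping more or less than requested.
	 */
	unmap.iova = 0x100000;
	unmap.size = len;
	return ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);
}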


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [PATCH 1/4] vfio/iommu_type1: Multi-IOMMU domain support
@ 2014-03-18 10:24     ` Varun Sethi
  0 siblings, 0 replies; 10+ messages in thread
From: Varun Sethi @ 2014-03-18 10:24 UTC (permalink / raw)
  To: Alex Williamson, kvm; +Cc: linux-kernel

Hi Alex,
Would it make sense to link the iommu group to its corresponding hardware iommu block's capabilities? This could be done if we can determine the iommu device corresponding to the iommu group during bus probe. With this we won't have to wait until device attach to determine the domain capabilities (or underlying iommu capabilities). In vfio we can simply attach the iommu groups with similar iommu capabilities to the same domain.
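
Something along the lines of the sketch below is what I have in mind.
This is purely hypothetical -- nothing like it exists in the IOMMU API
today:

/*
 * Hypothetical sketch: the IOMMU driver records its hardware unit's
 * capabilities on the group when devices are added at bus probe time,
 * so callers can query them without allocating and attaching a domain.
 */
struct iommu_group_caps {
	bool	cache_coherency;	/* IOMMU_CAP_CACHE_COHERENCY */
	bool	intr_remap;		/* IOMMU_CAP_INTR_REMAP */
};

int iommu_group_get_caps(struct iommu_group *group,
			 struct iommu_group_caps *caps);

vfio could then place iommu groups with matching capabilities into the
same domain at attach time, based on these flags alone.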

Regards
Varun

> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Tuesday, February 18, 2014 1:54 AM
> To: alex.williamson@redhat.com; kvm@vger.kernel.org
> Cc: Sethi Varun-B16395; linux-kernel@vger.kernel.org
> Subject: [PATCH 1/4] vfio/iommu_type1: Multi-IOMMU domain support
> 
> We currently have a problem that we cannot support advanced features of
> an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee that
> those features will be supported by all of the hardware units involved
> with the domain over its lifetime.  For instance, the Intel VT-d
> architecture does not require that all DRHDs support snoop control.  If
> we create a domain based on a device behind a DRHD that does support
> snoop control and enable SNP support via the IOMMU_CACHE mapping option,
> we cannot then add a device behind a DRHD which does not support snoop
> control or we'll get reserved bit faults from the SNP bit in the
> pagetables.  To add to the complexity, we can't know the properties of a
> domain until a device is attached.
> 
> We could pass this problem off to userspace and require that a separate
> vfio container be used, but we don't know how to handle page accounting
> in that case.  How do we know that a page pinned in one container is the
> same page as a different container and avoid double billing the user for
> the page.
> 
> The solution is therefore to support multiple IOMMU domains per
> container.  In the majority of cases, only one domain will be required
> since hardware is typically consistent within a system.  However, this
> provides us the ability to validate compatibility of domains and support
> mixed environments where page table flags can be different between
> domains.
> 
> To do this, our DMA tracking needs to change.  We currently try to
> coalesce user mappings into as few tracking entries as possible.  The
> problem then becomes that we lose granularity of user mappings.  We've
> never guaranteed that a user is able to unmap at a finer granularity than
> the original mapping, but we must honor the granularity of the original
> mapping.  This coalescing code is therefore removed, allowing only unmaps
> covering complete maps.  The change in accounting is fairly small here, a
> typical QEMU VM will start out with roughly a dozen entries, so it's
> arguable if this coalescing was ever needed.
> 
> We also move IOMMU domain creation to the point where a group is attached
> to the container.  An interesting side-effect of this is that we now have
> access to the device at the time of domain creation and can probe the
> devices within the group to determine the bus_type.
> This finally makes vfio_iommu_type1 completely device/bus agnostic.
> In fact, each IOMMU domain can host devices on different buses managed by
> different physical IOMMUs, and present a single DMA mapping interface to
> the user.  When a new domain is created, mappings are replayed to bring
> the IOMMU pagetables up to the state of the current container.  And of
> course, DMA mapping and unmapping automatically traverse all of the
> configured IOMMU domains.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> Cc: Varun Sethi <Varun.Sethi@freescale.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c |  637 +++++++++++++++++++++------------
> ------
>  include/uapi/linux/vfio.h       |    1
>  2 files changed, 336 insertions(+), 302 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c
> b/drivers/vfio/vfio_iommu_type1.c index 4fb7a8f..8c7bb9b 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -30,7 +30,6 @@
>  #include <linux/iommu.h>
>  #include <linux/module.h>
>  #include <linux/mm.h>
> -#include <linux/pci.h>		/* pci_bus_type */
>  #include <linux/rbtree.h>
>  #include <linux/sched.h>
>  #include <linux/slab.h>
> @@ -55,11 +54,17 @@ MODULE_PARM_DESC(disable_hugepages,
>  		 "Disable VFIO IOMMU support for IOMMU hugepages.");
> 
>  struct vfio_iommu {
> -	struct iommu_domain	*domain;
> +	struct list_head	domain_list;
>  	struct mutex		lock;
>  	struct rb_root		dma_list;
> +	bool v2;
> +};
> +
> +struct vfio_domain {
> +	struct iommu_domain	*domain;
> +	struct list_head	next;
>  	struct list_head	group_list;
> -	bool			cache;
> +	int			prot;		/* IOMMU_CACHE */
>  };
> 
>  struct vfio_dma {
> @@ -99,7 +104,7 @@ static struct vfio_dma *vfio_find_dma(struct
> vfio_iommu *iommu,
>  	return NULL;
>  }
> 
> -static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma
> *new)
> +static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma
> +*new)
>  {
>  	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
>  	struct vfio_dma *dma;
> @@ -118,7 +123,7 @@ static void vfio_insert_dma(struct vfio_iommu *iommu,
> struct vfio_dma *new)
>  	rb_insert_color(&new->node, &iommu->dma_list);  }
> 
> -static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma
> *old)
> +static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma
> +*old)
>  {
>  	rb_erase(&old->node, &iommu->dma_list);  } @@ -322,32 +327,39 @@
> static long vfio_unpin_pages(unsigned long pfn, long npage,
>  	return unlocked;
>  }
> 
> -static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma
> *dma,
> -			    dma_addr_t iova, size_t *size)
> +static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma
> +*dma)
>  {
> -	dma_addr_t start = iova, end = iova + *size;
> +	dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
> +	struct vfio_domain *domain, *d;
>  	long unlocked = 0;
> 
> +	if (!dma->size)
> +		return;
> +	/*
> +	 * We use the IOMMU to track the physical addresses, otherwise we'd
> +	 * need a much more complicated tracking system.  Unfortunately
> that
> +	 * means we need to use one of the iommu domains to figure out the
> +	 * pfns to unpin.  The rest need to be unmapped in advance so we
> have
> +	 * no iommu translations remaining when the pages are unpinned.
> +	 */
> +	domain = d = list_first_entry(&iommu->domain_list,
> +				      struct vfio_domain, next);
> +
> +	list_for_each_entry_continue(d, &iommu->domain_list, next)
> +		iommu_unmap(d->domain, dma->iova, dma->size);
> +
>  	while (iova < end) {
>  		size_t unmapped;
>  		phys_addr_t phys;
> 
> -		/*
> -		 * We use the IOMMU to track the physical address.  This
> -		 * saves us from having a lot more entries in our mapping
> -		 * tree.  The downside is that we don't track the size
> -		 * used to do the mapping.  We request unmap of a single
> -		 * page, but expect IOMMUs that support large pages to
> -		 * unmap a larger chunk.
> -		 */
> -		phys = iommu_iova_to_phys(iommu->domain, iova);
> +		phys = iommu_iova_to_phys(domain->domain, iova);
>  		if (WARN_ON(!phys)) {
>  			iova += PAGE_SIZE;
>  			continue;
>  		}
> 
> -		unmapped = iommu_unmap(iommu->domain, iova, PAGE_SIZE);
> -		if (!unmapped)
> +		unmapped = iommu_unmap(domain->domain, iova, PAGE_SIZE);
> +		if (WARN_ON(!unmapped))
>  			break;
> 
>  		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT, @@ -357,119
> +369,26 @@ static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct
> vfio_dma *dma,
>  	}
> 
>  	vfio_lock_acct(-unlocked);
> -
> -	*size = iova - start;
> -
> -	return 0;
>  }
> 
> -static int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> start,
> -				   size_t *size, struct vfio_dma *dma)
> +static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma
> +*dma)
>  {
> -	size_t offset, overlap, tmp;
> -	struct vfio_dma *split;
> -	int ret;
> -
> -	if (!*size)
> -		return 0;
> -
> -	/*
> -	 * Existing dma region is completely covered, unmap all.  This is
> -	 * the likely case since userspace tends to map and unmap buffers
> -	 * in one shot rather than multiple mappings within a buffer.
> -	 */
> -	if (likely(start <= dma->iova &&
> -		   start + *size >= dma->iova + dma->size)) {
> -		*size = dma->size;
> -		ret = vfio_unmap_unpin(iommu, dma, dma->iova, size);
> -		if (ret)
> -			return ret;
> -
> -		/*
> -		 * Did we remove more than we have?  Should never happen
> -		 * since a vfio_dma is contiguous in iova and vaddr.
> -		 */
> -		WARN_ON(*size != dma->size);
> -
> -		vfio_remove_dma(iommu, dma);
> -		kfree(dma);
> -		return 0;
> -	}
> -
> -	/* Overlap low address of existing range */
> -	if (start <= dma->iova) {
> -		overlap = start + *size - dma->iova;
> -		ret = vfio_unmap_unpin(iommu, dma, dma->iova, &overlap);
> -		if (ret)
> -			return ret;
> -
> -		vfio_remove_dma(iommu, dma);
> -
> -		/*
> -		 * Check, we may have removed to whole vfio_dma.  If not
> -		 * fixup and re-insert.
> -		 */
> -		if (overlap < dma->size) {
> -			dma->iova += overlap;
> -			dma->vaddr += overlap;
> -			dma->size -= overlap;
> -			vfio_insert_dma(iommu, dma);
> -		} else
> -			kfree(dma);
> -
> -		*size = overlap;
> -		return 0;
> -	}
> -
> -	/* Overlap high address of existing range */
> -	if (start + *size >= dma->iova + dma->size) {
> -		offset = start - dma->iova;
> -		overlap = dma->size - offset;
> -
> -		ret = vfio_unmap_unpin(iommu, dma, start, &overlap);
> -		if (ret)
> -			return ret;
> -
> -		dma->size -= overlap;
> -		*size = overlap;
> -		return 0;
> -	}
> -
> -	/* Split existing */
> -
> -	/*
> -	 * Allocate our tracking structure early even though it may not
> -	 * be used.  An Allocation failure later loses track of pages and
> -	 * is more difficult to unwind.
> -	 */
> -	split = kzalloc(sizeof(*split), GFP_KERNEL);
> -	if (!split)
> -		return -ENOMEM;
> -
> -	offset = start - dma->iova;
> -
> -	ret = vfio_unmap_unpin(iommu, dma, start, size);
> -	if (ret || !*size) {
> -		kfree(split);
> -		return ret;
> -	}
> -
> -	tmp = dma->size;
> +	vfio_unmap_unpin(iommu, dma);
> +	vfio_unlink_dma(iommu, dma);
> +	kfree(dma);
> +}
> 
> -	/* Resize the lower vfio_dma in place, before the below insert */
> -	dma->size = offset;
> +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu) {
> +	struct vfio_domain *domain;
> +	unsigned long bitmap = PAGE_MASK;
> 
> -	/* Insert new for remainder, assuming it didn't all get unmapped */
> -	if (likely(offset + *size < tmp)) {
> -		split->size = tmp - offset - *size;
> -		split->iova = dma->iova + offset + *size;
> -		split->vaddr = dma->vaddr + offset + *size;
> -		split->prot = dma->prot;
> -		vfio_insert_dma(iommu, split);
> -	} else
> -		kfree(split);
> +	mutex_lock(&iommu->lock);
> +	list_for_each_entry(domain, &iommu->domain_list, next)
> +		bitmap &= domain->domain->ops->pgsize_bitmap;
> +	mutex_unlock(&iommu->lock);
> 
> -	return 0;
> +	return bitmap;
>  }
> 
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu, @@ -477,10
> +396,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,  {
>  	uint64_t mask;
>  	struct vfio_dma *dma;
> -	size_t unmapped = 0, size;
> +	size_t unmapped = 0;
>  	int ret = 0;
> 
> -	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) -
> 1;
> +	mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> 
>  	if (unmap->iova & mask)
>  		return -EINVAL;
> @@ -491,20 +410,61 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> *iommu,
> 
>  	mutex_lock(&iommu->lock);
> 
> +	/*
> +	 * vfio-iommu-type1 (v1) - User mappings were coalesced together to
> +	 * avoid tracking individual mappings.  This means that the granularity
> +	 * of the original mapping was lost and the user was allowed to attempt
> +	 * to unmap any range.  Depending on the contiguousness of physical
> +	 * memory and page sizes supported by the IOMMU, arbitrary unmaps may
> +	 * or may not have worked.  We only guaranteed unmap granularity
> +	 * matching the original mapping; even though it was untracked here,
> +	 * the original mappings are reflected in IOMMU mappings.  This
> +	 * resulted in a couple unusual behaviors.  First, if a range is not
> +	 * able to be unmapped, ex. a set of 4k pages that was mapped as a
> +	 * 2M hugepage into the IOMMU, the unmap ioctl returns success but with
> +	 * a zero sized unmap.  Also, if an unmap request overlaps the first
> +	 * address of a hugepage, the IOMMU will unmap the entire hugepage.
> +	 * This also returns success and the returned unmap size reflects the
> +	 * actual size unmapped.
> +	 *
> +	 * We attempt to maintain compatibility with this "v1" interface, but
> +	 * we take control out of the hands of the IOMMU.  Therefore, an unmap
> +	 * request offset from the beginning of the original mapping will
> +	 * return success with zero sized unmap.  And an unmap request covering
> +	 * the first iova of mapping will unmap the entire range.
> +	 *
> +	 * The v2 version of this interface intends to be more deterministic.
> +	 * Unmap requests must fully cover previous mappings.  Multiple
> +	 * mappings may still be unmaped by specifying large ranges, but there
> +	 * must not be any previous mappings bisected by the range.  An error
> +	 * will be returned if these conditions are not met.  The v2 interface
> +	 * will only return success and a size of zero if there were no
> +	 * mappings within the range.
> +	 */
> +	if (iommu->v2) {
> +		dma = vfio_find_dma(iommu, unmap->iova, 0);
> +		if (dma && dma->iova != unmap->iova) {
> +			ret = -EINVAL;
> +			goto unlock;
> +		}
> +		dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1, 0);
> +		if (dma && dma->iova + dma->size != unmap->iova + unmap-
> >size) {
> +			ret = -EINVAL;
> +			goto unlock;
> +		}
> +	}
> +
>  	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
> -		size = unmap->size;
> -		ret = vfio_remove_dma_overlap(iommu, unmap->iova, &size,
> dma);
> -		if (ret || !size)
> +		if (!iommu->v2 && unmap->iova > dma->iova)
>  			break;
> -		unmapped += size;
> +		unmapped += dma->size;
> +		vfio_remove_dma(iommu, dma);
>  	}
> 
> +unlock:
>  	mutex_unlock(&iommu->lock);
> 
> -	/*
> -	 * We may unmap more than requested, update the unmap struct so
> -	 * userspace can know.
> -	 */
> +	/* Report how much was unmapped */
>  	unmap->size = unmapped;
> 
>  	return ret;
> @@ -516,22 +476,47 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> *iommu,
>   * soon, so this is just a temporary workaround to break mappings down
> into
>   * PAGE_SIZE.  Better to map smaller pages than nothing.
>   */
> -static int map_try_harder(struct vfio_iommu *iommu, dma_addr_t iova,
> +static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
>  			  unsigned long pfn, long npage, int prot)  {
>  	long i;
>  	int ret;
> 
>  	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
> -		ret = iommu_map(iommu->domain, iova,
> +		ret = iommu_map(domain->domain, iova,
>  				(phys_addr_t)pfn << PAGE_SHIFT,
> -				PAGE_SIZE, prot);
> +				PAGE_SIZE, prot | domain->prot);
>  		if (ret)
>  			break;
>  	}
> 
>  	for (; i < npage && i > 0; i--, iova -= PAGE_SIZE)
> -		iommu_unmap(iommu->domain, iova, PAGE_SIZE);
> +		iommu_unmap(domain->domain, iova, PAGE_SIZE);
> +
> +	return ret;
> +}
> +
> +static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
> +			  unsigned long pfn, long npage, int prot) {
> +	struct vfio_domain *d;
> +	int ret;
> +
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn <<
> PAGE_SHIFT,
> +				npage << PAGE_SHIFT, prot | d->prot);
> +		if (ret) {
> +			if (ret != -EBUSY ||
> +			    map_try_harder(d, iova, pfn, npage, prot))
> +				goto unwind;
> +		}
> +	}
> +
> +	return 0;
> +
> +unwind:
> +	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
> +		iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);
> 
>  	return ret;
>  }
> @@ -545,12 +530,12 @@ static int vfio_dma_do_map(struct vfio_iommu
> *iommu,
>  	long npage;
>  	int ret = 0, prot = 0;
>  	uint64_t mask;
> -	struct vfio_dma *dma = NULL;
> +	struct vfio_dma *dma;
>  	unsigned long pfn;
> 
>  	end = map->iova + map->size;
> 
> -	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) -
> 1;
> +	mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> 
>  	/* READ/WRITE from device perspective */
>  	if (map->flags & VFIO_DMA_MAP_FLAG_WRITE) @@ -561,9 +546,6 @@
> static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	if (!prot)
>  		return -EINVAL; /* No READ/WRITE? */
> 
> -	if (iommu->cache)
> -		prot |= IOMMU_CACHE;
> -
>  	if (vaddr & mask)
>  		return -EINVAL;
>  	if (map->iova & mask)
> @@ -588,180 +570,257 @@ static int vfio_dma_do_map(struct vfio_iommu
> *iommu,
>  		return -EEXIST;
>  	}
> 
> -	for (iova = map->iova; iova < end; iova += size, vaddr += size) {
> -		long i;
> +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> +	if (!dma) {
> +		mutex_unlock(&iommu->lock);
> +		return -ENOMEM;
> +	}
> +
> +	dma->iova = map->iova;
> +	dma->vaddr = map->vaddr;
> +	dma->prot = prot;
> 
> +	/* Insert zero-sized and grow as we map chunks of it */
> +	vfio_link_dma(iommu, dma);
> +
> +	for (iova = map->iova; iova < end; iova += size, vaddr += size) {
>  		/* Pin a contiguous chunk of memory */
>  		npage = vfio_pin_pages(vaddr, (end - iova) >> PAGE_SHIFT,
>  				       prot, &pfn);
>  		if (npage <= 0) {
>  			WARN_ON(!npage);
>  			ret = (int)npage;
> -			goto out;
> -		}
> -
> -		/* Verify pages are not already mapped */
> -		for (i = 0; i < npage; i++) {
> -			if (iommu_iova_to_phys(iommu->domain,
> -					       iova + (i << PAGE_SHIFT))) {
> -				ret = -EBUSY;
> -				goto out_unpin;
> -			}
> +			break;
>  		}
> 
> -		ret = iommu_map(iommu->domain, iova,
> -				(phys_addr_t)pfn << PAGE_SHIFT,
> -				npage << PAGE_SHIFT, prot);
> +		/* Map it! */
> +		ret = vfio_iommu_map(iommu, iova, pfn, npage, prot);
>  		if (ret) {
> -			if (ret != -EBUSY ||
> -			    map_try_harder(iommu, iova, pfn, npage, prot)) {
> -				goto out_unpin;
> -			}
> +			vfio_unpin_pages(pfn, npage, prot, true);
> +			break;
>  		}
> 
>  		size = npage << PAGE_SHIFT;
> +		dma->size += size;
> +	}
> 
> -		/*
> -		 * Check if we abut a region below - nothing below 0.
> -		 * This is the most likely case when mapping chunks of
> -		 * physically contiguous regions within a virtual address
> -		 * range.  Update the abutting entry in place since iova
> -		 * doesn't change.
> -		 */
> -		if (likely(iova)) {
> -			struct vfio_dma *tmp;
> -			tmp = vfio_find_dma(iommu, iova - 1, 1);
> -			if (tmp && tmp->prot == prot &&
> -			    tmp->vaddr + tmp->size == vaddr) {
> -				tmp->size += size;
> -				iova = tmp->iova;
> -				size = tmp->size;
> -				vaddr = tmp->vaddr;
> -				dma = tmp;
> -			}
> -		}
> +	if (ret)
> +		vfio_remove_dma(iommu, dma);
> 
> -		/*
> -		 * Check if we abut a region above - nothing above ~0 + 1.
> -		 * If we abut above and below, remove and free.  If only
> -		 * abut above, remove, modify, reinsert.
> -		 */
> -		if (likely(iova + size)) {
> -			struct vfio_dma *tmp;
> -			tmp = vfio_find_dma(iommu, iova + size, 1);
> -			if (tmp && tmp->prot == prot &&
> -			    tmp->vaddr == vaddr + size) {
> -				vfio_remove_dma(iommu, tmp);
> -				if (dma) {
> -					dma->size += tmp->size;
> -					kfree(tmp);
> -				} else {
> -					size += tmp->size;
> -					tmp->size = size;
> -					tmp->iova = iova;
> -					tmp->vaddr = vaddr;
> -					vfio_insert_dma(iommu, tmp);
> -					dma = tmp;
> -				}
> -			}
> -		}
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static int vfio_bus_type(struct device *dev, void *data) {
> +	struct bus_type **bus = data;
> +
> +	if (*bus && *bus != dev->bus)
> +		return -EINVAL;
> +
> +	*bus = dev->bus;
> +
> +	return 0;
> +}
> +
> +static int vfio_iommu_replay(struct vfio_iommu *iommu,
> +			     struct vfio_domain *domain)
> +{
> +	struct vfio_domain *d;
> +	struct rb_node *n;
> +	int ret;
> +
> +	/* Arbitrarily pick the first domain in the list for lookups */
> +	d = list_first_entry(&iommu->domain_list, struct vfio_domain,
> next);
> +	n = rb_first(&iommu->dma_list);
> +
> +	/* If there's not a domain, there better not be any mappings */
> +	if (WARN_ON(n && !d))
> +		return -EINVAL;
> +
> +	for (; n; n = rb_next(n)) {
> +		struct vfio_dma *dma;
> +		dma_addr_t iova;
> +
> +		dma = rb_entry(n, struct vfio_dma, node);
> +		iova = dma->iova;
> +
> +		while (iova < dma->iova + dma->size) {
> +			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
> +			size_t size;
> 
> -		if (!dma) {
> -			dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> -			if (!dma) {
> -				iommu_unmap(iommu->domain, iova, size);
> -				ret = -ENOMEM;
> -				goto out_unpin;
> +			if (WARN_ON(!phys)) {
> +				iova += PAGE_SIZE;
> +				continue;
>  			}
> 
> -			dma->size = size;
> -			dma->iova = iova;
> -			dma->vaddr = vaddr;
> -			dma->prot = prot;
> -			vfio_insert_dma(iommu, dma);
> -		}
> -	}
> +			size = PAGE_SIZE;
> 
> -	WARN_ON(ret);
> -	mutex_unlock(&iommu->lock);
> -	return ret;
> +			while (iova + size < dma->iova + dma->size &&
> +			       phys + size == iommu_iova_to_phys(d->domain,
> +								 iova + size))
> +				size += PAGE_SIZE;
> 
> -out_unpin:
> -	vfio_unpin_pages(pfn, npage, prot, true);
> +			ret = iommu_map(domain->domain, iova, phys,
> +					size, dma->prot | domain->prot);
> +			if (ret)
> +				return ret;
> 
> -out:
> -	iova = map->iova;
> -	size = map->size;
> -	while ((dma = vfio_find_dma(iommu, iova, size))) {
> -		int r = vfio_remove_dma_overlap(iommu, iova,
> -						&size, dma);
> -		if (WARN_ON(r || !size))
> -			break;
> +			iova += size;
> +		}
>  	}
> 
> -	mutex_unlock(&iommu->lock);
> -	return ret;
> +	return 0;
>  }
> 
>  static int vfio_iommu_type1_attach_group(void *iommu_data,
>  					 struct iommu_group *iommu_group)
>  {
>  	struct vfio_iommu *iommu = iommu_data;
> -	struct vfio_group *group, *tmp;
> +	struct vfio_group *group, *g;
> +	struct vfio_domain *domain, *d;
> +	struct bus_type *bus = NULL;
>  	int ret;
> 
> -	group = kzalloc(sizeof(*group), GFP_KERNEL);
> -	if (!group)
> -		return -ENOMEM;
> -
>  	mutex_lock(&iommu->lock);
> 
> -	list_for_each_entry(tmp, &iommu->group_list, next) {
> -		if (tmp->iommu_group == iommu_group) {
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		list_for_each_entry(g, &d->group_list, next) {
> +			if (g->iommu_group != iommu_group)
> +				continue;
> +
>  			mutex_unlock(&iommu->lock);
> -			kfree(group);
>  			return -EINVAL;
>  		}
>  	}
> 
> +	group = kzalloc(sizeof(*group), GFP_KERNEL);
> +	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> +	if (!group || !domain) {
> +		ret = -ENOMEM;
> +		goto out_free;
> +	}
> +
> +	group->iommu_group = iommu_group;
> +
> +	/* Determine bus_type in order to allocate a domain */
> +	ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
> +	if (ret)
> +		goto out_free;
> +
> +	domain->domain = iommu_domain_alloc(bus);
> +	if (!domain->domain) {
> +		ret = -EIO;
> +		goto out_free;
> +	}
> +
> +	ret = iommu_attach_group(domain->domain, iommu_group);
> +	if (ret)
> +		goto out_domain;
> +
> +	INIT_LIST_HEAD(&domain->group_list);
> +	list_add(&group->next, &domain->group_list);
> +
> +	if (!allow_unsafe_interrupts &&
> +	    !iommu_domain_has_cap(domain->domain, IOMMU_CAP_INTR_REMAP)) {
> +		pr_warn("%s: No interrupt remapping support.  Use the module
> param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this
> platform\n",
> +		       __func__);
> +		ret = -EPERM;
> +		goto out_detach;
> +	}
> +
> +	if (iommu_domain_has_cap(domain->domain,
> IOMMU_CAP_CACHE_COHERENCY))
> +		domain->prot |= IOMMU_CACHE;
> +
>  	/*
> -	 * TODO: Domain have capabilities that might change as we add
> -	 * groups (see iommu->cache, currently never set).  Check for
> -	 * them and potentially disallow groups to be attached when it
> -	 * would change capabilities (ugh).
> +	 * Try to match an existing compatible domain.  We don't want to
> +	 * preclude an IOMMU driver supporting multiple bus_types and being
> +	 * able to include different bus_types in the same IOMMU domain, so
> +	 * we test whether the domains use the same iommu_ops rather than
> +	 * testing if they're on the same bus_type.
>  	 */
> -	ret = iommu_attach_group(iommu->domain, iommu_group);
> -	if (ret) {
> -		mutex_unlock(&iommu->lock);
> -		kfree(group);
> -		return ret;
> +	list_for_each_entry(d, &iommu->domain_list, next) {
> +		if (d->domain->ops == domain->domain->ops &&
> +		    d->prot == domain->prot) {
> +			iommu_detach_group(domain->domain, iommu_group);
> +			if (!iommu_attach_group(d->domain, iommu_group)) {
> +				list_add(&group->next, &d->group_list);
> +				iommu_domain_free(domain->domain);
> +				kfree(domain);
> +				mutex_unlock(&iommu->lock);
> +				return 0;
> +			}
> +
> +			ret = iommu_attach_group(domain->domain, iommu_group);
> +			if (ret)
> +				goto out_domain;
> +		}
>  	}
> 
> -	group->iommu_group = iommu_group;
> -	list_add(&group->next, &iommu->group_list);
> +	/* replay mappings on new domains */
> +	ret = vfio_iommu_replay(iommu, domain);
> +	if (ret)
> +		goto out_detach;
> +
> +	list_add(&domain->next, &iommu->domain_list);
> 
>  	mutex_unlock(&iommu->lock);
> 
>  	return 0;
> +
> +out_detach:
> +	iommu_detach_group(domain->domain, iommu_group);
> +out_domain:
> +	iommu_domain_free(domain->domain);
> +out_free:
> +	kfree(domain);
> +	kfree(group);
> +	mutex_unlock(&iommu->lock);
> +	return ret;
> +}
> +
> +static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu) {
> +	struct rb_node *node;
> +
> +	while ((node = rb_first(&iommu->dma_list)))
> +		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma,
> node));
>  }
> 
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
>  					  struct iommu_group *iommu_group)  {
>  	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain;
>  	struct vfio_group *group;
> 
>  	mutex_lock(&iommu->lock);
> 
> -	list_for_each_entry(group, &iommu->group_list, next) {
> -		if (group->iommu_group == iommu_group) {
> -			iommu_detach_group(iommu->domain, iommu_group);
> +	list_for_each_entry(domain, &iommu->domain_list, next) {
> +		list_for_each_entry(group, &domain->group_list, next) {
> +			if (group->iommu_group != iommu_group)
> +				continue;
> +
> +			iommu_detach_group(domain->domain, iommu_group);
>  			list_del(&group->next);
>  			kfree(group);
> -			break;
> +			/*
> +			 * Group ownership provides privilege, if the group
> +			 * list is empty, the domain goes away.  If it's the
> +			 * last domain, then all the mappings go away too.
> +			 */
> +			if (list_empty(&domain->group_list)) {
> +				if (list_is_singular(&iommu->domain_list))
> +					vfio_iommu_unmap_unpin_all(iommu);
> +				iommu_domain_free(domain->domain);
> +				list_del(&domain->next);
> +				kfree(domain);
> +			}
> +			goto done;
>  		}
>  	}
> 
> +done:
>  	mutex_unlock(&iommu->lock);
>  }
> 
> @@ -769,40 +828,17 @@ static void *vfio_iommu_type1_open(unsigned long
> arg)  {
>  	struct vfio_iommu *iommu;
> 
> -	if (arg != VFIO_TYPE1_IOMMU)
> +	if (arg != VFIO_TYPE1_IOMMU && arg != VFIO_TYPE1v2_IOMMU)
>  		return ERR_PTR(-EINVAL);
> 
>  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
>  	if (!iommu)
>  		return ERR_PTR(-ENOMEM);
> 
> -	INIT_LIST_HEAD(&iommu->group_list);
> +	INIT_LIST_HEAD(&iommu->domain_list);
>  	iommu->dma_list = RB_ROOT;
>  	mutex_init(&iommu->lock);
> -
> -	/*
> -	 * Wish we didn't have to know about bus_type here.
> -	 */
> -	iommu->domain = iommu_domain_alloc(&pci_bus_type);
> -	if (!iommu->domain) {
> -		kfree(iommu);
> -		return ERR_PTR(-EIO);
> -	}
> -
> -	/*
> -	 * Wish we could specify required capabilities rather than create
> -	 * a domain, see what comes out and hope it doesn't change along
> -	 * the way.  Fortunately we know interrupt remapping is global for
> -	 * our iommus.
> -	 */
> -	if (!allow_unsafe_interrupts &&
> -	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> -		pr_warn("%s: No interrupt remapping support.  Use the module
> param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this
> platform\n",
> -		       __func__);
> -		iommu_domain_free(iommu->domain);
> -		kfree(iommu);
> -		return ERR_PTR(-EPERM);
> -	}
> +	iommu->v2 = (arg == VFIO_TYPE1v2_IOMMU);
> 
>  	return iommu;
>  }
> @@ -810,25 +846,24 @@ static void *vfio_iommu_type1_open(unsigned long
> arg)  static void vfio_iommu_type1_release(void *iommu_data)  {
>  	struct vfio_iommu *iommu = iommu_data;
> +	struct vfio_domain *domain, *domain_tmp;
>  	struct vfio_group *group, *group_tmp;
> -	struct rb_node *node;
> 
> -	list_for_each_entry_safe(group, group_tmp, &iommu->group_list,
> next) {
> -		iommu_detach_group(iommu->domain, group->iommu_group);
> -		list_del(&group->next);
> -		kfree(group);
> -	}
> +	vfio_iommu_unmap_unpin_all(iommu);
> 
> -	while ((node = rb_first(&iommu->dma_list))) {
> -		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> -		size_t size = dma->size;
> -		vfio_remove_dma_overlap(iommu, dma->iova, &size, dma);
> -		if (WARN_ON(!size))
> -			break;
> +	list_for_each_entry_safe(domain, domain_tmp,
> +				 &iommu->domain_list, next) {
> +		list_for_each_entry_safe(group, group_tmp,
> +					 &domain->group_list, next) {
> +			iommu_detach_group(domain->domain, group->iommu_group);
> +			list_del(&group->next);
> +			kfree(group);
> +		}
> +		iommu_domain_free(domain->domain);
> +		list_del(&domain->next);
> +		kfree(domain);
>  	}
> 
> -	iommu_domain_free(iommu->domain);
> -	iommu->domain = NULL;
>  	kfree(iommu);
>  }
> 
> @@ -841,6 +876,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  	if (cmd == VFIO_CHECK_EXTENSION) {
>  		switch (arg) {
>  		case VFIO_TYPE1_IOMMU:
> +		case VFIO_TYPE1v2_IOMMU:
>  			return 1;
>  		default:
>  			return 0;
> @@ -858,7 +894,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> 
>  		info.flags = 0;
> 
> -		info.iova_pgsizes = iommu->domain->ops->pgsize_bitmap;
> +		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
> 
>  		return copy_to_user((void __user *)arg, &info, minsz);
> 
> @@ -911,9 +947,6 @@ static const struct vfio_iommu_driver_ops
> vfio_iommu_driver_ops_type1 = {
> 
>  static int __init vfio_iommu_type1_init(void)  {
> -	if (!iommu_present(&pci_bus_type))
> -		return -ENODEV;
> -
>  	return vfio_register_iommu_driver(&vfio_iommu_driver_ops_type1);
>  }
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index
> 0fd47f5..460fdf2 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -23,6 +23,7 @@
> 
>  #define VFIO_TYPE1_IOMMU		1
>  #define VFIO_SPAPR_TCE_IOMMU		2
> +#define VFIO_TYPE1v2_IOMMU		3
> 
>  /*
>   * The IOCTL interface is designed for extensibility by embedding the
> 
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/4] vfio/iommu_type1: Multi-IOMMU domain support
  2014-03-18 10:24     ` Varun Sethi
  (?)
@ 2014-03-18 14:16     ` Alex Williamson
  -1 siblings, 0 replies; 10+ messages in thread
From: Alex Williamson @ 2014-03-18 14:16 UTC (permalink / raw)
  To: Varun Sethi; +Cc: kvm, linux-kernel

On Tue, 2014-03-18 at 10:24 +0000, Varun Sethi wrote:
> Hi Alex,
> Would it make sense to link the iommu group to its corresponding
> hardware iommu block's capabilities? This could be done if we can
> determine the iommu device corresponding to the iommu group during bus
> probe. With this we won't have to wait until device attach to determine
> the domain capabilities (or underlying iommu capabilities). In vfio we
> can simply attach the iommu groups with similar iommu capabilities to
> the same domain.

The IOMMU API doesn't provide us with a way to get the capabilities
without creating a domain and attaching all the devices in the group.
That means that during device scan, for each device added to a group,
we'd need to create a domain, attach the group, and record the IOMMU
capabilities.  That sounds rather intrusive for a startup operation.
There's also the problem of hotplug: a device may be hot-added to the
system and join an existing IOMMU group whose other devices are in use.
We'd have no way to attach the devices necessary to probe the
capabilities without disrupting the system.
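
Roughly, with the current API, the per-group probe would have to look
something like the sketch below (error handling trimmed), and the
iommu_attach_group() step is exactly the part we can't do to a group
that's already in use:

#include <linux/iommu.h>

/* Sketch only: probe a group's capabilities via a throwaway domain */
static int probe_group_caps(struct bus_type *bus, struct iommu_group *group,
			    bool *coherent, bool *intr_remap)
{
	struct iommu_domain *domain;
	int ret;

	domain = iommu_domain_alloc(bus);
	if (!domain)
		return -EIO;

	ret = iommu_attach_group(domain, group);	/* disruptive if in use */
	if (ret)
		goto out_free;

	*coherent = iommu_domain_has_cap(domain, IOMMU_CAP_CACHE_COHERENCY);
	*intr_remap = iommu_domain_has_cap(domain, IOMMU_CAP_INTR_REMAP);

	iommu_detach_group(domain, group);
out_free:
	iommu_domain_free(domain);
	return ret;
}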

Perhaps extending the IOMMU API to ask about the domain capabilities of
a device without actually instantiating the domain would be an
interesting addition, but that doesn't exist today.  Thanks,

Alex

> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, February 18, 2014 1:54 AM
> > To: alex.williamson@redhat.com; kvm@vger.kernel.org
> > Cc: Sethi Varun-B16395; linux-kernel@vger.kernel.org
> > Subject: [PATCH 1/4] vfio/iommu_type1: Multi-IOMMU domain support
> > 
> > We currently have a problem that we cannot support advanced features of
> > an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee that
> > those features will be supported by all of the hardware units involved
> > with the domain over its lifetime.  For instance, the Intel VT-d
> > architecture does not require that all DRHDs support snoop control.  If
> > we create a domain based on a device behind a DRHD that does support
> > snoop control and enable SNP support via the IOMMU_CACHE mapping option,
> > we cannot then add a device behind a DRHD which does not support snoop
> > control or we'll get reserved bit faults from the SNP bit in the
> > pagetables.  To add to the complexity, we can't know the properties of a
> > domain until a device is attached.
> > 
> > We could pass this problem off to userspace and require that a separate
> > vfio container be used, but we don't know how to handle page accounting
> > in that case.  How do we know that a page pinned in one container is the
> > same page as a different container and avoid double billing the user for
> > the page.
> > 
> > The solution is therefore to support multiple IOMMU domains per
> > container.  In the majority of cases, only one domain will be required
> > since hardware is typically consistent within a system.  However, this
> > provides us the ability to validate compatibility of domains and support
> > mixed environments where page table flags can be different between
> > domains.
> > 
> > To do this, our DMA tracking needs to change.  We currently try to
> > coalesce user mappings into as few tracking entries as possible.  The
> > problem then becomes that we lose granularity of user mappings.  We've
> > never guaranteed that a user is able to unmap at a finer granularity than
> > the original mapping, but we must honor the granularity of the original
> > mapping.  This coalescing code is therefore removed, allowing only unmaps
> > covering complete maps.  The change in accounting is fairly small here, a
> > typical QEMU VM will start out with roughly a dozen entries, so it's
> > arguable if this coalescing was ever needed.
> > 
> > We also move IOMMU domain creation to the point where a group is attached
> > to the container.  An interesting side-effect of this is that we now have
> > access to the device at the time of domain creation and can probe the
> > devices within the group to determine the bus_type.
> > This finally makes vfio_iommu_type1 completely device/bus agnostic.
> > In fact, each IOMMU domain can host devices on different buses managed by
> > different physical IOMMUs, and present a single DMA mapping interface to
> > the user.  When a new domain is created, mappings are replayed to bring
> > the IOMMU pagetables up to the state of the current container.  And of
> > course, DMA mapping and unmapping automatically traverse all of the
> > configured IOMMU domains.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Varun Sethi <Varun.Sethi@freescale.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c |  637 +++++++++++++++++++++------------
> > ------
> >  include/uapi/linux/vfio.h       |    1
> >  2 files changed, 336 insertions(+), 302 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index 4fb7a8f..8c7bb9b 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -30,7 +30,6 @@
> >  #include <linux/iommu.h>
> >  #include <linux/module.h>
> >  #include <linux/mm.h>
> > -#include <linux/pci.h>		/* pci_bus_type */
> >  #include <linux/rbtree.h>
> >  #include <linux/sched.h>
> >  #include <linux/slab.h>
> > @@ -55,11 +54,17 @@ MODULE_PARM_DESC(disable_hugepages,
> >  		 "Disable VFIO IOMMU support for IOMMU hugepages.");
> > 
> >  struct vfio_iommu {
> > -	struct iommu_domain	*domain;
> > +	struct list_head	domain_list;
> >  	struct mutex		lock;
> >  	struct rb_root		dma_list;
> > +	bool v2;
> > +};
> > +
> > +struct vfio_domain {
> > +	struct iommu_domain	*domain;
> > +	struct list_head	next;
> >  	struct list_head	group_list;
> > -	bool			cache;
> > +	int			prot;		/* IOMMU_CACHE */
> >  };
> > 
> >  struct vfio_dma {
> > @@ -99,7 +104,7 @@ static struct vfio_dma *vfio_find_dma(struct
> > vfio_iommu *iommu,
> >  	return NULL;
> >  }
> > 
> > -static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma
> > *new)
> > +static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma
> > +*new)
> >  {
> >  	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
> >  	struct vfio_dma *dma;
> > @@ -118,7 +123,7 @@ static void vfio_insert_dma(struct vfio_iommu *iommu,
> > struct vfio_dma *new)
> >  	rb_insert_color(&new->node, &iommu->dma_list);  }
> > 
> > -static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma
> > *old)
> > +static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma
> > +*old)
> >  {
> >  	rb_erase(&old->node, &iommu->dma_list);  } @@ -322,32 +327,39 @@
> > static long vfio_unpin_pages(unsigned long pfn, long npage,
> >  	return unlocked;
> >  }
> > 
> > -static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma
> > *dma,
> > -			    dma_addr_t iova, size_t *size)
> > +static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma
> > +*dma)
> >  {
> > -	dma_addr_t start = iova, end = iova + *size;
> > +	dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
> > +	struct vfio_domain *domain, *d;
> >  	long unlocked = 0;
> > 
> > +	if (!dma->size)
> > +		return;
> > +	/*
> > +	 * We use the IOMMU to track the physical addresses, otherwise we'd
> > +	 * need a much more complicated tracking system.  Unfortunately
> > that
> > +	 * means we need to use one of the iommu domains to figure out the
> > +	 * pfns to unpin.  The rest need to be unmapped in advance so we
> > have
> > +	 * no iommu translations remaining when the pages are unpinned.
> > +	 */
> > +	domain = d = list_first_entry(&iommu->domain_list,
> > +				      struct vfio_domain, next);
> > +
> > +	list_for_each_entry_continue(d, &iommu->domain_list, next)
> > +		iommu_unmap(d->domain, dma->iova, dma->size);
> > +
> >  	while (iova < end) {
> >  		size_t unmapped;
> >  		phys_addr_t phys;
> > 
> > -		/*
> > -		 * We use the IOMMU to track the physical address.  This
> > -		 * saves us from having a lot more entries in our mapping
> > -		 * tree.  The downside is that we don't track the size
> > -		 * used to do the mapping.  We request unmap of a single
> > -		 * page, but expect IOMMUs that support large pages to
> > -		 * unmap a larger chunk.
> > -		 */
> > -		phys = iommu_iova_to_phys(iommu->domain, iova);
> > +		phys = iommu_iova_to_phys(domain->domain, iova);
> >  		if (WARN_ON(!phys)) {
> >  			iova += PAGE_SIZE;
> >  			continue;
> >  		}
> > 
> > -		unmapped = iommu_unmap(iommu->domain, iova, PAGE_SIZE);
> > -		if (!unmapped)
> > +		unmapped = iommu_unmap(domain->domain, iova, PAGE_SIZE);
> > +		if (WARN_ON(!unmapped))
> >  			break;
> > 
> >  		unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT, @@ -357,119
> > +369,26 @@ static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct
> > vfio_dma *dma,
> >  	}
> > 
> >  	vfio_lock_acct(-unlocked);
> > -
> > -	*size = iova - start;
> > -
> > -	return 0;
> >  }
> > 
> > -static int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> > start,
> > -				   size_t *size, struct vfio_dma *dma)
> > +static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma
> > +*dma)
> >  {
> > -	size_t offset, overlap, tmp;
> > -	struct vfio_dma *split;
> > -	int ret;
> > -
> > -	if (!*size)
> > -		return 0;
> > -
> > -	/*
> > -	 * Existing dma region is completely covered, unmap all.  This is
> > -	 * the likely case since userspace tends to map and unmap buffers
> > -	 * in one shot rather than multiple mappings within a buffer.
> > -	 */
> > -	if (likely(start <= dma->iova &&
> > -		   start + *size >= dma->iova + dma->size)) {
> > -		*size = dma->size;
> > -		ret = vfio_unmap_unpin(iommu, dma, dma->iova, size);
> > -		if (ret)
> > -			return ret;
> > -
> > -		/*
> > -		 * Did we remove more than we have?  Should never happen
> > -		 * since a vfio_dma is contiguous in iova and vaddr.
> > -		 */
> > -		WARN_ON(*size != dma->size);
> > -
> > -		vfio_remove_dma(iommu, dma);
> > -		kfree(dma);
> > -		return 0;
> > -	}
> > -
> > -	/* Overlap low address of existing range */
> > -	if (start <= dma->iova) {
> > -		overlap = start + *size - dma->iova;
> > -		ret = vfio_unmap_unpin(iommu, dma, dma->iova, &overlap);
> > -		if (ret)
> > -			return ret;
> > -
> > -		vfio_remove_dma(iommu, dma);
> > -
> > -		/*
> > -		 * Check, we may have removed to whole vfio_dma.  If not
> > -		 * fixup and re-insert.
> > -		 */
> > -		if (overlap < dma->size) {
> > -			dma->iova += overlap;
> > -			dma->vaddr += overlap;
> > -			dma->size -= overlap;
> > -			vfio_insert_dma(iommu, dma);
> > -		} else
> > -			kfree(dma);
> > -
> > -		*size = overlap;
> > -		return 0;
> > -	}
> > -
> > -	/* Overlap high address of existing range */
> > -	if (start + *size >= dma->iova + dma->size) {
> > -		offset = start - dma->iova;
> > -		overlap = dma->size - offset;
> > -
> > -		ret = vfio_unmap_unpin(iommu, dma, start, &overlap);
> > -		if (ret)
> > -			return ret;
> > -
> > -		dma->size -= overlap;
> > -		*size = overlap;
> > -		return 0;
> > -	}
> > -
> > -	/* Split existing */
> > -
> > -	/*
> > -	 * Allocate our tracking structure early even though it may not
> > -	 * be used.  An Allocation failure later loses track of pages and
> > -	 * is more difficult to unwind.
> > -	 */
> > -	split = kzalloc(sizeof(*split), GFP_KERNEL);
> > -	if (!split)
> > -		return -ENOMEM;
> > -
> > -	offset = start - dma->iova;
> > -
> > -	ret = vfio_unmap_unpin(iommu, dma, start, size);
> > -	if (ret || !*size) {
> > -		kfree(split);
> > -		return ret;
> > -	}
> > -
> > -	tmp = dma->size;
> > +	vfio_unmap_unpin(iommu, dma);
> > +	vfio_unlink_dma(iommu, dma);
> > +	kfree(dma);
> > +}
> > 
> > -	/* Resize the lower vfio_dma in place, before the below insert */
> > -	dma->size = offset;
> > +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu) {
> > +	struct vfio_domain *domain;
> > +	unsigned long bitmap = PAGE_MASK;
> > 
> > -	/* Insert new for remainder, assuming it didn't all get unmapped */
> > -	if (likely(offset + *size < tmp)) {
> > -		split->size = tmp - offset - *size;
> > -		split->iova = dma->iova + offset + *size;
> > -		split->vaddr = dma->vaddr + offset + *size;
> > -		split->prot = dma->prot;
> > -		vfio_insert_dma(iommu, split);
> > -	} else
> > -		kfree(split);
> > +	mutex_lock(&iommu->lock);
> > +	list_for_each_entry(domain, &iommu->domain_list, next)
> > +		bitmap &= domain->domain->ops->pgsize_bitmap;
> > +	mutex_unlock(&iommu->lock);
> > 
> > -	return 0;
> > +	return bitmap;
> >  }
> > 
> >  static int vfio_dma_do_unmap(struct vfio_iommu *iommu, @@ -477,10
> > +396,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,  {
> >  	uint64_t mask;
> >  	struct vfio_dma *dma;
> > -	size_t unmapped = 0, size;
> > +	size_t unmapped = 0;
> >  	int ret = 0;
> > 
> > -	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) -
> > 1;
> > +	mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > 
> >  	if (unmap->iova & mask)
> >  		return -EINVAL;
> > @@ -491,20 +410,61 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> > *iommu,
> > 
> >  	mutex_lock(&iommu->lock);
> > 
> > +	/*
> > +	 * vfio-iommu-type1 (v1) - User mappings were coalesced together to
> > +	 * avoid tracking individual mappings.  This means that the
> > granularity
> > +	 * of the original mapping was lost and the user was allowed to
> > attempt
> > +	 * to unmap any range.  Depending on the contiguousness of physical
> > +	 * memory and page sizes supported by the IOMMU, arbitrary unmaps
> > may
> > +	 * or may not have worked.  We only guaranteed unmap granularity
> > +	 * matching the original mapping; even though it was untracked
> > here,
> > +	 * the original mappings are reflected in IOMMU mappings.  This
> > +	 * resulted in a couple unusual behaviors.  First, if a range is
> > not
> > +	 * able to be unmapped, ex. a set of 4k pages that was mapped as a
> > +	 * 2M hugepage into the IOMMU, the unmap ioctl returns success but
> > with
> > +	 * a zero sized unmap.  Also, if an unmap request overlaps the
> > first
> > +	 * address of a hugepage, the IOMMU will unmap the entire hugepage.
> > +	 * This also returns success and the returned unmap size reflects
> > the
> > +	 * actual size unmapped.
> > +	 *
> > +	 * We attempt to maintain compatibility with this "v1" interface,
> > but
> > +	 * we take control out of the hands of the IOMMU.  Therefore, an
> > unmap
> > +	 * request offset from the beginning of the original mapping will
> > +	 * return success with zero sized unmap.  And an unmap request
> > covering
> > +	 * the first iova of mapping will unmap the entire range.
> > +	 *
> > +	 * The v2 version of this interface intends to be more
> > deterministic.
> > +	 * Unmap requests must fully cover previous mappings.  Multiple
> > +	 * mappings may still be unmaped by specifying large ranges, but
> > there
> > +	 * must not be any previous mappings bisected by the range.  An
> > error
> > +	 * will be returned if these conditions are not met.  The v2
> > interface
> > +	 * will only return success and a size of zero if there were no
> > +	 * mappings within the range.
> > +	 */
> > +	if (iommu->v2) {
> > +		dma = vfio_find_dma(iommu, unmap->iova, 0);
> > +		if (dma && dma->iova != unmap->iova) {
> > +			ret = -EINVAL;
> > +			goto unlock;
> > +		}
> > +		dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1, 0);
> > +		if (dma && dma->iova + dma->size != unmap->iova + unmap-
> > >size) {
> > +			ret = -EINVAL;
> > +			goto unlock;
> > +		}
> > +	}
> > +
> >  	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
> > -		size = unmap->size;
> > -		ret = vfio_remove_dma_overlap(iommu, unmap->iova, &size,
> > dma);
> > -		if (ret || !size)
> > +		if (!iommu->v2 && unmap->iova > dma->iova)
> >  			break;
> > -		unmapped += size;
> > +		unmapped += dma->size;
> > +		vfio_remove_dma(iommu, dma);
> >  	}
> > 
> > +unlock:
> >  	mutex_unlock(&iommu->lock);
> > 
> > -	/*
> > -	 * We may unmap more than requested, update the unmap struct so
> > -	 * userspace can know.
> > -	 */
> > +	/* Report how much was unmapped */
> >  	unmap->size = unmapped;
> > 
> >  	return ret;
> > @@ -516,22 +476,47 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> > *iommu,
> >   * soon, so this is just a temporary workaround to break mappings down
> > into
> >   * PAGE_SIZE.  Better to map smaller pages than nothing.
> >   */
> > -static int map_try_harder(struct vfio_iommu *iommu, dma_addr_t iova,
> > +static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
> >  			  unsigned long pfn, long npage, int prot)  {
> >  	long i;
> >  	int ret;
> > 
> >  	for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
> > -		ret = iommu_map(iommu->domain, iova,
> > +		ret = iommu_map(domain->domain, iova,
> >  				(phys_addr_t)pfn << PAGE_SHIFT,
> > -				PAGE_SIZE, prot);
> > +				PAGE_SIZE, prot | domain->prot);
> >  		if (ret)
> >  			break;
> >  	}
> > 
> >  	for (; i < npage && i > 0; i--, iova -= PAGE_SIZE)
> > -		iommu_unmap(iommu->domain, iova, PAGE_SIZE);
> > +		iommu_unmap(domain->domain, iova, PAGE_SIZE);
> > +
> > +	return ret;
> > +}
> > +
> > +static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
> > +			  unsigned long pfn, long npage, int prot) {
> > +	struct vfio_domain *d;
> > +	int ret;
> > +
> > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > +		ret = iommu_map(d->domain, iova, (phys_addr_t)pfn <<
> > PAGE_SHIFT,
> > +				npage << PAGE_SHIFT, prot | d->prot);
> > +		if (ret) {
> > +			if (ret != -EBUSY ||
> > +			    map_try_harder(d, iova, pfn, npage, prot))
> > +				goto unwind;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +
> > +unwind:
> > +	list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
> > +		iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);
> > 
> >  	return ret;
> >  }
> > @@ -545,12 +530,12 @@ static int vfio_dma_do_map(struct vfio_iommu
> > *iommu,
> >  	long npage;
> >  	int ret = 0, prot = 0;
> >  	uint64_t mask;
> > -	struct vfio_dma *dma = NULL;
> > +	struct vfio_dma *dma;
> >  	unsigned long pfn;
> > 
> >  	end = map->iova + map->size;
> > 
> > -	mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) -
> > 1;
> > +	mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > 
> >  	/* READ/WRITE from device perspective */
> >  	if (map->flags & VFIO_DMA_MAP_FLAG_WRITE) @@ -561,9 +546,6 @@
> > static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >  	if (!prot)
> >  		return -EINVAL; /* No READ/WRITE? */
> > 
> > -	if (iommu->cache)
> > -		prot |= IOMMU_CACHE;
> > -
> >  	if (vaddr & mask)
> >  		return -EINVAL;
> >  	if (map->iova & mask)
> > @@ -588,180 +570,257 @@ static int vfio_dma_do_map(struct vfio_iommu
> > *iommu,
> >  		return -EEXIST;
> >  	}
> > 
> > -	for (iova = map->iova; iova < end; iova += size, vaddr += size) {
> > -		long i;
> > +	dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> > +	if (!dma) {
> > +		mutex_unlock(&iommu->lock);
> > +		return -ENOMEM;
> > +	}
> > +
> > +	dma->iova = map->iova;
> > +	dma->vaddr = map->vaddr;
> > +	dma->prot = prot;
> > 
> > +	/* Insert zero-sized and grow as we map chunks of it */
> > +	vfio_link_dma(iommu, dma);
> > +
> > +	for (iova = map->iova; iova < end; iova += size, vaddr += size) {
> >  		/* Pin a contiguous chunk of memory */
> >  		npage = vfio_pin_pages(vaddr, (end - iova) >> PAGE_SHIFT,
> >  				       prot, &pfn);
> >  		if (npage <= 0) {
> >  			WARN_ON(!npage);
> >  			ret = (int)npage;
> > -			goto out;
> > -		}
> > -
> > -		/* Verify pages are not already mapped */
> > -		for (i = 0; i < npage; i++) {
> > -			if (iommu_iova_to_phys(iommu->domain,
> > -					       iova + (i << PAGE_SHIFT))) {
> > -				ret = -EBUSY;
> > -				goto out_unpin;
> > -			}
> > +			break;
> >  		}
> > 
> > -		ret = iommu_map(iommu->domain, iova,
> > -				(phys_addr_t)pfn << PAGE_SHIFT,
> > -				npage << PAGE_SHIFT, prot);
> > +		/* Map it! */
> > +		ret = vfio_iommu_map(iommu, iova, pfn, npage, prot);
> >  		if (ret) {
> > -			if (ret != -EBUSY ||
> > -			    map_try_harder(iommu, iova, pfn, npage, prot)) {
> > -				goto out_unpin;
> > -			}
> > +			vfio_unpin_pages(pfn, npage, prot, true);
> > +			break;
> >  		}
> > 
> >  		size = npage << PAGE_SHIFT;
> > +		dma->size += size;
> > +	}
> > 
> > -		/*
> > -		 * Check if we abut a region below - nothing below 0.
> > -		 * This is the most likely case when mapping chunks of
> > -		 * physically contiguous regions within a virtual address
> > -		 * range.  Update the abutting entry in place since iova
> > -		 * doesn't change.
> > -		 */
> > -		if (likely(iova)) {
> > -			struct vfio_dma *tmp;
> > -			tmp = vfio_find_dma(iommu, iova - 1, 1);
> > -			if (tmp && tmp->prot == prot &&
> > -			    tmp->vaddr + tmp->size == vaddr) {
> > -				tmp->size += size;
> > -				iova = tmp->iova;
> > -				size = tmp->size;
> > -				vaddr = tmp->vaddr;
> > -				dma = tmp;
> > -			}
> > -		}
> > +	if (ret)
> > +		vfio_remove_dma(iommu, dma);
> > 
> > -		/*
> > -		 * Check if we abut a region above - nothing above ~0 + 1.
> > -		 * If we abut above and below, remove and free.  If only
> > -		 * abut above, remove, modify, reinsert.
> > -		 */
> > -		if (likely(iova + size)) {
> > -			struct vfio_dma *tmp;
> > -			tmp = vfio_find_dma(iommu, iova + size, 1);
> > -			if (tmp && tmp->prot == prot &&
> > -			    tmp->vaddr == vaddr + size) {
> > -				vfio_remove_dma(iommu, tmp);
> > -				if (dma) {
> > -					dma->size += tmp->size;
> > -					kfree(tmp);
> > -				} else {
> > -					size += tmp->size;
> > -					tmp->size = size;
> > -					tmp->iova = iova;
> > -					tmp->vaddr = vaddr;
> > -					vfio_insert_dma(iommu, tmp);
> > -					dma = tmp;
> > -				}
> > -			}
> > -		}
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> > +static int vfio_bus_type(struct device *dev, void *data) {
> > +	struct bus_type **bus = data;
> > +
> > +	if (*bus && *bus != dev->bus)
> > +		return -EINVAL;
> > +
> > +	*bus = dev->bus;
> > +
> > +	return 0;
> > +}
> > +
> > +static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > +			     struct vfio_domain *domain)
> > +{
> > +	struct vfio_domain *d;
> > +	struct rb_node *n;
> > +	int ret;
> > +
> > +	/* Arbitrarily pick the first domain in the list for lookups */
> > +	d = list_first_entry(&iommu->domain_list, struct vfio_domain,
> > next);
> > +	n = rb_first(&iommu->dma_list);
> > +
> > +	/* If there's not a domain, there better not be any mappings */
> > +	if (WARN_ON(n && !d))
> > +		return -EINVAL;
> > +
> > +	for (; n; n = rb_next(n)) {
> > +		struct vfio_dma *dma;
> > +		dma_addr_t iova;
> > +
> > +		dma = rb_entry(n, struct vfio_dma, node);
> > +		iova = dma->iova;
> > +
> > +		while (iova < dma->iova + dma->size) {
> > +			phys_addr_t phys = iommu_iova_to_phys(d->domain, iova);
> > +			size_t size;
> > 
> > -		if (!dma) {
> > -			dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> > -			if (!dma) {
> > -				iommu_unmap(iommu->domain, iova, size);
> > -				ret = -ENOMEM;
> > -				goto out_unpin;
> > +			if (WARN_ON(!phys)) {
> > +				iova += PAGE_SIZE;
> > +				continue;
> >  			}
> > 
> > -			dma->size = size;
> > -			dma->iova = iova;
> > -			dma->vaddr = vaddr;
> > -			dma->prot = prot;
> > -			vfio_insert_dma(iommu, dma);
> > -		}
> > -	}
> > +			size = PAGE_SIZE;
> > 
> > -	WARN_ON(ret);
> > -	mutex_unlock(&iommu->lock);
> > -	return ret;
> > +			while (iova + size < dma->iova + dma->size &&
> > +			       phys + size == iommu_iova_to_phys(d->domain,
> > +								 iova + size))
> > +				size += PAGE_SIZE;
> > 
> > -out_unpin:
> > -	vfio_unpin_pages(pfn, npage, prot, true);
> > +			ret = iommu_map(domain->domain, iova, phys,
> > +					size, dma->prot | domain->prot);
> > +			if (ret)
> > +				return ret;
> > 
> > -out:
> > -	iova = map->iova;
> > -	size = map->size;
> > -	while ((dma = vfio_find_dma(iommu, iova, size))) {
> > -		int r = vfio_remove_dma_overlap(iommu, iova,
> > -						&size, dma);
> > -		if (WARN_ON(r || !size))
> > -			break;
> > +			iova += size;
> > +		}
> >  	}
> > 
> > -	mutex_unlock(&iommu->lock);
> > -	return ret;
> > +	return 0;
> >  }
> > 
> >  static int vfio_iommu_type1_attach_group(void *iommu_data,
> >  					 struct iommu_group *iommu_group)
> >  {
> >  	struct vfio_iommu *iommu = iommu_data;
> > -	struct vfio_group *group, *tmp;
> > +	struct vfio_group *group, *g;
> > +	struct vfio_domain *domain, *d;
> > +	struct bus_type *bus = NULL;
> >  	int ret;
> > 
> > -	group = kzalloc(sizeof(*group), GFP_KERNEL);
> > -	if (!group)
> > -		return -ENOMEM;
> > -
> >  	mutex_lock(&iommu->lock);
> > 
> > -	list_for_each_entry(tmp, &iommu->group_list, next) {
> > -		if (tmp->iommu_group == iommu_group) {
> > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > +		list_for_each_entry(g, &d->group_list, next) {
> > +			if (g->iommu_group != iommu_group)
> > +				continue;
> > +
> >  			mutex_unlock(&iommu->lock);
> > -			kfree(group);
> >  			return -EINVAL;
> >  		}
> >  	}
> > 
> > +	group = kzalloc(sizeof(*group), GFP_KERNEL);
> > +	domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> > +	if (!group || !domain) {
> > +		ret = -ENOMEM;
> > +		goto out_free;
> > +	}
> > +
> > +	group->iommu_group = iommu_group;
> > +
> > +	/* Determine bus_type in order to allocate a domain */
> > +	ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
> > +	if (ret)
> > +		goto out_free;
> > +
> > +	domain->domain = iommu_domain_alloc(bus);
> > +	if (!domain->domain) {
> > +		ret = -EIO;
> > +		goto out_free;
> > +	}
> > +
> > +	ret = iommu_attach_group(domain->domain, iommu_group);
> > +	if (ret)
> > +		goto out_domain;
> > +
> > +	INIT_LIST_HEAD(&domain->group_list);
> > +	list_add(&group->next, &domain->group_list);
> > +
> > +	if (!allow_unsafe_interrupts &&
> > +	    !iommu_domain_has_cap(domain->domain, IOMMU_CAP_INTR_REMAP)) {
> > +		pr_warn("%s: No interrupt remapping support.  Use the module
> > param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this
> > platform\n",
> > +		       __func__);
> > +		ret = -EPERM;
> > +		goto out_detach;
> > +	}
> > +
> > +	if (iommu_domain_has_cap(domain->domain,
> > IOMMU_CAP_CACHE_COHERENCY))
> > +		domain->prot |= IOMMU_CACHE;
> > +
> >  	/*
> > -	 * TODO: Domain have capabilities that might change as we add
> > -	 * groups (see iommu->cache, currently never set).  Check for
> > -	 * them and potentially disallow groups to be attached when it
> > -	 * would change capabilities (ugh).
> > +	 * Try to match an existing compatible domain.  We don't want to
> > +	 * preclude an IOMMU driver supporting multiple bus_types and being
> > +	 * able to include different bus_types in the same IOMMU domain, so
> > +	 * we test whether the domains use the same iommu_ops rather than
> > +	 * testing if they're on the same bus_type.
> >  	 */
> > -	ret = iommu_attach_group(iommu->domain, iommu_group);
> > -	if (ret) {
> > -		mutex_unlock(&iommu->lock);
> > -		kfree(group);
> > -		return ret;
> > +	list_for_each_entry(d, &iommu->domain_list, next) {
> > +		if (d->domain->ops == domain->domain->ops &&
> > +		    d->prot == domain->prot) {
> > +			iommu_detach_group(domain->domain, iommu_group);
> > +			if (!iommu_attach_group(d->domain, iommu_group)) {
> > +				list_add(&group->next, &d->group_list);
> > +				iommu_domain_free(domain->domain);
> > +				kfree(domain);
> > +				mutex_unlock(&iommu->lock);
> > +				return 0;
> > +			}
> > +
> > +			ret = iommu_attach_group(domain->domain, iommu_group);
> > +			if (ret)
> > +				goto out_domain;
> > +		}
> >  	}
> > 
> > -	group->iommu_group = iommu_group;
> > -	list_add(&group->next, &iommu->group_list);
> > +	/* replay mappings on new domains */
> > +	ret = vfio_iommu_replay(iommu, domain);
> > +	if (ret)
> > +		goto out_detach;
> > +
> > +	list_add(&domain->next, &iommu->domain_list);
> > 
> >  	mutex_unlock(&iommu->lock);
> > 
> >  	return 0;
> > +
> > +out_detach:
> > +	iommu_detach_group(domain->domain, iommu_group);
> > +out_domain:
> > +	iommu_domain_free(domain->domain);
> > +out_free:
> > +	kfree(domain);
> > +	kfree(group);
> > +	mutex_unlock(&iommu->lock);
> > +	return ret;
> > +}
> > +
> > +static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu) {
> > +	struct rb_node *node;
> > +
> > +	while ((node = rb_first(&iommu->dma_list)))
> > +		vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma,
> > node));
> >  }
> > 
> >  static void vfio_iommu_type1_detach_group(void *iommu_data,
> >  					  struct iommu_group *iommu_group)  {
> >  	struct vfio_iommu *iommu = iommu_data;
> > +	struct vfio_domain *domain;
> >  	struct vfio_group *group;
> > 
> >  	mutex_lock(&iommu->lock);
> > 
> > -	list_for_each_entry(group, &iommu->group_list, next) {
> > -		if (group->iommu_group == iommu_group) {
> > -			iommu_detach_group(iommu->domain, iommu_group);
> > +	list_for_each_entry(domain, &iommu->domain_list, next) {
> > +		list_for_each_entry(group, &domain->group_list, next) {
> > +			if (group->iommu_group != iommu_group)
> > +				continue;
> > +
> > +			iommu_detach_group(domain->domain, iommu_group);
> >  			list_del(&group->next);
> >  			kfree(group);
> > -			break;
> > +			/*
> > +			 * Group ownership provides privilege, if the group
> > +			 * list is empty, the domain goes away.  If it's the
> > +			 * last domain, then all the mappings go away too.
> > +			 */
> > +			if (list_empty(&domain->group_list)) {
> > +				if (list_is_singular(&iommu->domain_list))
> > +					vfio_iommu_unmap_unpin_all(iommu);
> > +				iommu_domain_free(domain->domain);
> > +				list_del(&domain->next);
> > +				kfree(domain);
> > +			}
> > +			goto done;
> >  		}
> >  	}
> > 
> > +done:
> >  	mutex_unlock(&iommu->lock);
> >  }
> > 
> > @@ -769,40 +828,17 @@ static void *vfio_iommu_type1_open(unsigned long
> > arg)  {
> >  	struct vfio_iommu *iommu;
> > 
> > -	if (arg != VFIO_TYPE1_IOMMU)
> > +	if (arg != VFIO_TYPE1_IOMMU && arg != VFIO_TYPE1v2_IOMMU)
> >  		return ERR_PTR(-EINVAL);
> > 
> >  	iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> >  	if (!iommu)
> >  		return ERR_PTR(-ENOMEM);
> > 
> > -	INIT_LIST_HEAD(&iommu->group_list);
> > +	INIT_LIST_HEAD(&iommu->domain_list);
> >  	iommu->dma_list = RB_ROOT;
> >  	mutex_init(&iommu->lock);
> > -
> > -	/*
> > -	 * Wish we didn't have to know about bus_type here.
> > -	 */
> > -	iommu->domain = iommu_domain_alloc(&pci_bus_type);
> > -	if (!iommu->domain) {
> > -		kfree(iommu);
> > -		return ERR_PTR(-EIO);
> > -	}
> > -
> > -	/*
> > -	 * Wish we could specify required capabilities rather than create
> > -	 * a domain, see what comes out and hope it doesn't change along
> > -	 * the way.  Fortunately we know interrupt remapping is global for
> > -	 * our iommus.
> > -	 */
> > -	if (!allow_unsafe_interrupts &&
> > -	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> > -		pr_warn("%s: No interrupt remapping support.  Use the module
> > param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on this
> > platform\n",
> > -		       __func__);
> > -		iommu_domain_free(iommu->domain);
> > -		kfree(iommu);
> > -		return ERR_PTR(-EPERM);
> > -	}
> > +	iommu->v2 = (arg == VFIO_TYPE1v2_IOMMU);
> > 
> >  	return iommu;
> >  }
> > @@ -810,25 +846,24 @@ static void *vfio_iommu_type1_open(unsigned long
> > arg)  static void vfio_iommu_type1_release(void *iommu_data)  {
> >  	struct vfio_iommu *iommu = iommu_data;
> > +	struct vfio_domain *domain, *domain_tmp;
> >  	struct vfio_group *group, *group_tmp;
> > -	struct rb_node *node;
> > 
> > -	list_for_each_entry_safe(group, group_tmp, &iommu->group_list,
> > next) {
> > -		iommu_detach_group(iommu->domain, group->iommu_group);
> > -		list_del(&group->next);
> > -		kfree(group);
> > -	}
> > +	vfio_iommu_unmap_unpin_all(iommu);
> > 
> > -	while ((node = rb_first(&iommu->dma_list))) {
> > -		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> > -		size_t size = dma->size;
> > -		vfio_remove_dma_overlap(iommu, dma->iova, &size, dma);
> > -		if (WARN_ON(!size))
> > -			break;
> > +	list_for_each_entry_safe(domain, domain_tmp,
> > +				 &iommu->domain_list, next) {
> > +		list_for_each_entry_safe(group, group_tmp,
> > +					 &domain->group_list, next) {
> > +			iommu_detach_group(domain->domain, group->iommu_group);
> > +			list_del(&group->next);
> > +			kfree(group);
> > +		}
> > +		iommu_domain_free(domain->domain);
> > +		list_del(&domain->next);
> > +		kfree(domain);
> >  	}
> > 
> > -	iommu_domain_free(iommu->domain);
> > -	iommu->domain = NULL;
> >  	kfree(iommu);
> >  }
> > 
> > @@ -841,6 +876,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >  	if (cmd == VFIO_CHECK_EXTENSION) {
> >  		switch (arg) {
> >  		case VFIO_TYPE1_IOMMU:
> > +		case VFIO_TYPE1v2_IOMMU:
> >  			return 1;
> >  		default:
> >  			return 0;
> > @@ -858,7 +894,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> > 
> >  		info.flags = 0;
> > 
> > -		info.iova_pgsizes = iommu->domain->ops->pgsize_bitmap;
> > +		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
> > 
> >  		return copy_to_user((void __user *)arg, &info, minsz);
> > 
> > @@ -911,9 +947,6 @@ static const struct vfio_iommu_driver_ops
> > vfio_iommu_driver_ops_type1 = {
> > 
> >  static int __init vfio_iommu_type1_init(void)  {
> > -	if (!iommu_present(&pci_bus_type))
> > -		return -ENODEV;
> > -
> >  	return vfio_register_iommu_driver(&vfio_iommu_driver_ops_type1);
> >  }
> > 
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index
> > 0fd47f5..460fdf2 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -23,6 +23,7 @@
> > 
> >  #define VFIO_TYPE1_IOMMU		1
> >  #define VFIO_SPAPR_TCE_IOMMU		2
> > +#define VFIO_TYPE1v2_IOMMU		3
> > 
> >  /*
> >   * The IOCTL interface is designed for extensibility by embedding the
> > 
> > 
> 




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/4] vfio/iommu_type1: Multi-IOMMU domain support
       [not found]   ` <CAKKYfmFrRF8U3NKhkSgepDfKNeLQ8+2yiz8VSFBzMc58dNAPvg@mail.gmail.com>
@ 2014-03-18 14:30     ` Alex Williamson
  0 siblings, 0 replies; 10+ messages in thread
From: Alex Williamson @ 2014-03-18 14:30 UTC (permalink / raw)
  To: Dennis Mungai; +Cc: Varun Sethi, kvm, linux-kernel

On Tue, 2014-03-18 at 14:02 +0300, Dennis Mungai wrote:
> Also, on the same note, do Intel processors from Lynnfield (2009 Core i7s)
> and Arrandale (2010 Mobile Intel Core processors) that advertise VT-d
> support handle these advanced features?

You would need to check the capability registers of the physical
hardware or look for it in the datasheets.  You can decode it from dmesg
by looking for these lines:

dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c0000020e60262 ecap f0101a
dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c9008020660262 ecap f0105a

On Intel, IOMMU_CACHE is available with Snoop Control:

#define ecap_sc_support(e)      ((e >> 7) & 0x1) /* Snooping Control */

So neither of the IOMMUs in the example above supports it (an Ivy Bridge
mobile system).
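
As a quick sketch, the same test can be applied from user space to an ecap
value copied out of dmesg; this is only illustrative, and the built-in
default below is simply the first ecap value quoted above:

/* decode_sc.c - report whether a VT-d ecap value advertises Snoop Control.
 * Mirrors the ecap_sc_support() test above: bit 7 of the extended caps. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	/* default to the first ecap value from the dmesg example above */
	unsigned long long ecap = strtoull(argc > 1 ? argv[1] : "f0101a",
					   NULL, 16);

	printf("ecap %llx: snoop control %ssupported\n",
	       ecap, ((ecap >> 7) & 0x1) ? "" : "not ");
	return 0;
}

Run against either value above it prints "not supported", matching the
conclusion here; a DRHD with Snoop Control would have bit 7 set.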

> Apparently, even in cases where both the processor and the chipset handle
> VT-d for the IOMMU, BIOS vendors mess it up so badly that, at times, even
> so-called workarounds don't help much.

The change here presupposes working IOMMU support; it's not going to fix
your BIOS.  Much of the intel-iommu code tries to avoid trusting the
BIOS, but basic DMAR support is required.  Thanks,

Alex

> On 17 Feb 2014 23:25, "Alex Williamson" <alex.williamson@redhat.com> wrote:
> 
> > We currently have a problem that we cannot support advanced features
> > of an IOMMU domain (ex. IOMMU_CACHE), because we have no guarantee
> > that those features will be supported by all of the hardware units
> > involved with the domain over its lifetime.  For instance, the Intel
> > VT-d architecture does not require that all DRHDs support snoop
> > control.  If we create a domain based on a device behind a DRHD that
> > does support snoop control and enable SNP support via the IOMMU_CACHE
> > mapping option, we cannot then add a device behind a DRHD which does
> > not support snoop control or we'll get reserved bit faults from the
> > SNP bit in the pagetables.  To add to the complexity, we can't know
> > the properties of a domain until a device is attached.
> >
> > We could pass this problem off to userspace and require that a
> > separate vfio container be used, but we don't know how to handle page
> > accounting in that case.  How do we know that a page pinned in one
> > container is the same page as in another container, and how do we avoid
> > double-billing the user for that page?
> >
> > The solution is therefore to support multiple IOMMU domains per
> > container.  In the majority of cases, only one domain will be required
> > since hardware is typically consistent within a system.  However, this
> > provides us the ability to validate compatibility of domains and
> > support mixed environments where page table flags can be different
> > between domains.
> >
> > To do this, our DMA tracking needs to change.  We currently try to
> > coalesce user mappings into as few tracking entries as possible.  The
> > problem then becomes that we lose granularity of user mappings.  We've
> > never guaranteed that a user is able to unmap at a finer granularity
> > than the original mapping, but we must honor the granularity of the
> > original mapping.  This coalescing code is therefore removed, allowing
> > only unmaps covering complete maps.  The change in accounting is
> > fairly small here, a typical QEMU VM will start out with roughly a
> > dozen entries, so it's arguable if this coalescing was ever needed.
> >
> > We also move IOMMU domain creation to the point where a group is
> > attached to the container.  An interesting side-effect of this is that
> > we now have access to the device at the time of domain creation and
> > can probe the devices within the group to determine the bus_type.
> > This finally makes vfio_iommu_type1 completely device/bus agnostic.
> > In fact, each IOMMU domain can host devices on different buses managed
> > by different physical IOMMUs, and present a single DMA mapping
> > interface to the user.  When a new domain is created, mappings are
> > replayed to bring the IOMMU pagetables up to the state of the current
> > container.  And of course, DMA mapping and unmapping automatically
> > traverse all of the configured IOMMU domains.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > Cc: Varun Sethi <Varun.Sethi@freescale.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c |  637
> > +++++++++++++++++++++------------------
> >  include/uapi/linux/vfio.h       |    1
> >  2 files changed, 336 insertions(+), 302 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c
> > index 4fb7a8f..8c7bb9b 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -30,7 +30,6 @@
> >  #include <linux/iommu.h>
> >  #include <linux/module.h>
> >  #include <linux/mm.h>
> > -#include <linux/pci.h>         /* pci_bus_type */
> >  #include <linux/rbtree.h>
> >  #include <linux/sched.h>
> >  #include <linux/slab.h>
> > @@ -55,11 +54,17 @@ MODULE_PARM_DESC(disable_hugepages,
> >                  "Disable VFIO IOMMU support for IOMMU hugepages.");
> >
> >  struct vfio_iommu {
> > -       struct iommu_domain     *domain;
> > +       struct list_head        domain_list;
> >         struct mutex            lock;
> >         struct rb_root          dma_list;
> > +       bool v2;
> > +};
> > +
> > +struct vfio_domain {
> > +       struct iommu_domain     *domain;
> > +       struct list_head        next;
> >         struct list_head        group_list;
> > -       bool                    cache;
> > +       int                     prot;           /* IOMMU_CACHE */
> >  };
> >
> >  struct vfio_dma {
> > @@ -99,7 +104,7 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu
> > *iommu,
> >         return NULL;
> >  }
> >
> > -static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma
> > *new)
> > +static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
> >  {
> >         struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
> >         struct vfio_dma *dma;
> > @@ -118,7 +123,7 @@ static void vfio_insert_dma(struct vfio_iommu *iommu,
> > struct vfio_dma *new)
> >         rb_insert_color(&new->node, &iommu->dma_list);
> >  }
> >
> > -static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma
> > *old)
> > +static void vfio_unlink_dma(struct vfio_iommu *iommu, struct vfio_dma
> > *old)
> >  {
> >         rb_erase(&old->node, &iommu->dma_list);
> >  }
> > @@ -322,32 +327,39 @@ static long vfio_unpin_pages(unsigned long pfn, long
> > npage,
> >         return unlocked;
> >  }
> >
> > -static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma
> > *dma,
> > -                           dma_addr_t iova, size_t *size)
> > +static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma
> > *dma)
> >  {
> > -       dma_addr_t start = iova, end = iova + *size;
> > +       dma_addr_t iova = dma->iova, end = dma->iova + dma->size;
> > +       struct vfio_domain *domain, *d;
> >         long unlocked = 0;
> >
> > +       if (!dma->size)
> > +               return;
> > +       /*
> > +        * We use the IOMMU to track the physical addresses, otherwise we'd
> > +        * need a much more complicated tracking system.  Unfortunately
> > that
> > +        * means we need to use one of the iommu domains to figure out the
> > +        * pfns to unpin.  The rest need to be unmapped in advance so we
> > have
> > +        * no iommu translations remaining when the pages are unpinned.
> > +        */
> > +       domain = d = list_first_entry(&iommu->domain_list,
> > +                                     struct vfio_domain, next);
> > +
> > +       list_for_each_entry_continue(d, &iommu->domain_list, next)
> > +               iommu_unmap(d->domain, dma->iova, dma->size);
> > +
> >         while (iova < end) {
> >                 size_t unmapped;
> >                 phys_addr_t phys;
> >
> > -               /*
> > -                * We use the IOMMU to track the physical address.  This
> > -                * saves us from having a lot more entries in our mapping
> > -                * tree.  The downside is that we don't track the size
> > -                * used to do the mapping.  We request unmap of a single
> > -                * page, but expect IOMMUs that support large pages to
> > -                * unmap a larger chunk.
> > -                */
> > -               phys = iommu_iova_to_phys(iommu->domain, iova);
> > +               phys = iommu_iova_to_phys(domain->domain, iova);
> >                 if (WARN_ON(!phys)) {
> >                         iova += PAGE_SIZE;
> >                         continue;
> >                 }
> >
> > -               unmapped = iommu_unmap(iommu->domain, iova, PAGE_SIZE);
> > -               if (!unmapped)
> > +               unmapped = iommu_unmap(domain->domain, iova, PAGE_SIZE);
> > +               if (WARN_ON(!unmapped))
> >                         break;
> >
> >                 unlocked += vfio_unpin_pages(phys >> PAGE_SHIFT,
> > @@ -357,119 +369,26 @@ static int vfio_unmap_unpin(struct vfio_iommu
> > *iommu, struct vfio_dma *dma,
> >         }
> >
> >         vfio_lock_acct(-unlocked);
> > -
> > -       *size = iova - start;
> > -
> > -       return 0;
> >  }
> >
> > -static int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> > start,
> > -                                  size_t *size, struct vfio_dma *dma)
> > +static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma
> > *dma)
> >  {
> > -       size_t offset, overlap, tmp;
> > -       struct vfio_dma *split;
> > -       int ret;
> > -
> > -       if (!*size)
> > -               return 0;
> > -
> > -       /*
> > -        * Existing dma region is completely covered, unmap all.  This is
> > -        * the likely case since userspace tends to map and unmap buffers
> > -        * in one shot rather than multiple mappings within a buffer.
> > -        */
> > -       if (likely(start <= dma->iova &&
> > -                  start + *size >= dma->iova + dma->size)) {
> > -               *size = dma->size;
> > -               ret = vfio_unmap_unpin(iommu, dma, dma->iova, size);
> > -               if (ret)
> > -                       return ret;
> > -
> > -               /*
> > -                * Did we remove more than we have?  Should never happen
> > -                * since a vfio_dma is contiguous in iova and vaddr.
> > -                */
> > -               WARN_ON(*size != dma->size);
> > -
> > -               vfio_remove_dma(iommu, dma);
> > -               kfree(dma);
> > -               return 0;
> > -       }
> > -
> > -       /* Overlap low address of existing range */
> > -       if (start <= dma->iova) {
> > -               overlap = start + *size - dma->iova;
> > -               ret = vfio_unmap_unpin(iommu, dma, dma->iova, &overlap);
> > -               if (ret)
> > -                       return ret;
> > -
> > -               vfio_remove_dma(iommu, dma);
> > -
> > -               /*
> > -                * Check, we may have removed to whole vfio_dma.  If not
> > -                * fixup and re-insert.
> > -                */
> > -               if (overlap < dma->size) {
> > -                       dma->iova += overlap;
> > -                       dma->vaddr += overlap;
> > -                       dma->size -= overlap;
> > -                       vfio_insert_dma(iommu, dma);
> > -               } else
> > -                       kfree(dma);
> > -
> > -               *size = overlap;
> > -               return 0;
> > -       }
> > -
> > -       /* Overlap high address of existing range */
> > -       if (start + *size >= dma->iova + dma->size) {
> > -               offset = start - dma->iova;
> > -               overlap = dma->size - offset;
> > -
> > -               ret = vfio_unmap_unpin(iommu, dma, start, &overlap);
> > -               if (ret)
> > -                       return ret;
> > -
> > -               dma->size -= overlap;
> > -               *size = overlap;
> > -               return 0;
> > -       }
> > -
> > -       /* Split existing */
> > -
> > -       /*
> > -        * Allocate our tracking structure early even though it may not
> > -        * be used.  An Allocation failure later loses track of pages and
> > -        * is more difficult to unwind.
> > -        */
> > -       split = kzalloc(sizeof(*split), GFP_KERNEL);
> > -       if (!split)
> > -               return -ENOMEM;
> > -
> > -       offset = start - dma->iova;
> > -
> > -       ret = vfio_unmap_unpin(iommu, dma, start, size);
> > -       if (ret || !*size) {
> > -               kfree(split);
> > -               return ret;
> > -       }
> > -
> > -       tmp = dma->size;
> > +       vfio_unmap_unpin(iommu, dma);
> > +       vfio_unlink_dma(iommu, dma);
> > +       kfree(dma);
> > +}
> >
> > -       /* Resize the lower vfio_dma in place, before the below insert */
> > -       dma->size = offset;
> > +static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> > +{
> > +       struct vfio_domain *domain;
> > +       unsigned long bitmap = PAGE_MASK;
> >
> > -       /* Insert new for remainder, assuming it didn't all get unmapped */
> > -       if (likely(offset + *size < tmp)) {
> > -               split->size = tmp - offset - *size;
> > -               split->iova = dma->iova + offset + *size;
> > -               split->vaddr = dma->vaddr + offset + *size;
> > -               split->prot = dma->prot;
> > -               vfio_insert_dma(iommu, split);
> > -       } else
> > -               kfree(split);
> > +       mutex_lock(&iommu->lock);
> > +       list_for_each_entry(domain, &iommu->domain_list, next)
> > +               bitmap &= domain->domain->ops->pgsize_bitmap;
> > +       mutex_unlock(&iommu->lock);
> >
> > -       return 0;
> > +       return bitmap;
> >  }
> >
> >  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > @@ -477,10 +396,10 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> > *iommu,
> >  {
> >         uint64_t mask;
> >         struct vfio_dma *dma;
> > -       size_t unmapped = 0, size;
> > +       size_t unmapped = 0;
> >         int ret = 0;
> >
> > -       mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) -
> > 1;
> > +       mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >
> >         if (unmap->iova & mask)
> >                 return -EINVAL;
> > @@ -491,20 +410,61 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> > *iommu,
> >
> >         mutex_lock(&iommu->lock);
> >
> > +       /*
> > +        * vfio-iommu-type1 (v1) - User mappings were coalesced together to
> > +        * avoid tracking individual mappings.  This means that the
> > granularity
> > +        * of the original mapping was lost and the user was allowed to
> > attempt
> > +        * to unmap any range.  Depending on the contiguousness of physical
> > +        * memory and page sizes supported by the IOMMU, arbitrary unmaps
> > may
> > +        * or may not have worked.  We only guaranteed unmap granularity
> > +        * matching the original mapping; even though it was untracked
> > here,
> > +        * the original mappings are reflected in IOMMU mappings.  This
> > +        * resulted in a couple unusual behaviors.  First, if a range is
> > not
> > +        * able to be unmapped, ex. a set of 4k pages that was mapped as a
> > +        * 2M hugepage into the IOMMU, the unmap ioctl returns success but
> > with
> > +        * a zero sized unmap.  Also, if an unmap request overlaps the
> > first
> > +        * address of a hugepage, the IOMMU will unmap the entire hugepage.
> > +        * This also returns success and the returned unmap size reflects
> > the
> > +        * actual size unmapped.
> > +        *
> > +        * We attempt to maintain compatibility with this "v1" interface,
> > but
> > +        * we take control out of the hands of the IOMMU.  Therefore, an
> > unmap
> > +        * request offset from the beginning of the original mapping will
> > +        * return success with zero sized unmap.  And an unmap request
> > covering
> > +        * the first iova of mapping will unmap the entire range.
> > +        *
> > +        * The v2 version of this interface intends to be more
> > deterministic.
> > +        * Unmap requests must fully cover previous mappings.  Multiple
> > +        * mappings may still be unmaped by specifying large ranges, but
> > there
> > +        * must not be any previous mappings bisected by the range.  An
> > error
> > +        * will be returned if these conditions are not met.  The v2
> > interface
> > +        * will only return success and a size of zero if there were no
> > +        * mappings within the range.
> > +        */
> > +       if (iommu->v2) {
> > +               dma = vfio_find_dma(iommu, unmap->iova, 0);
> > +               if (dma && dma->iova != unmap->iova) {
> > +                       ret = -EINVAL;
> > +                       goto unlock;
> > +               }
> > +               dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1,
> > 0);
> > +               if (dma && dma->iova + dma->size != unmap->iova +
> > unmap->size) {
> > +                       ret = -EINVAL;
> > +                       goto unlock;
> > +               }
> > +       }
> > +
> >         while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
> > -               size = unmap->size;
> > -               ret = vfio_remove_dma_overlap(iommu, unmap->iova, &size,
> > dma);
> > -               if (ret || !size)
> > +               if (!iommu->v2 && unmap->iova > dma->iova)
> >                         break;
> > -               unmapped += size;
> > +               unmapped += dma->size;
> > +               vfio_remove_dma(iommu, dma);
> >         }
> >
> > +unlock:
> >         mutex_unlock(&iommu->lock);
> >
> > -       /*
> > -        * We may unmap more than requested, update the unmap struct so
> > -        * userspace can know.
> > -        */
> > +       /* Report how much was unmapped */
> >         unmap->size = unmapped;
> >
> >         return ret;
> > @@ -516,22 +476,47 @@ static int vfio_dma_do_unmap(struct vfio_iommu
> > *iommu,
> >   * soon, so this is just a temporary workaround to break mappings down
> > into
> >   * PAGE_SIZE.  Better to map smaller pages than nothing.
> >   */
> > -static int map_try_harder(struct vfio_iommu *iommu, dma_addr_t iova,
> > +static int map_try_harder(struct vfio_domain *domain, dma_addr_t iova,
> >                           unsigned long pfn, long npage, int prot)
> >  {
> >         long i;
> >         int ret;
> >
> >         for (i = 0; i < npage; i++, pfn++, iova += PAGE_SIZE) {
> > -               ret = iommu_map(iommu->domain, iova,
> > +               ret = iommu_map(domain->domain, iova,
> >                                 (phys_addr_t)pfn << PAGE_SHIFT,
> > -                               PAGE_SIZE, prot);
> > +                               PAGE_SIZE, prot | domain->prot);
> >                 if (ret)
> >                         break;
> >         }
> >
> >         for (; i < npage && i > 0; i--, iova -= PAGE_SIZE)
> > -               iommu_unmap(iommu->domain, iova, PAGE_SIZE);
> > +               iommu_unmap(domain->domain, iova, PAGE_SIZE);
> > +
> > +       return ret;
> > +}
> > +
> > +static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
> > +                         unsigned long pfn, long npage, int prot)
> > +{
> > +       struct vfio_domain *d;
> > +       int ret;
> > +
> > +       list_for_each_entry(d, &iommu->domain_list, next) {
> > +               ret = iommu_map(d->domain, iova, (phys_addr_t)pfn <<
> > PAGE_SHIFT,
> > +                               npage << PAGE_SHIFT, prot | d->prot);
> > +               if (ret) {
> > +                       if (ret != -EBUSY ||
> > +                           map_try_harder(d, iova, pfn, npage, prot))
> > +                               goto unwind;
> > +               }
> > +       }
> > +
> > +       return 0;
> > +
> > +unwind:
> > +       list_for_each_entry_continue_reverse(d, &iommu->domain_list, next)
> > +               iommu_unmap(d->domain, iova, npage << PAGE_SHIFT);
> >
> >         return ret;
> >  }
> > @@ -545,12 +530,12 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >         long npage;
> >         int ret = 0, prot = 0;
> >         uint64_t mask;
> > -       struct vfio_dma *dma = NULL;
> > +       struct vfio_dma *dma;
> >         unsigned long pfn;
> >
> >         end = map->iova + map->size;
> >
> > -       mask = ((uint64_t)1 << __ffs(iommu->domain->ops->pgsize_bitmap)) -
> > 1;
> > +       mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >
> >         /* READ/WRITE from device perspective */
> >         if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
> > @@ -561,9 +546,6 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >         if (!prot)
> >                 return -EINVAL; /* No READ/WRITE? */
> >
> > -       if (iommu->cache)
> > -               prot |= IOMMU_CACHE;
> > -
> >         if (vaddr & mask)
> >                 return -EINVAL;
> >         if (map->iova & mask)
> > @@ -588,180 +570,257 @@ static int vfio_dma_do_map(struct vfio_iommu
> > *iommu,
> >                 return -EEXIST;
> >         }
> >
> > -       for (iova = map->iova; iova < end; iova += size, vaddr += size) {
> > -               long i;
> > +       dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> > +       if (!dma) {
> > +               mutex_unlock(&iommu->lock);
> > +               return -ENOMEM;
> > +       }
> > +
> > +       dma->iova = map->iova;
> > +       dma->vaddr = map->vaddr;
> > +       dma->prot = prot;
> >
> > +       /* Insert zero-sized and grow as we map chunks of it */
> > +       vfio_link_dma(iommu, dma);
> > +
> > +       for (iova = map->iova; iova < end; iova += size, vaddr += size) {
> >                 /* Pin a contiguous chunk of memory */
> >                 npage = vfio_pin_pages(vaddr, (end - iova) >> PAGE_SHIFT,
> >                                        prot, &pfn);
> >                 if (npage <= 0) {
> >                         WARN_ON(!npage);
> >                         ret = (int)npage;
> > -                       goto out;
> > -               }
> > -
> > -               /* Verify pages are not already mapped */
> > -               for (i = 0; i < npage; i++) {
> > -                       if (iommu_iova_to_phys(iommu->domain,
> > -                                              iova + (i << PAGE_SHIFT))) {
> > -                               ret = -EBUSY;
> > -                               goto out_unpin;
> > -                       }
> > +                       break;
> >                 }
> >
> > -               ret = iommu_map(iommu->domain, iova,
> > -                               (phys_addr_t)pfn << PAGE_SHIFT,
> > -                               npage << PAGE_SHIFT, prot);
> > +               /* Map it! */
> > +               ret = vfio_iommu_map(iommu, iova, pfn, npage, prot);
> >                 if (ret) {
> > -                       if (ret != -EBUSY ||
> > -                           map_try_harder(iommu, iova, pfn, npage, prot))
> > {
> > -                               goto out_unpin;
> > -                       }
> > +                       vfio_unpin_pages(pfn, npage, prot, true);
> > +                       break;
> >                 }
> >
> >                 size = npage << PAGE_SHIFT;
> > +               dma->size += size;
> > +       }
> >
> > -               /*
> > -                * Check if we abut a region below - nothing below 0.
> > -                * This is the most likely case when mapping chunks of
> > -                * physically contiguous regions within a virtual address
> > -                * range.  Update the abutting entry in place since iova
> > -                * doesn't change.
> > -                */
> > -               if (likely(iova)) {
> > -                       struct vfio_dma *tmp;
> > -                       tmp = vfio_find_dma(iommu, iova - 1, 1);
> > -                       if (tmp && tmp->prot == prot &&
> > -                           tmp->vaddr + tmp->size == vaddr) {
> > -                               tmp->size += size;
> > -                               iova = tmp->iova;
> > -                               size = tmp->size;
> > -                               vaddr = tmp->vaddr;
> > -                               dma = tmp;
> > -                       }
> > -               }
> > +       if (ret)
> > +               vfio_remove_dma(iommu, dma);
> >
> > -               /*
> > -                * Check if we abut a region above - nothing above ~0 + 1.
> > -                * If we abut above and below, remove and free.  If only
> > -                * abut above, remove, modify, reinsert.
> > -                */
> > -               if (likely(iova + size)) {
> > -                       struct vfio_dma *tmp;
> > -                       tmp = vfio_find_dma(iommu, iova + size, 1);
> > -                       if (tmp && tmp->prot == prot &&
> > -                           tmp->vaddr == vaddr + size) {
> > -                               vfio_remove_dma(iommu, tmp);
> > -                               if (dma) {
> > -                                       dma->size += tmp->size;
> > -                                       kfree(tmp);
> > -                               } else {
> > -                                       size += tmp->size;
> > -                                       tmp->size = size;
> > -                                       tmp->iova = iova;
> > -                                       tmp->vaddr = vaddr;
> > -                                       vfio_insert_dma(iommu, tmp);
> > -                                       dma = tmp;
> > -                               }
> > -                       }
> > -               }
> > +       mutex_unlock(&iommu->lock);
> > +       return ret;
> > +}
> > +
> > +static int vfio_bus_type(struct device *dev, void *data)
> > +{
> > +       struct bus_type **bus = data;
> > +
> > +       if (*bus && *bus != dev->bus)
> > +               return -EINVAL;
> > +
> > +       *bus = dev->bus;
> > +
> > +       return 0;
> > +}
> > +
> > +static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > +                            struct vfio_domain *domain)
> > +{
> > +       struct vfio_domain *d;
> > +       struct rb_node *n;
> > +       int ret;
> > +
> > +       /* Arbitrarily pick the first domain in the list for lookups */
> > +       d = list_first_entry(&iommu->domain_list, struct vfio_domain,
> > next);
> > +       n = rb_first(&iommu->dma_list);
> > +
> > +       /* If there's not a domain, there better not be any mappings */
> > +       if (WARN_ON(n && !d))
> > +               return -EINVAL;
> > +
> > +       for (; n; n = rb_next(n)) {
> > +               struct vfio_dma *dma;
> > +               dma_addr_t iova;
> > +
> > +               dma = rb_entry(n, struct vfio_dma, node);
> > +               iova = dma->iova;
> > +
> > +               while (iova < dma->iova + dma->size) {
> > +                       phys_addr_t phys = iommu_iova_to_phys(d->domain,
> > iova);
> > +                       size_t size;
> >
> > -               if (!dma) {
> > -                       dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> > -                       if (!dma) {
> > -                               iommu_unmap(iommu->domain, iova, size);
> > -                               ret = -ENOMEM;
> > -                               goto out_unpin;
> > +                       if (WARN_ON(!phys)) {
> > +                               iova += PAGE_SIZE;
> > +                               continue;
> >                         }
> >
> > -                       dma->size = size;
> > -                       dma->iova = iova;
> > -                       dma->vaddr = vaddr;
> > -                       dma->prot = prot;
> > -                       vfio_insert_dma(iommu, dma);
> > -               }
> > -       }
> > +                       size = PAGE_SIZE;
> >
> > -       WARN_ON(ret);
> > -       mutex_unlock(&iommu->lock);
> > -       return ret;
> > +                       while (iova + size < dma->iova + dma->size &&
> > +                              phys + size == iommu_iova_to_phys(d->domain,
> > +                                                                iova +
> > size))
> > +                               size += PAGE_SIZE;
> >
> > -out_unpin:
> > -       vfio_unpin_pages(pfn, npage, prot, true);
> > +                       ret = iommu_map(domain->domain, iova, phys,
> > +                                       size, dma->prot | domain->prot);
> > +                       if (ret)
> > +                               return ret;
> >
> > -out:
> > -       iova = map->iova;
> > -       size = map->size;
> > -       while ((dma = vfio_find_dma(iommu, iova, size))) {
> > -               int r = vfio_remove_dma_overlap(iommu, iova,
> > -                                               &size, dma);
> > -               if (WARN_ON(r || !size))
> > -                       break;
> > +                       iova += size;
> > +               }
> >         }
> >
> > -       mutex_unlock(&iommu->lock);
> > -       return ret;
> > +       return 0;
> >  }
> >
> >  static int vfio_iommu_type1_attach_group(void *iommu_data,
> >                                          struct iommu_group *iommu_group)
> >  {
> >         struct vfio_iommu *iommu = iommu_data;
> > -       struct vfio_group *group, *tmp;
> > +       struct vfio_group *group, *g;
> > +       struct vfio_domain *domain, *d;
> > +       struct bus_type *bus = NULL;
> >         int ret;
> >
> > -       group = kzalloc(sizeof(*group), GFP_KERNEL);
> > -       if (!group)
> > -               return -ENOMEM;
> > -
> >         mutex_lock(&iommu->lock);
> >
> > -       list_for_each_entry(tmp, &iommu->group_list, next) {
> > -               if (tmp->iommu_group == iommu_group) {
> > +       list_for_each_entry(d, &iommu->domain_list, next) {
> > +               list_for_each_entry(g, &d->group_list, next) {
> > +                       if (g->iommu_group != iommu_group)
> > +                               continue;
> > +
> >                         mutex_unlock(&iommu->lock);
> > -                       kfree(group);
> >                         return -EINVAL;
> >                 }
> >         }
> >
> > +       group = kzalloc(sizeof(*group), GFP_KERNEL);
> > +       domain = kzalloc(sizeof(*domain), GFP_KERNEL);
> > +       if (!group || !domain) {
> > +               ret = -ENOMEM;
> > +               goto out_free;
> > +       }
> > +
> > +       group->iommu_group = iommu_group;
> > +
> > +       /* Determine bus_type in order to allocate a domain */
> > +       ret = iommu_group_for_each_dev(iommu_group, &bus, vfio_bus_type);
> > +       if (ret)
> > +               goto out_free;
> > +
> > +       domain->domain = iommu_domain_alloc(bus);
> > +       if (!domain->domain) {
> > +               ret = -EIO;
> > +               goto out_free;
> > +       }
> > +
> > +       ret = iommu_attach_group(domain->domain, iommu_group);
> > +       if (ret)
> > +               goto out_domain;
> > +
> > +       INIT_LIST_HEAD(&domain->group_list);
> > +       list_add(&group->next, &domain->group_list);
> > +
> > +       if (!allow_unsafe_interrupts &&
> > +           !iommu_domain_has_cap(domain->domain, IOMMU_CAP_INTR_REMAP)) {
> > +               pr_warn("%s: No interrupt remapping support.  Use the
> > module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on
> > this platform\n",
> > +                      __func__);
> > +               ret = -EPERM;
> > +               goto out_detach;
> > +       }
> > +
> > +       if (iommu_domain_has_cap(domain->domain,
> > IOMMU_CAP_CACHE_COHERENCY))
> > +               domain->prot |= IOMMU_CACHE;
> > +
> >         /*
> > -        * TODO: Domain have capabilities that might change as we add
> > -        * groups (see iommu->cache, currently never set).  Check for
> > -        * them and potentially disallow groups to be attached when it
> > -        * would change capabilities (ugh).
> > +        * Try to match an existing compatible domain.  We don't want to
> > +        * preclude an IOMMU driver supporting multiple bus_types and being
> > +        * able to include different bus_types in the same IOMMU domain, so
> > +        * we test whether the domains use the same iommu_ops rather than
> > +        * testing if they're on the same bus_type.
> >          */
> > -       ret = iommu_attach_group(iommu->domain, iommu_group);
> > -       if (ret) {
> > -               mutex_unlock(&iommu->lock);
> > -               kfree(group);
> > -               return ret;
> > +       list_for_each_entry(d, &iommu->domain_list, next) {
> > +               if (d->domain->ops == domain->domain->ops &&
> > +                   d->prot == domain->prot) {
> > +                       iommu_detach_group(domain->domain, iommu_group);
> > +                       if (!iommu_attach_group(d->domain, iommu_group)) {
> > +                               list_add(&group->next, &d->group_list);
> > +                               iommu_domain_free(domain->domain);
> > +                               kfree(domain);
> > +                               mutex_unlock(&iommu->lock);
> > +                               return 0;
> > +                       }
> > +
> > +                       ret = iommu_attach_group(domain->domain, iommu_group);
> > +                       if (ret)
> > +                               goto out_domain;
> > +               }
> >         }
> >
> > -       group->iommu_group = iommu_group;
> > -       list_add(&group->next, &iommu->group_list);
> > +       /* replay mappings on new domains */
> > +       ret = vfio_iommu_replay(iommu, domain);
> > +       if (ret)
> > +               goto out_detach;
> > +
> > +       list_add(&domain->next, &iommu->domain_list);
> >
> >         mutex_unlock(&iommu->lock);
> >
> >         return 0;
> > +
> > +out_detach:
> > +       iommu_detach_group(domain->domain, iommu_group);
> > +out_domain:
> > +       iommu_domain_free(domain->domain);
> > +out_free:
> > +       kfree(domain);
> > +       kfree(group);
> > +       mutex_unlock(&iommu->lock);
> > +       return ret;
> > +}
> > +
> > +static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
> > +{
> > +       struct rb_node *node;
> > +
> > +       while ((node = rb_first(&iommu->dma_list)))
> > +               vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
> >  }
> >
> >  static void vfio_iommu_type1_detach_group(void *iommu_data,
> >                                           struct iommu_group *iommu_group)
> >  {
> >         struct vfio_iommu *iommu = iommu_data;
> > +       struct vfio_domain *domain;
> >         struct vfio_group *group;
> >
> >         mutex_lock(&iommu->lock);
> >
> > -       list_for_each_entry(group, &iommu->group_list, next) {
> > -               if (group->iommu_group == iommu_group) {
> > -                       iommu_detach_group(iommu->domain, iommu_group);
> > +       list_for_each_entry(domain, &iommu->domain_list, next) {
> > +               list_for_each_entry(group, &domain->group_list, next) {
> > +                       if (group->iommu_group != iommu_group)
> > +                               continue;
> > +
> > +                       iommu_detach_group(domain->domain, iommu_group);
> >                         list_del(&group->next);
> >                         kfree(group);
> > -                       break;
> > +                       /*
> > +                        * Group ownership provides privilege, if the group
> > +                        * list is empty, the domain goes away.  If it's the
> > +                        * last domain, then all the mappings go away too.
> > +                        */
> > +                       if (list_empty(&domain->group_list)) {
> > +                               if (list_is_singular(&iommu->domain_list))
> > +                                       vfio_iommu_unmap_unpin_all(iommu);
> > +                               iommu_domain_free(domain->domain);
> > +                               list_del(&domain->next);
> > +                               kfree(domain);
> > +                       }
> > +                       goto done;
> >                 }
> >         }
> >
> > +done:
> >         mutex_unlock(&iommu->lock);
> >  }
> >
> > @@ -769,40 +828,17 @@ static void *vfio_iommu_type1_open(unsigned long arg)
> >  {
> >         struct vfio_iommu *iommu;
> >
> > -       if (arg != VFIO_TYPE1_IOMMU)
> > +       if (arg != VFIO_TYPE1_IOMMU && arg != VFIO_TYPE1v2_IOMMU)
> >                 return ERR_PTR(-EINVAL);
> >
> >         iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> >         if (!iommu)
> >                 return ERR_PTR(-ENOMEM);
> >
> > -       INIT_LIST_HEAD(&iommu->group_list);
> > +       INIT_LIST_HEAD(&iommu->domain_list);
> >         iommu->dma_list = RB_ROOT;
> >         mutex_init(&iommu->lock);
> > -
> > -       /*
> > -        * Wish we didn't have to know about bus_type here.
> > -        */
> > -       iommu->domain = iommu_domain_alloc(&pci_bus_type);
> > -       if (!iommu->domain) {
> > -               kfree(iommu);
> > -               return ERR_PTR(-EIO);
> > -       }
> > -
> > -       /*
> > -        * Wish we could specify required capabilities rather than create
> > -        * a domain, see what comes out and hope it doesn't change along
> > -        * the way.  Fortunately we know interrupt remapping is global for
> > -        * our iommus.
> > -        */
> > -       if (!allow_unsafe_interrupts &&
> > -           !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> > -               pr_warn("%s: No interrupt remapping support.  Use the
> > module param \"allow_unsafe_interrupts\" to enable VFIO IOMMU support on
> > this platform\n",
> > -                      __func__);
> > -               iommu_domain_free(iommu->domain);
> > -               kfree(iommu);
> > -               return ERR_PTR(-EPERM);
> > -       }
> > +       iommu->v2 = (arg == VFIO_TYPE1v2_IOMMU);
> >
> >         return iommu;
> >  }
> > @@ -810,25 +846,24 @@ static void *vfio_iommu_type1_open(unsigned long arg)
> >  static void vfio_iommu_type1_release(void *iommu_data)
> >  {
> >         struct vfio_iommu *iommu = iommu_data;
> > +       struct vfio_domain *domain, *domain_tmp;
> >         struct vfio_group *group, *group_tmp;
> > -       struct rb_node *node;
> >
> > -       list_for_each_entry_safe(group, group_tmp, &iommu->group_list, next) {
> > -               iommu_detach_group(iommu->domain, group->iommu_group);
> > -               list_del(&group->next);
> > -               kfree(group);
> > -       }
> > +       vfio_iommu_unmap_unpin_all(iommu);
> >
> > -       while ((node = rb_first(&iommu->dma_list))) {
> > -               struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> > -               size_t size = dma->size;
> > -               vfio_remove_dma_overlap(iommu, dma->iova, &size, dma);
> > -               if (WARN_ON(!size))
> > -                       break;
> > +       list_for_each_entry_safe(domain, domain_tmp,
> > +                                &iommu->domain_list, next) {
> > +               list_for_each_entry_safe(group, group_tmp,
> > +                                        &domain->group_list, next) {
> > +                       iommu_detach_group(domain->domain, group->iommu_group);
> > +                       list_del(&group->next);
> > +                       kfree(group);
> > +               }
> > +               iommu_domain_free(domain->domain);
> > +               list_del(&domain->next);
> > +               kfree(domain);
> >         }
> >
> > -       iommu_domain_free(iommu->domain);
> > -       iommu->domain = NULL;
> >         kfree(iommu);
> >  }
> >
> > @@ -841,6 +876,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >         if (cmd == VFIO_CHECK_EXTENSION) {
> >                 switch (arg) {
> >                 case VFIO_TYPE1_IOMMU:
> > +               case VFIO_TYPE1v2_IOMMU:
> >                         return 1;
> >                 default:
> >                         return 0;
> > @@ -858,7 +894,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> >
> >                 info.flags = 0;
> >
> > -               info.iova_pgsizes = iommu->domain->ops->pgsize_bitmap;
> > +               info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
> >
> >                 return copy_to_user((void __user *)arg, &info, minsz);
> >
> > @@ -911,9 +947,6 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
> >
> >  static int __init vfio_iommu_type1_init(void)
> >  {
> > -       if (!iommu_present(&pci_bus_type))
> > -               return -ENODEV;
> > -
> >         return vfio_register_iommu_driver(&vfio_iommu_driver_ops_type1);
> >  }
> >
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 0fd47f5..460fdf2 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -23,6 +23,7 @@
> >
> >  #define VFIO_TYPE1_IOMMU               1
> >  #define VFIO_SPAPR_TCE_IOMMU           2
> > +#define VFIO_TYPE1v2_IOMMU             3
> >
> >  /*
> >   * The IOCTL interface is designed for extensibility by embedding the
> >
> > --
> >

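For readers skimming the hunks above, this is the container layout the new attach and detach paths walk: one vfio_iommu per container, a list of vfio_domain entries hanging off it, and a list of vfio_group entries per domain.  The sketch below is reconstructed only from how the fields are used in this hunk; member order, and anything not referenced here, is a guess rather than the patch's literal definitions.

#include <linux/iommu.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/rbtree.h>

struct vfio_iommu {				/* one per type1 container */
	struct list_head	domain_list;	/* vfio_domain entries */
	struct rb_root		dma_list;	/* user mappings, replayed into each new domain */
	struct mutex		lock;
	bool			v2;		/* opened as VFIO_TYPE1v2_IOMMU */
};

struct vfio_domain {				/* one per distinct iommu_ops/prot combination */
	struct iommu_domain	*domain;	/* backing IOMMU API domain */
	struct list_head	next;		/* link in vfio_iommu::domain_list */
	struct list_head	group_list;	/* vfio_group entries attached here */
	int			prot;		/* IOMMU_CACHE when the hardware is coherent */
};

struct vfio_group {
	struct iommu_group	*iommu_group;
	struct list_head	next;		/* link in vfio_domain::group_list */
};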

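The "match an existing compatible domain" step in the attach path compares iommu_ops and the computed prot rather than bus_type, so a single IOMMU driver could in principle span several bus_types within one domain.  Distilled as a predicate over the layout sketched above (vfio_domains_compatible() is a name invented for this sketch, not a helper in the patch):

static bool vfio_domains_compatible(struct vfio_domain *a,
				    struct vfio_domain *b)
{
	/* Same IOMMU driver: compare iommu_ops, not bus_type. */
	if (a->domain->ops != b->domain->ops)
		return false;

	/* Same mapping attributes: both coherent (IOMMU_CACHE) or neither. */
	return a->prot == b->prot;
}

Only when both tests pass does the attach path move the new group into the existing vfio_domain and free the freshly allocated one; otherwise the new domain is kept, has the existing mappings replayed into it, and is added to domain_list.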

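The ioctl hunks above accept both VFIO_TYPE1_IOMMU and VFIO_TYPE1v2_IOMMU, so userspace can probe for the v2 type with VFIO_CHECK_EXTENSION and fall back on older kernels.  A minimal sketch follows; set_type1_iommu() is an invented helper, error handling is trimmed, and it assumes the caller has already opened /dev/vfio/vfio and added at least one group to the container (VFIO_SET_IOMMU fails otherwise).

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int set_type1_iommu(int container)
{
	unsigned long type = VFIO_TYPE1v2_IOMMU;

	/* VFIO_CHECK_EXTENSION returns 1 when the requested type is available. */
	if (ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1v2_IOMMU) != 1)
		type = VFIO_TYPE1_IOMMU;	/* pre-v2 kernel: keep v1 unmap semantics */

	return ioctl(container, VFIO_SET_IOMMU, type);
}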

