kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH V2 0/9] vfio virtual address update
@ 2021-01-19 17:48 Steve Sistare
  2021-01-19 17:48 ` [PATCH V2 1/9] vfio: option to unmap all Steve Sistare
                   ` (9 more replies)
  0 siblings, 10 replies; 22+ messages in thread
From: Steve Sistare @ 2021-01-19 17:48 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Add interfaces that allow the underlying memory object of an iova range
to be mapped to a new virtual address in the host process:

  - VFIO_DMA_UNMAP_FLAG_VADDR for VFIO_IOMMU_UNMAP_DMA
  - VFIO_DMA_MAP_FLAG_VADDR flag for VFIO_IOMMU_MAP_DMA
  - VFIO_UPDATE_VADDR for VFIO_CHECK_EXTENSION
  - VFIO_DMA_UNMAP_FLAG_ALL for VFIO_IOMMU_UNMAP_DMA
  - VFIO_UNMAP_ALL for VFIO_CHECK_EXTENSION

Unmap-vaddr invalidates the host virtual address in an iova range and blocks
vfio translation of host virtual addresses, but DMA to already-mapped pages
continues.  Map-vaddr updates the base VA and resumes translation.  The
implementation supports iommu type1 and mediated devices.  Unmap-all allows
all ranges to be unmapped or invalidated in a single ioctl, which simplifies
userland code.

This functionality is necessary for live update, in which a host process
such as qemu exec's an updated version of itself, while preserving its
guest and vfio devices.  The process blocks vfio VA translation, exec's
its new self, mmap's the memory object(s) underlying vfio object, updates
the VA, and unblocks translation.  For a working example that uses these
new interfaces, see the QEMU patch series "[PATCH V2] Live Update" at
https://lore.kernel.org/qemu-devel/1609861330-129855-1-git-send-email-steven.sistare@oracle.com

Patches 1-4 define and implement the flag to unmap all ranges.
Patches 5-6 define and implement the flags to update vaddr.
Patches 7-9 add blocking to complete the implementation.

Changes from V1:
 - define a flag for unmap all instead of special range values
 - define the VFIO_UNMAP_ALL extension
 - forbid the combination of unmap-all and get-dirty-bitmap
 - unwind in unmap on vaddr error
 - add a new function to find first dma in a range instead of modifying
   an existing function
 - change names of update flags
 - fix concurrency bugs due to iommu lock being dropped
 - call down from from vfio to a new backend interface instead of up from
   driver to detect container close
 - use wait/wake instead of sleep and polling
 - refine the uapi specification
 - split patches into vfio vs type1

Steve Sistare (9):
  vfio: option to unmap all
  vfio/type1: find first dma
  vfio/type1: unmap cleanup
  vfio/type1: implement unmap all
  vfio: interfaces to update vaddr
  vfio/type1: implement interfaces to update vaddr
  vfio: iommu driver notify callback
  vfio/type1: implement notify callback
  vfio/type1: block on invalid vaddr

 drivers/vfio/vfio.c             |   5 +
 drivers/vfio/vfio_iommu_type1.c | 229 ++++++++++++++++++++++++++++++++++------
 include/linux/vfio.h            |   5 +
 include/uapi/linux/vfio.h       |  27 +++++
 4 files changed, 231 insertions(+), 35 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V2 1/9] vfio: option to unmap all
  2021-01-19 17:48 [PATCH V2 0/9] vfio virtual address update Steve Sistare
@ 2021-01-19 17:48 ` Steve Sistare
  2021-01-19 17:48 ` [PATCH V2 2/9] vfio/type1: find first dma Steve Sistare
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Steve Sistare @ 2021-01-19 17:48 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

For the UNMAP_DMA ioctl, delete all mappings if VFIO_DMA_UNMAP_FLAG_ALL
is set.  Define the corresponding VFIO_UNMAP_ALL extension.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/uapi/linux/vfio.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9204705..55578b6 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -46,6 +46,9 @@
  */
 #define VFIO_NOIOMMU_IOMMU		8
 
+/* Supports VFIO_DMA_UNMAP_FLAG_ALL */
+#define VFIO_UNMAP_ALL			9
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
@@ -1074,6 +1077,7 @@ struct vfio_bitmap {
  * field.  No guarantee is made to the user that arbitrary unmaps of iova
  * or size different from those used in the original mapping call will
  * succeed.
+ *
  * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap
  * before unmapping IO virtual addresses. When this flag is set, the user must
  * provide a struct vfio_bitmap in data[]. User must provide zero-allocated
@@ -1083,11 +1087,15 @@ struct vfio_bitmap {
  * indicates that the page at that offset from iova is dirty. A Bitmap of the
  * pages in the range of unmapped size is returned in the user-provided
  * vfio_bitmap.data.
+ *
+ * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
+ * must be 0.  This may not be combined with the get-dirty-bitmap flag.
  */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
+#define VFIO_DMA_UNMAP_FLAG_ALL		     (1 << 1)
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
 	__u8    data[];
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V2 2/9] vfio/type1: find first dma
  2021-01-19 17:48 [PATCH V2 0/9] vfio virtual address update Steve Sistare
  2021-01-19 17:48 ` [PATCH V2 1/9] vfio: option to unmap all Steve Sistare
@ 2021-01-19 17:48 ` Steve Sistare
  2021-01-19 17:48 ` [PATCH V2 3/9] vfio/type1: unmap cleanup Steve Sistare
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Steve Sistare @ 2021-01-19 17:48 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Add a variant of vfio_find_dma that returns the entry with the lowest iova
in the search range.  This is needed later for visiting all entries without
requiring one to delete each visited entry as in vfio_dma_do_unmap.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5fbf0c1..a2c2b96 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -173,6 +173,29 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
 	return NULL;
 }
 
+static struct vfio_dma *vfio_find_dma_first(struct vfio_iommu *iommu,
+					    dma_addr_t start, size_t size)
+{
+	struct vfio_dma *res = NULL;
+	struct rb_node *node = iommu->dma_list.rb_node;
+
+	while (node) {
+		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
+
+		if (start < dma->iova + dma->size) {
+			res = dma;
+			if (start >= dma->iova)
+				break;
+			node = node->rb_left;
+		} else {
+			node = node->rb_right;
+		}
+	}
+	if (res && size && res->iova >= start + size)
+		res = NULL;
+	return res;
+}
+
 static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
 {
 	struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V2 3/9] vfio/type1: unmap cleanup
  2021-01-19 17:48 [PATCH V2 0/9] vfio virtual address update Steve Sistare
  2021-01-19 17:48 ` [PATCH V2 1/9] vfio: option to unmap all Steve Sistare
  2021-01-19 17:48 ` [PATCH V2 2/9] vfio/type1: find first dma Steve Sistare
@ 2021-01-19 17:48 ` Steve Sistare
  2021-01-19 17:48 ` [PATCH V2 4/9] vfio/type1: implement unmap all Steve Sistare
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Steve Sistare @ 2021-01-19 17:48 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Minor changes in vfio_dma_do_unmap to improve readability, which also
simplify the subsequent unmap-all patch.  No functional change.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 38 +++++++++++++++-----------------------
 1 file changed, 15 insertions(+), 23 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a2c2b96..c687174 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1096,34 +1096,28 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 {
 	struct vfio_dma *dma, *dma_last = NULL;
 	size_t unmapped = 0, pgsize;
-	int ret = 0, retries = 0;
+	int ret = -EINVAL, retries = 0;
 	unsigned long pgshift;
+	dma_addr_t iova = unmap->iova;
+	unsigned long size = unmap->size;
 
 	mutex_lock(&iommu->lock);
 
 	pgshift = __ffs(iommu->pgsize_bitmap);
 	pgsize = (size_t)1 << pgshift;
 
-	if (unmap->iova & (pgsize - 1)) {
-		ret = -EINVAL;
+	if (iova & (pgsize - 1))
 		goto unlock;
-	}
 
-	if (!unmap->size || unmap->size & (pgsize - 1)) {
-		ret = -EINVAL;
+	if (!size || size & (pgsize - 1))
 		goto unlock;
-	}
 
-	if (unmap->iova + unmap->size - 1 < unmap->iova ||
-	    unmap->size > SIZE_MAX) {
-		ret = -EINVAL;
+	if (iova + size - 1 < iova || size > SIZE_MAX)
 		goto unlock;
-	}
 
 	/* When dirty tracking is enabled, allow only min supported pgsize */
 	if ((unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
 	    (!iommu->dirty_page_tracking || (bitmap->pgsize != pgsize))) {
-		ret = -EINVAL;
 		goto unlock;
 	}
 
@@ -1161,20 +1155,18 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	 * mappings within the range.
 	 */
 	if (iommu->v2) {
-		dma = vfio_find_dma(iommu, unmap->iova, 1);
-		if (dma && dma->iova != unmap->iova) {
-			ret = -EINVAL;
+		dma = vfio_find_dma(iommu, iova, 1);
+		if (dma && dma->iova != iova)
 			goto unlock;
-		}
-		dma = vfio_find_dma(iommu, unmap->iova + unmap->size - 1, 0);
-		if (dma && dma->iova + dma->size != unmap->iova + unmap->size) {
-			ret = -EINVAL;
+
+		dma = vfio_find_dma(iommu, iova + size - 1, 0);
+		if (dma && dma->iova + dma->size != iova + size)
 			goto unlock;
-		}
 	}
 
-	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
-		if (!iommu->v2 && unmap->iova > dma->iova)
+	ret = 0;
+	while ((dma = vfio_find_dma(iommu, iova, size))) {
+		if (!iommu->v2 && iova > dma->iova)
 			break;
 		/*
 		 * Task with same address space who mapped this iova range is
@@ -1212,7 +1204,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 
 		if (unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
 			ret = update_user_bitmap(bitmap->data, iommu, dma,
-						 unmap->iova, pgsize);
+						 iova, pgsize);
 			if (ret)
 				break;
 		}
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V2 4/9] vfio/type1: implement unmap all
  2021-01-19 17:48 [PATCH V2 0/9] vfio virtual address update Steve Sistare
                   ` (2 preceding siblings ...)
  2021-01-19 17:48 ` [PATCH V2 3/9] vfio/type1: unmap cleanup Steve Sistare
@ 2021-01-19 17:48 ` Steve Sistare
  2021-01-22 21:22   ` Alex Williamson
  2021-01-19 17:48 ` [PATCH V2 5/9] vfio: interfaces to update vaddr Steve Sistare
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Steve Sistare @ 2021-01-19 17:48 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Implement VFIO_DMA_UNMAP_FLAG_ALL.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c687174..ef83018 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1100,6 +1100,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	unsigned long pgshift;
 	dma_addr_t iova = unmap->iova;
 	unsigned long size = unmap->size;
+	bool unmap_all = !!(unmap->flags & VFIO_DMA_UNMAP_FLAG_ALL);
 
 	mutex_lock(&iommu->lock);
 
@@ -1109,8 +1110,13 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	if (iova & (pgsize - 1))
 		goto unlock;
 
-	if (!size || size & (pgsize - 1))
+	if (unmap_all) {
+		if (iova || size)
+			goto unlock;
+		size = SIZE_MAX;
+	} else if (!size || size & (pgsize - 1)) {
 		goto unlock;
+	}
 
 	if (iova + size - 1 < iova || size > SIZE_MAX)
 		goto unlock;
@@ -1154,7 +1160,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	 * will only return success and a size of zero if there were no
 	 * mappings within the range.
 	 */
-	if (iommu->v2) {
+	if (iommu->v2 && !unmap_all) {
 		dma = vfio_find_dma(iommu, iova, 1);
 		if (dma && dma->iova != iova)
 			goto unlock;
@@ -1165,7 +1171,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	}
 
 	ret = 0;
-	while ((dma = vfio_find_dma(iommu, iova, size))) {
+	while ((dma = vfio_find_dma_first(iommu, iova, size))) {
 		if (!iommu->v2 && iova > dma->iova)
 			break;
 		/*
@@ -2538,6 +2544,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 	case VFIO_TYPE1_IOMMU:
 	case VFIO_TYPE1v2_IOMMU:
 	case VFIO_TYPE1_NESTING_IOMMU:
+	case VFIO_UNMAP_ALL:
 		return 1;
 	case VFIO_DMA_CC_IOMMU:
 		if (!iommu)
@@ -2710,6 +2717,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
 {
 	struct vfio_iommu_type1_dma_unmap unmap;
 	struct vfio_bitmap bitmap = { 0 };
+	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
+			VFIO_DMA_UNMAP_FLAG_ALL;
 	unsigned long minsz;
 	int ret;
 
@@ -2718,8 +2727,11 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
 	if (copy_from_user(&unmap, (void __user *)arg, minsz))
 		return -EFAULT;
 
-	if (unmap.argsz < minsz ||
-	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
+	if (unmap.argsz < minsz || unmap.flags & ~mask)
+		return -EINVAL;
+
+	if ((unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
+	    (unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP))
 		return -EINVAL;
 
 	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V2 5/9] vfio: interfaces to update vaddr
  2021-01-19 17:48 [PATCH V2 0/9] vfio virtual address update Steve Sistare
                   ` (3 preceding siblings ...)
  2021-01-19 17:48 ` [PATCH V2 4/9] vfio/type1: implement unmap all Steve Sistare
@ 2021-01-19 17:48 ` Steve Sistare
  2021-01-19 17:48 ` [PATCH V2 6/9] vfio/type1: implement " Steve Sistare
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Steve Sistare @ 2021-01-19 17:48 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Define interfaces that allow the underlying memory object of an iova
range to be mapped to a new host virtual address in the host process:

  - VFIO_DMA_UNMAP_FLAG_VADDR for VFIO_IOMMU_UNMAP_DMA
  - VFIO_DMA_MAP_FLAG_VADDR flag for VFIO_IOMMU_MAP_DMA
  - VFIO_UPDATE_VADDR extension for VFIO_CHECK_EXTENSION

Unmap vaddr invalidates the host virtual address in an iova range, and
blocks vfio translation of host virtual addresses.  DMA to already-mapped
pages continues.  Map vaddr updates the base VA and resumes translation.
See comments in uapi/linux/vfio.h for more details.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 include/uapi/linux/vfio.h | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 55578b6..85bb89a 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -49,6 +49,9 @@
 /* Supports VFIO_DMA_UNMAP_FLAG_ALL */
 #define VFIO_UNMAP_ALL			9
 
+/* Supports the vaddr flag for DMA map and unmap */
+#define VFIO_UPDATE_VADDR		10
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
@@ -1049,12 +1052,22 @@ struct vfio_iommu_type1_info_cap_migration {
  *
  * Map process virtual addresses to IO virtual addresses using the
  * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * If flags & VFIO_DMA_MAP_FLAG_VADDR, update the base vaddr for iova, and
+ * unblock translation of host virtual addresses in the iova range.  The vaddr
+ * must have previously been invalidated with VFIO_DMA_UNMAP_FLAG_VADDR.  To
+ * maintain memory consistency within the user application, the updated vaddr
+ * must address the same memory object as originally mapped.  Failure to do so
+ * will result in user memory corruption and/or device misbehavior.  iova and
+ * size must match those in the original MAP_DMA call.  Protection is not
+ * changed, and the READ & WRITE flags must be 0.
  */
 struct vfio_iommu_type1_dma_map {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
 #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
+#define VFIO_DMA_MAP_FLAG_VADDR (1 << 2)
 	__u64	vaddr;				/* Process virtual address */
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
@@ -1090,12 +1103,18 @@ struct vfio_bitmap {
  *
  * If flags & VFIO_DMA_UNMAP_FLAG_ALL, unmap all addresses.  iova and size
  * must be 0.  This may not be combined with the get-dirty-bitmap flag.
+ *
+ * If flags & VFIO_DMA_UNMAP_FLAG_VADDR, do not unmap, but invalidate host
+ * virtual addresses in the iova range.  Tasks that attempt to translate an
+ * iova's vaddr will block.  DMA to already-mapped pages continues.  This may
+ * not be combined with the get-dirty-bitmap flag.
  */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
 #define VFIO_DMA_UNMAP_FLAG_ALL		     (1 << 1)
+#define VFIO_DMA_UNMAP_FLAG_VADDR	     (1 << 2)
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
 	__u8    data[];
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V2 6/9] vfio/type1: implement interfaces to update vaddr
  2021-01-19 17:48 [PATCH V2 0/9] vfio virtual address update Steve Sistare
                   ` (4 preceding siblings ...)
  2021-01-19 17:48 ` [PATCH V2 5/9] vfio: interfaces to update vaddr Steve Sistare
@ 2021-01-19 17:48 ` Steve Sistare
  2021-01-22 21:48   ` Alex Williamson
  2021-01-19 17:48 ` [PATCH V2 7/9] vfio: iommu driver notify callback Steve Sistare
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 22+ messages in thread
From: Steve Sistare @ 2021-01-19 17:48 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Implement VFIO_DMA_UNMAP_FLAG_VADDR, VFIO_DMA_MAP_FLAG_VADDR, and
VFIO_UPDATE_VADDR.  This is a partial implementation.  Blocking is
added in the next patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 54 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 51 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ef83018..c307f62 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -92,6 +92,7 @@ struct vfio_dma {
 	int			prot;		/* IOMMU_READ/WRITE */
 	bool			iommu_mapped;
 	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
+	bool			vaddr_valid;
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
@@ -1101,6 +1102,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	dma_addr_t iova = unmap->iova;
 	unsigned long size = unmap->size;
 	bool unmap_all = !!(unmap->flags & VFIO_DMA_UNMAP_FLAG_ALL);
+	bool invalidate = !!(unmap->flags & VFIO_DMA_UNMAP_FLAG_VADDR);
 
 	mutex_lock(&iommu->lock);
 
@@ -1181,6 +1183,18 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 		if (dma->task->mm != current->mm)
 			break;
 
+		if (invalidate) {
+			if (!dma->vaddr_valid)
+				goto unwind;
+			dma->vaddr_valid = false;
+			unmapped += dma->size;
+			size -= dma->iova + dma->size - iova;
+			iova = dma->iova + dma->size;
+			if (iova == 0)
+				break;
+			continue;
+		}
+
 		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
 			struct vfio_iommu_type1_dma_unmap nb_unmap;
 
@@ -1218,6 +1232,20 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 		unmapped += dma->size;
 		vfio_remove_dma(iommu, dma);
 	}
+	goto unlock;
+
+unwind:
+	iova = unmap->iova;
+	size = unmap_all ? SIZE_MAX : unmap->size;
+	dma_last = dma;
+	while ((dma = vfio_find_dma_first(iommu, iova, size)) &&
+	       dma != dma_last) {
+		dma->vaddr_valid = true;
+		size -= dma->iova + dma->size - iova;
+		iova = dma->iova + dma->size;
+	}
+	ret = -EINVAL;
+	unmapped = 0;
 
 unlock:
 	mutex_unlock(&iommu->lock);
@@ -1319,6 +1347,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
 static int vfio_dma_do_map(struct vfio_iommu *iommu,
 			   struct vfio_iommu_type1_dma_map *map)
 {
+	bool update = map->flags & VFIO_DMA_MAP_FLAG_VADDR;
 	dma_addr_t iova = map->iova;
 	unsigned long vaddr = map->vaddr;
 	size_t size = map->size;
@@ -1336,13 +1365,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
 		prot |= IOMMU_READ;
 
+	if ((prot && update) || (!prot && !update))
+		return -EINVAL;
+
 	mutex_lock(&iommu->lock);
 
 	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
 
 	WARN_ON((pgsize - 1) & PAGE_MASK);
 
-	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
+	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
 		ret = -EINVAL;
 		goto out_unlock;
 	}
@@ -1353,7 +1385,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		goto out_unlock;
 	}
 
-	if (vfio_find_dma(iommu, iova, size)) {
+	dma = vfio_find_dma(iommu, iova, size);
+	if (update) {
+		if (!dma) {
+			ret = -ENOENT;
+		} else if (dma->vaddr_valid || dma->iova != iova ||
+			   dma->size != size) {
+			ret = -EINVAL;
+		} else {
+			dma->vaddr = vaddr;
+			dma->vaddr_valid = true;
+		}
+		goto out_unlock;
+	} else if (dma) {
 		ret = -EEXIST;
 		goto out_unlock;
 	}
@@ -1377,6 +1421,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	iommu->dma_avail--;
 	dma->iova = iova;
 	dma->vaddr = vaddr;
+	dma->vaddr_valid = true;
 	dma->prot = prot;
 
 	/*
@@ -2545,6 +2590,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 	case VFIO_TYPE1v2_IOMMU:
 	case VFIO_TYPE1_NESTING_IOMMU:
 	case VFIO_UNMAP_ALL:
+	case VFIO_UPDATE_VADDR:
 		return 1;
 	case VFIO_DMA_CC_IOMMU:
 		if (!iommu)
@@ -2699,7 +2745,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
 {
 	struct vfio_iommu_type1_dma_map map;
 	unsigned long minsz;
-	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
+			VFIO_DMA_MAP_FLAG_VADDR;
 
 	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
 
@@ -2718,6 +2765,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
 	struct vfio_iommu_type1_dma_unmap unmap;
 	struct vfio_bitmap bitmap = { 0 };
 	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
+			VFIO_DMA_UNMAP_FLAG_VADDR |
 			VFIO_DMA_UNMAP_FLAG_ALL;
 	unsigned long minsz;
 	int ret;
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V2 7/9] vfio: iommu driver notify callback
  2021-01-19 17:48 [PATCH V2 0/9] vfio virtual address update Steve Sistare
                   ` (5 preceding siblings ...)
  2021-01-19 17:48 ` [PATCH V2 6/9] vfio/type1: implement " Steve Sistare
@ 2021-01-19 17:48 ` Steve Sistare
  2021-01-19 17:48 ` [PATCH V2 8/9] vfio/type1: implement " Steve Sistare
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 22+ messages in thread
From: Steve Sistare @ 2021-01-19 17:48 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Define a vfio_iommu_driver_ops notify callback, for sending events to
the driver.  Drivers are not required to provide the callback, and
may ignore any events.  The handling of events is driver specific.

Define the CONTAINER_CLOSE event, called when the container's file
descriptor is closed.  This event signifies that no further state changes
will occur via container ioctl's.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio.c  | 5 +++++
 include/linux/vfio.h | 5 +++++
 2 files changed, 10 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 262ab0e..99458fc 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1220,6 +1220,11 @@ static int vfio_fops_open(struct inode *inode, struct file *filep)
 static int vfio_fops_release(struct inode *inode, struct file *filep)
 {
 	struct vfio_container *container = filep->private_data;
+	struct vfio_iommu_driver *driver = container->iommu_driver;
+
+	if (driver && driver->ops->notify)
+		driver->ops->notify(container->iommu_data,
+				    VFIO_DRIVER_NOTIFY_CONTAINER_CLOSE, NULL);
 
 	filep->private_data = NULL;
 
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 38d3c6a..9642579 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -57,6 +57,9 @@ extern int vfio_add_group_dev(struct device *dev,
 extern void vfio_device_put(struct vfio_device *device);
 extern void *vfio_device_data(struct vfio_device *device);
 
+/* events for the backend driver notify callback */
+#define VFIO_DRIVER_NOTIFY_CONTAINER_CLOSE	1
+
 /**
  * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
  */
@@ -90,6 +93,8 @@ struct vfio_iommu_driver_ops {
 					       struct notifier_block *nb);
 	int		(*dma_rw)(void *iommu_data, dma_addr_t user_iova,
 				  void *data, size_t count, bool write);
+	void		(*notify)(void *iommu_data, unsigned int event,
+				  void *data);
 };
 
 extern int vfio_register_iommu_driver(const struct vfio_iommu_driver_ops *ops);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V2 8/9] vfio/type1: implement notify callback
  2021-01-19 17:48 [PATCH V2 0/9] vfio virtual address update Steve Sistare
                   ` (6 preceding siblings ...)
  2021-01-19 17:48 ` [PATCH V2 7/9] vfio: iommu driver notify callback Steve Sistare
@ 2021-01-19 17:48 ` Steve Sistare
  2021-01-22 23:00   ` Alex Williamson
  2021-01-19 17:48 ` [PATCH V2 9/9] vfio/type1: block on invalid vaddr Steve Sistare
       [not found] ` <55725169-de0d-4019-f96c-ded20dfde0d7@oracle.com>
  9 siblings, 1 reply; 22+ messages in thread
From: Steve Sistare @ 2021-01-19 17:48 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Implement a notify callback that remembers if the container's file
descriptor has been closed.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c307f62..0167996 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -74,6 +74,7 @@ struct vfio_iommu {
 	bool			nesting;
 	bool			dirty_page_tracking;
 	bool			pinned_page_dirty_scope;
+	bool			controlled;
 };
 
 struct vfio_domain {
@@ -2518,6 +2519,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 	INIT_LIST_HEAD(&iommu->iova_list);
 	iommu->dma_list = RB_ROOT;
 	iommu->dma_avail = dma_entry_limit;
+	iommu->controlled = true;
 	mutex_init(&iommu->lock);
 	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
 
@@ -3043,6 +3045,18 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data, dma_addr_t user_iova,
 	return ret;
 }
 
+static void vfio_iommu_type1_notify(void *iommu_data, unsigned int event,
+				    void *data)
+{
+	struct vfio_iommu *iommu = iommu_data;
+
+	if (event != VFIO_DRIVER_NOTIFY_CONTAINER_CLOSE)
+		return;
+	mutex_lock(&iommu->lock);
+	iommu->controlled = false;
+	mutex_unlock(&iommu->lock);
+}
+
 static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
 	.name			= "vfio-iommu-type1",
 	.owner			= THIS_MODULE,
@@ -3056,6 +3070,7 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data, dma_addr_t user_iova,
 	.register_notifier	= vfio_iommu_type1_register_notifier,
 	.unregister_notifier	= vfio_iommu_type1_unregister_notifier,
 	.dma_rw			= vfio_iommu_type1_dma_rw,
+	.notify			= vfio_iommu_type1_notify,
 };
 
 static int __init vfio_iommu_type1_init(void)
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH V2 9/9] vfio/type1: block on invalid vaddr
  2021-01-19 17:48 [PATCH V2 0/9] vfio virtual address update Steve Sistare
                   ` (7 preceding siblings ...)
  2021-01-19 17:48 ` [PATCH V2 8/9] vfio/type1: implement " Steve Sistare
@ 2021-01-19 17:48 ` Steve Sistare
  2021-01-22 22:59   ` Alex Williamson
       [not found] ` <55725169-de0d-4019-f96c-ded20dfde0d7@oracle.com>
  9 siblings, 1 reply; 22+ messages in thread
From: Steve Sistare @ 2021-01-19 17:48 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Block translation of host virtual address while an iova range has an
invalid vaddr.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 83 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 76 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0167996..c97573a 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -31,6 +31,7 @@
 #include <linux/rbtree.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/mm.h>
+#include <linux/kthread.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
@@ -75,6 +76,7 @@ struct vfio_iommu {
 	bool			dirty_page_tracking;
 	bool			pinned_page_dirty_scope;
 	bool			controlled;
+	wait_queue_head_t	vaddr_wait;
 };
 
 struct vfio_domain {
@@ -145,6 +147,8 @@ struct vfio_regions {
 #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
 #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
 
+#define WAITED 1
+
 static int put_pfn(unsigned long pfn, int prot);
 
 static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
@@ -505,6 +509,52 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 }
 
 /*
+ * Wait for vaddr of *dma_p to become valid.  iommu lock is dropped if the task
+ * waits, but is re-locked on return.  If the task waits, then return an updated
+ * dma struct in *dma_p.  Return 0 on success with no waiting, 1 on success if
+ * waited, and -errno on error.
+ */
+static int vfio_vaddr_wait(struct vfio_iommu *iommu, struct vfio_dma **dma_p)
+{
+	struct vfio_dma *dma = *dma_p;
+	unsigned long iova = dma->iova;
+	size_t size = dma->size;
+	int ret = 0;
+	DEFINE_WAIT(wait);
+
+	while (!dma->vaddr_valid) {
+		ret = WAITED;
+		prepare_to_wait(&iommu->vaddr_wait, &wait, TASK_KILLABLE);
+		mutex_unlock(&iommu->lock);
+		schedule();
+		mutex_lock(&iommu->lock);
+		finish_wait(&iommu->vaddr_wait, &wait);
+		if (kthread_should_stop() || !iommu->controlled ||
+		    fatal_signal_pending(current)) {
+			return -EFAULT;
+		}
+		*dma_p = dma = vfio_find_dma(iommu, iova, size);
+		if (!dma)
+			return -EINVAL;
+	}
+	return ret;
+}
+
+/*
+ * Find dma struct and wait for its vaddr to be valid.  iommu lock is dropped
+ * if the task waits, but is re-locked on return.  Return result in *dma_p.
+ * Return 0 on success, 1 on success if waited,  and -errno on error.
+ */
+static int vfio_find_vaddr(struct vfio_iommu *iommu, dma_addr_t start,
+			   size_t size, struct vfio_dma **dma_p)
+{
+	*dma_p = vfio_find_dma(iommu, start, size);
+	if (!*dma_p)
+		return -EINVAL;
+	return vfio_vaddr_wait(iommu, dma_p);
+}
+
+/*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
@@ -693,11 +743,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 		struct vfio_pfn *vpfn;
 
 		iova = user_pfn[i] << PAGE_SHIFT;
-		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
-		if (!dma) {
-			ret = -EINVAL;
+		ret = vfio_find_vaddr(iommu, iova, PAGE_SIZE, &dma);
+		if (ret < 0)
 			goto pin_unwind;
-		}
+		else if (ret == WAITED)
+			do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
 
 		if ((dma->prot & prot) != prot) {
 			ret = -EPERM;
@@ -1496,6 +1546,22 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 	int ret;
 
+	/*
+	 * Wait for all vaddr to be valid so they can be used in the main loop.
+	 * If we do wait, the lock was dropped and re-taken, so start over to
+	 * ensure the dma list is consistent.
+	 */
+again:
+	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
+		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+
+		ret = vfio_vaddr_wait(iommu, &dma);
+		if (ret < 0)
+			return ret;
+		else if (ret == WAITED)
+			goto again;
+	}
+
 	/* Arbitrarily pick the first domain in the list for lookups */
 	if (!list_empty(&iommu->domain_list))
 		d = list_first_entry(&iommu->domain_list,
@@ -2522,6 +2588,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
 	iommu->controlled = true;
 	mutex_init(&iommu->lock);
 	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
+	init_waitqueue_head(&iommu->vaddr_wait);
 
 	return iommu;
 }
@@ -2972,12 +3039,13 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
 	struct vfio_dma *dma;
 	bool kthread = current->mm == NULL;
 	size_t offset;
+	int ret;
 
 	*copied = 0;
 
-	dma = vfio_find_dma(iommu, user_iova, 1);
-	if (!dma)
-		return -EINVAL;
+	ret = vfio_find_vaddr(iommu, user_iova, 1, &dma);
+	if (ret < 0)
+		return ret;
 
 	if ((write && !(dma->prot & IOMMU_WRITE)) ||
 			!(dma->prot & IOMMU_READ))
@@ -3055,6 +3123,7 @@ static void vfio_iommu_type1_notify(void *iommu_data, unsigned int event,
 	mutex_lock(&iommu->lock);
 	iommu->controlled = false;
 	mutex_unlock(&iommu->lock);
+	wake_up_all(&iommu->vaddr_wait);
 }
 
 static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 4/9] vfio/type1: implement unmap all
  2021-01-19 17:48 ` [PATCH V2 4/9] vfio/type1: implement unmap all Steve Sistare
@ 2021-01-22 21:22   ` Alex Williamson
  2021-01-27 23:07     ` Steven Sistare
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2021-01-22 21:22 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Tue, 19 Jan 2021 09:48:24 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> Implement VFIO_DMA_UNMAP_FLAG_ALL.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 22 +++++++++++++++++-----
>  1 file changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c687174..ef83018 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1100,6 +1100,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  	unsigned long pgshift;
>  	dma_addr_t iova = unmap->iova;
>  	unsigned long size = unmap->size;
> +	bool unmap_all = !!(unmap->flags & VFIO_DMA_UNMAP_FLAG_ALL);
>  
>  	mutex_lock(&iommu->lock);
>  
> @@ -1109,8 +1110,13 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  	if (iova & (pgsize - 1))
>  		goto unlock;
>  
> -	if (!size || size & (pgsize - 1))
> +	if (unmap_all) {
> +		if (iova || size)
> +			goto unlock;
> +		size = SIZE_MAX;
> +	} else if (!size || size & (pgsize - 1)) {
>  		goto unlock;
> +	}
>  
>  	if (iova + size - 1 < iova || size > SIZE_MAX)
>  		goto unlock;
> @@ -1154,7 +1160,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  	 * will only return success and a size of zero if there were no
>  	 * mappings within the range.
>  	 */
> -	if (iommu->v2) {
> +	if (iommu->v2 && !unmap_all) {
>  		dma = vfio_find_dma(iommu, iova, 1);
>  		if (dma && dma->iova != iova)
>  			goto unlock;
> @@ -1165,7 +1171,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  	}
>  
>  	ret = 0;
> -	while ((dma = vfio_find_dma(iommu, iova, size))) {
> +	while ((dma = vfio_find_dma_first(iommu, iova, size))) {


Why is this necessary?  Isn't vfio_find_dma_first() O(logN) for this
operation while vfio_find_dma() is O(1)?

>  		if (!iommu->v2 && iova > dma->iova)
>  			break;
>  		/*
> @@ -2538,6 +2544,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>  	case VFIO_TYPE1_IOMMU:
>  	case VFIO_TYPE1v2_IOMMU:
>  	case VFIO_TYPE1_NESTING_IOMMU:
> +	case VFIO_UNMAP_ALL:
>  		return 1;
>  	case VFIO_DMA_CC_IOMMU:
>  		if (!iommu)
> @@ -2710,6 +2717,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>  {
>  	struct vfio_iommu_type1_dma_unmap unmap;
>  	struct vfio_bitmap bitmap = { 0 };
> +	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
> +			VFIO_DMA_UNMAP_FLAG_ALL;
>  	unsigned long minsz;
>  	int ret;
>  
> @@ -2718,8 +2727,11 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>  	if (copy_from_user(&unmap, (void __user *)arg, minsz))
>  		return -EFAULT;
>  
> -	if (unmap.argsz < minsz ||
> -	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
> +	if (unmap.argsz < minsz || unmap.flags & ~mask)
> +		return -EINVAL;
> +
> +	if ((unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
> +	    (unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP))

Somehow we're jumping from unmap-all and dirty-bitmap being mutually
exclusive to dirty-bitmap is absolutely exclusive, which seems like a
future bug or maintenance issue.  Let's just test specifically what
we're deciding is unsupported.  Thanks,

Alex

>  		return -EINVAL;
>  
>  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 6/9] vfio/type1: implement interfaces to update vaddr
  2021-01-19 17:48 ` [PATCH V2 6/9] vfio/type1: implement " Steve Sistare
@ 2021-01-22 21:48   ` Alex Williamson
  2021-01-27 23:23     ` Steven Sistare
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2021-01-22 21:48 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Tue, 19 Jan 2021 09:48:26 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> Implement VFIO_DMA_UNMAP_FLAG_VADDR, VFIO_DMA_MAP_FLAG_VADDR, and
> VFIO_UPDATE_VADDR.  This is a partial implementation.  Blocking is
> added in the next patch.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 54 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 51 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index ef83018..c307f62 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -92,6 +92,7 @@ struct vfio_dma {
>  	int			prot;		/* IOMMU_READ/WRITE */
>  	bool			iommu_mapped;
>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
> +	bool			vaddr_valid;
>  	struct task_struct	*task;
>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>  	unsigned long		*bitmap;
> @@ -1101,6 +1102,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  	dma_addr_t iova = unmap->iova;
>  	unsigned long size = unmap->size;
>  	bool unmap_all = !!(unmap->flags & VFIO_DMA_UNMAP_FLAG_ALL);
> +	bool invalidate = !!(unmap->flags & VFIO_DMA_UNMAP_FLAG_VADDR);
>  
>  	mutex_lock(&iommu->lock);
>  
> @@ -1181,6 +1183,18 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  		if (dma->task->mm != current->mm)
>  			break;
>  
> +		if (invalidate) {
> +			if (!dma->vaddr_valid)
> +				goto unwind;
> +			dma->vaddr_valid = false;
> +			unmapped += dma->size;
> +			size -= dma->iova + dma->size - iova;
> +			iova = dma->iova + dma->size;
> +			if (iova == 0)
> +				break;
> +			continue;
> +		}

Clearly this is where the change to find-first semantics should have
been introduced, the previous patch didn't require it.  Would it really
be that unsightly to do:

while (1) {
	if (unlikely(invalidate))
		dma = vfio_find_dma_first(...
	else
		dma = vfio_find_dma(...

	if (!dma)
		break;

Find-first followed by a tree walk might be more efficient, but of
course requires a full loop redesign.

> +
>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
>  
> @@ -1218,6 +1232,20 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  		unmapped += dma->size;
>  		vfio_remove_dma(iommu, dma);
>  	}
> +	goto unlock;
> +
> +unwind:
> +	iova = unmap->iova;
> +	size = unmap_all ? SIZE_MAX : unmap->size;
> +	dma_last = dma;
> +	while ((dma = vfio_find_dma_first(iommu, iova, size)) &&
> +	       dma != dma_last) {
> +		dma->vaddr_valid = true;
> +		size -= dma->iova + dma->size - iova;
> +		iova = dma->iova + dma->size;
> +	}
> +	ret = -EINVAL;
> +	unmapped = 0;
>  
>  unlock:
>  	mutex_unlock(&iommu->lock);
> @@ -1319,6 +1347,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  			   struct vfio_iommu_type1_dma_map *map)
>  {
> +	bool update = map->flags & VFIO_DMA_MAP_FLAG_VADDR;

Please pick a slightly more specific variable name, update_vaddr maybe.
Perhaps even clear_vaddr rather than invalidate above for consistency.

>  	dma_addr_t iova = map->iova;
>  	unsigned long vaddr = map->vaddr;
>  	size_t size = map->size;
> @@ -1336,13 +1365,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
>  		prot |= IOMMU_READ;
>  
> +	if ((prot && update) || (!prot && !update))
> +		return -EINVAL;
> +
>  	mutex_lock(&iommu->lock);
>  
>  	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
>  
>  	WARN_ON((pgsize - 1) & PAGE_MASK);
>  
> -	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
> +	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
>  		ret = -EINVAL;
>  		goto out_unlock;
>  	}
> @@ -1353,7 +1385,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  		goto out_unlock;
>  	}
>  
> -	if (vfio_find_dma(iommu, iova, size)) {
> +	dma = vfio_find_dma(iommu, iova, size);
> +	if (update) {
> +		if (!dma) {
> +			ret = -ENOENT;
> +		} else if (dma->vaddr_valid || dma->iova != iova ||
> +			   dma->size != size) {
> +			ret = -EINVAL;
> +		} else {
> +			dma->vaddr = vaddr;
> +			dma->vaddr_valid = true;
> +		}
> +		goto out_unlock;
> +	} else if (dma) {
>  		ret = -EEXIST;
>  		goto out_unlock;
>  	}
> @@ -1377,6 +1421,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	iommu->dma_avail--;
>  	dma->iova = iova;
>  	dma->vaddr = vaddr;
> +	dma->vaddr_valid = true;
>  	dma->prot = prot;
>  
>  	/*
> @@ -2545,6 +2590,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>  	case VFIO_TYPE1v2_IOMMU:
>  	case VFIO_TYPE1_NESTING_IOMMU:
>  	case VFIO_UNMAP_ALL:
> +	case VFIO_UPDATE_VADDR:
>  		return 1;
>  	case VFIO_DMA_CC_IOMMU:
>  		if (!iommu)
> @@ -2699,7 +2745,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
>  {
>  	struct vfio_iommu_type1_dma_map map;
>  	unsigned long minsz;
> -	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> +	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
> +			VFIO_DMA_MAP_FLAG_VADDR;
>  
>  	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>  
> @@ -2718,6 +2765,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>  	struct vfio_iommu_type1_dma_unmap unmap;
>  	struct vfio_bitmap bitmap = { 0 };
>  	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
> +			VFIO_DMA_UNMAP_FLAG_VADDR |
>  			VFIO_DMA_UNMAP_FLAG_ALL;

dirty + vaddr would need a separate test per my previous patch comment.
Thanks,

Alex

>  	unsigned long minsz;
>  	int ret;


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 9/9] vfio/type1: block on invalid vaddr
  2021-01-19 17:48 ` [PATCH V2 9/9] vfio/type1: block on invalid vaddr Steve Sistare
@ 2021-01-22 22:59   ` Alex Williamson
  2021-01-27 23:25     ` Steven Sistare
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2021-01-22 22:59 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Tue, 19 Jan 2021 09:48:29 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> Block translation of host virtual address while an iova range has an
> invalid vaddr.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 83 +++++++++++++++++++++++++++++++++++++----
>  1 file changed, 76 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 0167996..c97573a 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -31,6 +31,7 @@
>  #include <linux/rbtree.h>
>  #include <linux/sched/signal.h>
>  #include <linux/sched/mm.h>
> +#include <linux/kthread.h>
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
>  #include <linux/vfio.h>
> @@ -75,6 +76,7 @@ struct vfio_iommu {
>  	bool			dirty_page_tracking;
>  	bool			pinned_page_dirty_scope;
>  	bool			controlled;
> +	wait_queue_head_t	vaddr_wait;
>  };
>  
>  struct vfio_domain {
> @@ -145,6 +147,8 @@ struct vfio_regions {
>  #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
>  #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>  
> +#define WAITED 1
> +
>  static int put_pfn(unsigned long pfn, int prot);
>  
>  static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
> @@ -505,6 +509,52 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  }
>  
>  /*
> + * Wait for vaddr of *dma_p to become valid.  iommu lock is dropped if the task
> + * waits, but is re-locked on return.  If the task waits, then return an updated
> + * dma struct in *dma_p.  Return 0 on success with no waiting, 1 on success if

s/1/WAITED/

> + * waited, and -errno on error.
> + */
> +static int vfio_vaddr_wait(struct vfio_iommu *iommu, struct vfio_dma **dma_p)
> +{
> +	struct vfio_dma *dma = *dma_p;
> +	unsigned long iova = dma->iova;
> +	size_t size = dma->size;
> +	int ret = 0;
> +	DEFINE_WAIT(wait);
> +
> +	while (!dma->vaddr_valid) {
> +		ret = WAITED;
> +		prepare_to_wait(&iommu->vaddr_wait, &wait, TASK_KILLABLE);
> +		mutex_unlock(&iommu->lock);
> +		schedule();
> +		mutex_lock(&iommu->lock);
> +		finish_wait(&iommu->vaddr_wait, &wait);
> +		if (kthread_should_stop() || !iommu->controlled ||
> +		    fatal_signal_pending(current)) {
> +			return -EFAULT;
> +		}
> +		*dma_p = dma = vfio_find_dma(iommu, iova, size);
> +		if (!dma)
> +			return -EINVAL;
> +	}
> +	return ret;
> +}
> +
> +/*
> + * Find dma struct and wait for its vaddr to be valid.  iommu lock is dropped
> + * if the task waits, but is re-locked on return.  Return result in *dma_p.
> + * Return 0 on success, 1 on success if waited,  and -errno on error.
> + */
> +static int vfio_find_vaddr(struct vfio_iommu *iommu, dma_addr_t start,
> +			   size_t size, struct vfio_dma **dma_p)

more of a vfio_find_dma_valid()

> +{
> +	*dma_p = vfio_find_dma(iommu, start, size);
> +	if (!*dma_p)
> +		return -EINVAL;
> +	return vfio_vaddr_wait(iommu, dma_p);
> +}
> +
> +/*
>   * Attempt to pin pages.  We really don't want to track all the pfns and
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>   * first page and all consecutive pages with the same locking.
> @@ -693,11 +743,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  		struct vfio_pfn *vpfn;
>  
>  		iova = user_pfn[i] << PAGE_SHIFT;
> -		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> -		if (!dma) {
> -			ret = -EINVAL;
> +		ret = vfio_find_vaddr(iommu, iova, PAGE_SIZE, &dma);
> +		if (ret < 0)
>  			goto pin_unwind;
> -		}
> +		else if (ret == WAITED)
> +			do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);

We're iterating through an array of pfns to provide translations, once
we've released the lock it's not just the current one that could be
invalid.  I'm afraid we need to unwind and try again, but I think it's
actually worse than that because if we've marked pages within a
vfio_dma's pfn_list as pinned, then during the locking gap it gets
unmapped, the unmap path will call the unmap notifier chain to release
the page that the vendor driver doesn't have yet.  Yikes!

>  
>  		if ((dma->prot & prot) != prot) {
>  			ret = -EPERM;
> @@ -1496,6 +1546,22 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>  	int ret;
>  
> +	/*
> +	 * Wait for all vaddr to be valid so they can be used in the main loop.
> +	 * If we do wait, the lock was dropped and re-taken, so start over to
> +	 * ensure the dma list is consistent.
> +	 */
> +again:
> +	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> +
> +		ret = vfio_vaddr_wait(iommu, &dma);
> +		if (ret < 0)
> +			return ret;
> +		else if (ret == WAITED)
> +			goto again;
> +	}

This "do nothing until all the vaddrs are valid" approach could work
above too, but gosh is that a lot of cache unfriendly work for a rare
invalidation.  Thanks,

Alex

> +
>  	/* Arbitrarily pick the first domain in the list for lookups */
>  	if (!list_empty(&iommu->domain_list))
>  		d = list_first_entry(&iommu->domain_list,
> @@ -2522,6 +2588,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>  	iommu->controlled = true;
>  	mutex_init(&iommu->lock);
>  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> +	init_waitqueue_head(&iommu->vaddr_wait);
>  
>  	return iommu;
>  }
> @@ -2972,12 +3039,13 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
>  	struct vfio_dma *dma;
>  	bool kthread = current->mm == NULL;
>  	size_t offset;
> +	int ret;
>  
>  	*copied = 0;
>  
> -	dma = vfio_find_dma(iommu, user_iova, 1);
> -	if (!dma)
> -		return -EINVAL;
> +	ret = vfio_find_vaddr(iommu, user_iova, 1, &dma);
> +	if (ret < 0)
> +		return ret;
>  
>  	if ((write && !(dma->prot & IOMMU_WRITE)) ||
>  			!(dma->prot & IOMMU_READ))
> @@ -3055,6 +3123,7 @@ static void vfio_iommu_type1_notify(void *iommu_data, unsigned int event,
>  	mutex_lock(&iommu->lock);
>  	iommu->controlled = false;
>  	mutex_unlock(&iommu->lock);
> +	wake_up_all(&iommu->vaddr_wait);
>  }
>  
>  static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 8/9] vfio/type1: implement notify callback
  2021-01-19 17:48 ` [PATCH V2 8/9] vfio/type1: implement " Steve Sistare
@ 2021-01-22 23:00   ` Alex Williamson
  2021-01-27 23:24     ` Steven Sistare
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2021-01-22 23:00 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Tue, 19 Jan 2021 09:48:28 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> Implement a notify callback that remembers if the container's file
> descriptor has been closed.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c307f62..0167996 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -74,6 +74,7 @@ struct vfio_iommu {
>  	bool			nesting;
>  	bool			dirty_page_tracking;
>  	bool			pinned_page_dirty_scope;
> +	bool			controlled;

s/controlled/container_open/?  Thanks,

Alex

>  };
>  
>  struct vfio_domain {
> @@ -2518,6 +2519,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>  	INIT_LIST_HEAD(&iommu->iova_list);
>  	iommu->dma_list = RB_ROOT;
>  	iommu->dma_avail = dma_entry_limit;
> +	iommu->controlled = true;
>  	mutex_init(&iommu->lock);
>  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
>  
> @@ -3043,6 +3045,18 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data, dma_addr_t user_iova,
>  	return ret;
>  }
>  
> +static void vfio_iommu_type1_notify(void *iommu_data, unsigned int event,
> +				    void *data)
> +{
> +	struct vfio_iommu *iommu = iommu_data;
> +
> +	if (event != VFIO_DRIVER_NOTIFY_CONTAINER_CLOSE)
> +		return;
> +	mutex_lock(&iommu->lock);
> +	iommu->controlled = false;
> +	mutex_unlock(&iommu->lock);
> +}
> +
>  static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>  	.name			= "vfio-iommu-type1",
>  	.owner			= THIS_MODULE,
> @@ -3056,6 +3070,7 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data, dma_addr_t user_iova,
>  	.register_notifier	= vfio_iommu_type1_register_notifier,
>  	.unregister_notifier	= vfio_iommu_type1_unregister_notifier,
>  	.dma_rw			= vfio_iommu_type1_dma_rw,
> +	.notify			= vfio_iommu_type1_notify,
>  };
>  
>  static int __init vfio_iommu_type1_init(void)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 4/9] vfio/type1: implement unmap all
  2021-01-22 21:22   ` Alex Williamson
@ 2021-01-27 23:07     ` Steven Sistare
  0 siblings, 0 replies; 22+ messages in thread
From: Steven Sistare @ 2021-01-27 23:07 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/22/2021 4:22 PM, Alex Williamson wrote:
> On Tue, 19 Jan 2021 09:48:24 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Implement VFIO_DMA_UNMAP_FLAG_ALL.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 22 +++++++++++++++++-----
>>  1 file changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index c687174..ef83018 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -1100,6 +1100,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  	unsigned long pgshift;
>>  	dma_addr_t iova = unmap->iova;
>>  	unsigned long size = unmap->size;
>> +	bool unmap_all = !!(unmap->flags & VFIO_DMA_UNMAP_FLAG_ALL);
>>  
>>  	mutex_lock(&iommu->lock);
>>  
>> @@ -1109,8 +1110,13 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  	if (iova & (pgsize - 1))
>>  		goto unlock;
>>  
>> -	if (!size || size & (pgsize - 1))
>> +	if (unmap_all) {
>> +		if (iova || size)
>> +			goto unlock;
>> +		size = SIZE_MAX;
>> +	} else if (!size || size & (pgsize - 1)) {
>>  		goto unlock;
>> +	}
>>  
>>  	if (iova + size - 1 < iova || size > SIZE_MAX)
>>  		goto unlock;
>> @@ -1154,7 +1160,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  	 * will only return success and a size of zero if there were no
>>  	 * mappings within the range.
>>  	 */
>> -	if (iommu->v2) {
>> +	if (iommu->v2 && !unmap_all) {
>>  		dma = vfio_find_dma(iommu, iova, 1);
>>  		if (dma && dma->iova != iova)
>>  			goto unlock;
>> @@ -1165,7 +1171,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  	}
>>  
>>  	ret = 0;
>> -	while ((dma = vfio_find_dma(iommu, iova, size))) {
>> +	while ((dma = vfio_find_dma_first(iommu, iova, size))) {
> 
> 
> Why is this necessary?  Isn't vfio_find_dma_first() O(logN) for this
> operation while vfio_find_dma() is O(1)?

True, vfio_find_dma is O(1) for unmap-all, and find-first is not needed until a later patch.  
I'll continue discussing this issue in my response to your next email.
 
>>  		if (!iommu->v2 && iova > dma->iova)
>>  			break;
>>  		/*
>> @@ -2538,6 +2544,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>>  	case VFIO_TYPE1_IOMMU:
>>  	case VFIO_TYPE1v2_IOMMU:
>>  	case VFIO_TYPE1_NESTING_IOMMU:
>> +	case VFIO_UNMAP_ALL:
>>  		return 1;
>>  	case VFIO_DMA_CC_IOMMU:
>>  		if (!iommu)
>> @@ -2710,6 +2717,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>>  {
>>  	struct vfio_iommu_type1_dma_unmap unmap;
>>  	struct vfio_bitmap bitmap = { 0 };
>> +	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
>> +			VFIO_DMA_UNMAP_FLAG_ALL;
>>  	unsigned long minsz;
>>  	int ret;
>>  
>> @@ -2718,8 +2727,11 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>>  	if (copy_from_user(&unmap, (void __user *)arg, minsz))
>>  		return -EFAULT;
>>  
>> -	if (unmap.argsz < minsz ||
>> -	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
>> +	if (unmap.argsz < minsz || unmap.flags & ~mask)
>> +		return -EINVAL;
>> +
>> +	if ((unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) &&
>> +	    (unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP))
> 
> Somehow we're jumping from unmap-all and dirty-bitmap being mutually
> exclusive to dirty-bitmap is absolutely exclusive, which seems like a
> future bug or maintenance issue.  Let's just test specifically what
> we're deciding is unsupported.  Thanks,

OK, I will make it future proof.

- Steve

 		return -EINVAL;
>>  
>>  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 6/9] vfio/type1: implement interfaces to update vaddr
  2021-01-22 21:48   ` Alex Williamson
@ 2021-01-27 23:23     ` Steven Sistare
  0 siblings, 0 replies; 22+ messages in thread
From: Steven Sistare @ 2021-01-27 23:23 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/22/2021 4:48 PM, Alex Williamson wrote:
> On Tue, 19 Jan 2021 09:48:26 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Implement VFIO_DMA_UNMAP_FLAG_VADDR, VFIO_DMA_MAP_FLAG_VADDR, and
>> VFIO_UPDATE_VADDR.  This is a partial implementation.  Blocking is
>> added in the next patch.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 54 ++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 51 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index ef83018..c307f62 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -92,6 +92,7 @@ struct vfio_dma {
>>  	int			prot;		/* IOMMU_READ/WRITE */
>>  	bool			iommu_mapped;
>>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
>> +	bool			vaddr_valid;
>>  	struct task_struct	*task;
>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>  	unsigned long		*bitmap;
>> @@ -1101,6 +1102,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  	dma_addr_t iova = unmap->iova;
>>  	unsigned long size = unmap->size;
>>  	bool unmap_all = !!(unmap->flags & VFIO_DMA_UNMAP_FLAG_ALL);
>> +	bool invalidate = !!(unmap->flags & VFIO_DMA_UNMAP_FLAG_VADDR);
>>  
>>  	mutex_lock(&iommu->lock);
>>  
>> @@ -1181,6 +1183,18 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  		if (dma->task->mm != current->mm)
>>  			break;
>>  
>> +		if (invalidate) {
>> +			if (!dma->vaddr_valid)
>> +				goto unwind;
>> +			dma->vaddr_valid = false;
>> +			unmapped += dma->size;
>> +			size -= dma->iova + dma->size - iova;
>> +			iova = dma->iova + dma->size;
>> +			if (iova == 0)
>> +				break;
>> +			continue;
>> +		}
> 
> Clearly this is where the change to find-first semantics should have
> been introduced, the previous patch didn't require it.  Would it really
> be that unsightly to do:
> 
> while (1) {
> 	if (unlikely(invalidate))
> 		dma = vfio_find_dma_first(...
> 	else
> 		dma = vfio_find_dma(...
> 
> 	if (!dma)
> 		break;
> 
> Find-first followed by a tree walk might be more efficient, but of
> course requires a full loop redesign.

OK, I'll bite.  How about:

n = vfio_find_dma_first(iommu, iova, size);
while (n) {
        dma = rb_entry(n, struct vfio_dma, node);
        ...
        n = rb_next(n);
        vfio_remove_dma(iommu, dma);
}

plus similar changes in unwind.  This costs logN + N, eg O(N). 

For unmap-all, I could do find-any in constant time instead of logN and suppress the rb_next, 
but the code would be messier and the overall cost would still be O(N), so it does not seem worth it.

>> +
>>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
>>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
>>  
>> @@ -1218,6 +1232,20 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  		unmapped += dma->size;
>>  		vfio_remove_dma(iommu, dma);
>>  	}
>> +	goto unlock;
>> +
>> +unwind:
>> +	iova = unmap->iova;
>> +	size = unmap_all ? SIZE_MAX : unmap->size;
>> +	dma_last = dma;
>> +	while ((dma = vfio_find_dma_first(iommu, iova, size)) &&
>> +	       dma != dma_last) {
>> +		dma->vaddr_valid = true;
>> +		size -= dma->iova + dma->size - iova;
>> +		iova = dma->iova + dma->size;
>> +	}
>> +	ret = -EINVAL;
>> +	unmapped = 0;
>>  
>>  unlock:
>>  	mutex_unlock(&iommu->lock);
>> @@ -1319,6 +1347,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
>>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  			   struct vfio_iommu_type1_dma_map *map)
>>  {
>> +	bool update = map->flags & VFIO_DMA_MAP_FLAG_VADDR;
> 
> Please pick a slightly more specific variable name, update_vaddr maybe.
> Perhaps even clear_vaddr rather than invalidate above for consistency.

OK.

>>  	dma_addr_t iova = map->iova;
>>  	unsigned long vaddr = map->vaddr;
>>  	size_t size = map->size;
>> @@ -1336,13 +1365,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
>>  		prot |= IOMMU_READ;
>>  
>> +	if ((prot && update) || (!prot && !update))
>> +		return -EINVAL;
>> +
>>  	mutex_lock(&iommu->lock);
>>  
>>  	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
>>  
>>  	WARN_ON((pgsize - 1) & PAGE_MASK);
>>  
>> -	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
>> +	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
>>  		ret = -EINVAL;
>>  		goto out_unlock;
>>  	}
>> @@ -1353,7 +1385,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  		goto out_unlock;
>>  	}
>>  
>> -	if (vfio_find_dma(iommu, iova, size)) {
>> +	dma = vfio_find_dma(iommu, iova, size);
>> +	if (update) {
>> +		if (!dma) {
>> +			ret = -ENOENT;
>> +		} else if (dma->vaddr_valid || dma->iova != iova ||
>> +			   dma->size != size) {
>> +			ret = -EINVAL;
>> +		} else {
>> +			dma->vaddr = vaddr;
>> +			dma->vaddr_valid = true;
>> +		}
>> +		goto out_unlock;
>> +	} else if (dma) {
>>  		ret = -EEXIST;
>>  		goto out_unlock;
>>  	}
>> @@ -1377,6 +1421,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	iommu->dma_avail--;
>>  	dma->iova = iova;
>>  	dma->vaddr = vaddr;
>> +	dma->vaddr_valid = true;
>>  	dma->prot = prot;
>>  
>>  	/*
>> @@ -2545,6 +2590,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>>  	case VFIO_TYPE1v2_IOMMU:
>>  	case VFIO_TYPE1_NESTING_IOMMU:
>>  	case VFIO_UNMAP_ALL:
>> +	case VFIO_UPDATE_VADDR:
>>  		return 1;
>>  	case VFIO_DMA_CC_IOMMU:
>>  		if (!iommu)
>> @@ -2699,7 +2745,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
>>  {
>>  	struct vfio_iommu_type1_dma_map map;
>>  	unsigned long minsz;
>> -	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>> +	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
>> +			VFIO_DMA_MAP_FLAG_VADDR;
>>  
>>  	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>>  
>> @@ -2718,6 +2765,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>>  	struct vfio_iommu_type1_dma_unmap unmap;
>>  	struct vfio_bitmap bitmap = { 0 };
>>  	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
>> +			VFIO_DMA_UNMAP_FLAG_VADDR |
>>  			VFIO_DMA_UNMAP_FLAG_ALL;
> 
> dirty + vaddr would need a separate test per my previous patch comment.
> Thanks,

Yes.

- Steve

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 8/9] vfio/type1: implement notify callback
  2021-01-22 23:00   ` Alex Williamson
@ 2021-01-27 23:24     ` Steven Sistare
  0 siblings, 0 replies; 22+ messages in thread
From: Steven Sistare @ 2021-01-27 23:24 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/22/2021 6:00 PM, Alex Williamson wrote:
> On Tue, 19 Jan 2021 09:48:28 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Implement a notify callback that remembers if the container's file
>> descriptor has been closed.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 15 +++++++++++++++
>>  1 file changed, 15 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index c307f62..0167996 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -74,6 +74,7 @@ struct vfio_iommu {
>>  	bool			nesting;
>>  	bool			dirty_page_tracking;
>>  	bool			pinned_page_dirty_scope;
>> +	bool			controlled;
> 
> s/controlled/container_open/?  Thanks,

I like controlled because it documents the issue nicely: it is no longer possible for
the user to change the state via IOConTroL.  If you like more context, then container_controlled.
But container_open is fine.

- Steve
 
>>  };
>>  
>>  struct vfio_domain {
>> @@ -2518,6 +2519,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>>  	INIT_LIST_HEAD(&iommu->iova_list);
>>  	iommu->dma_list = RB_ROOT;
>>  	iommu->dma_avail = dma_entry_limit;
>> +	iommu->controlled = true;
>>  	mutex_init(&iommu->lock);
>>  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
>>  
>> @@ -3043,6 +3045,18 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data, dma_addr_t user_iova,
>>  	return ret;
>>  }
>>  
>> +static void vfio_iommu_type1_notify(void *iommu_data, unsigned int event,
>> +				    void *data)
>> +{
>> +	struct vfio_iommu *iommu = iommu_data;
>> +
>> +	if (event != VFIO_DRIVER_NOTIFY_CONTAINER_CLOSE)
>> +		return;
>> +	mutex_lock(&iommu->lock);
>> +	iommu->controlled = false;
>> +	mutex_unlock(&iommu->lock);
>> +}
>> +
>>  static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
>>  	.name			= "vfio-iommu-type1",
>>  	.owner			= THIS_MODULE,
>> @@ -3056,6 +3070,7 @@ static int vfio_iommu_type1_dma_rw(void *iommu_data, dma_addr_t user_iova,
>>  	.register_notifier	= vfio_iommu_type1_register_notifier,
>>  	.unregister_notifier	= vfio_iommu_type1_unregister_notifier,
>>  	.dma_rw			= vfio_iommu_type1_dma_rw,
>> +	.notify			= vfio_iommu_type1_notify,
>>  };
>>  
>>  static int __init vfio_iommu_type1_init(void)
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 9/9] vfio/type1: block on invalid vaddr
  2021-01-22 22:59   ` Alex Williamson
@ 2021-01-27 23:25     ` Steven Sistare
  2021-01-28  0:03       ` Alex Williamson
  0 siblings, 1 reply; 22+ messages in thread
From: Steven Sistare @ 2021-01-27 23:25 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/22/2021 5:59 PM, Alex Williamson wrote:
> On Tue, 19 Jan 2021 09:48:29 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Block translation of host virtual address while an iova range has an
>> invalid vaddr.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 83 +++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 76 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 0167996..c97573a 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -31,6 +31,7 @@
>>  #include <linux/rbtree.h>
>>  #include <linux/sched/signal.h>
>>  #include <linux/sched/mm.h>
>> +#include <linux/kthread.h>
>>  #include <linux/slab.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/vfio.h>
>> @@ -75,6 +76,7 @@ struct vfio_iommu {
>>  	bool			dirty_page_tracking;
>>  	bool			pinned_page_dirty_scope;
>>  	bool			controlled;
>> +	wait_queue_head_t	vaddr_wait;
>>  };
>>  
>>  struct vfio_domain {
>> @@ -145,6 +147,8 @@ struct vfio_regions {
>>  #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
>>  #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>>  
>> +#define WAITED 1
>> +
>>  static int put_pfn(unsigned long pfn, int prot);
>>  
>>  static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
>> @@ -505,6 +509,52 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>  }
>>  
>>  /*
>> + * Wait for vaddr of *dma_p to become valid.  iommu lock is dropped if the task
>> + * waits, but is re-locked on return.  If the task waits, then return an updated
>> + * dma struct in *dma_p.  Return 0 on success with no waiting, 1 on success if
> 
> s/1/WAITED/

OK, but the WAITED state will not need to be returned in the new scheme below.

>> + * waited, and -errno on error.
>> + */
>> +static int vfio_vaddr_wait(struct vfio_iommu *iommu, struct vfio_dma **dma_p)
>> +{
>> +	struct vfio_dma *dma = *dma_p;
>> +	unsigned long iova = dma->iova;
>> +	size_t size = dma->size;
>> +	int ret = 0;
>> +	DEFINE_WAIT(wait);
>> +
>> +	while (!dma->vaddr_valid) {
>> +		ret = WAITED;
>> +		prepare_to_wait(&iommu->vaddr_wait, &wait, TASK_KILLABLE);
>> +		mutex_unlock(&iommu->lock);
>> +		schedule();
>> +		mutex_lock(&iommu->lock);
>> +		finish_wait(&iommu->vaddr_wait, &wait);
>> +		if (kthread_should_stop() || !iommu->controlled ||
>> +		    fatal_signal_pending(current)) {
>> +			return -EFAULT;
>> +		}
>> +		*dma_p = dma = vfio_find_dma(iommu, iova, size);
>> +		if (!dma)
>> +			return -EINVAL;
>> +	}
>> +	return ret;
>> +}
>> +
>> +/*
>> + * Find dma struct and wait for its vaddr to be valid.  iommu lock is dropped
>> + * if the task waits, but is re-locked on return.  Return result in *dma_p.
>> + * Return 0 on success, 1 on success if waited,  and -errno on error.
>> + */
>> +static int vfio_find_vaddr(struct vfio_iommu *iommu, dma_addr_t start,
>> +			   size_t size, struct vfio_dma **dma_p)
> 
> more of a vfio_find_dma_valid()

I will slightly modify and rename this with the new scheme I describe below.

>> +{
>> +	*dma_p = vfio_find_dma(iommu, start, size);
>> +	if (!*dma_p)
>> +		return -EINVAL;
>> +	return vfio_vaddr_wait(iommu, dma_p);
>> +}
>> +
>> +/*
>>   * Attempt to pin pages.  We really don't want to track all the pfns and
>>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>>   * first page and all consecutive pages with the same locking.
>> @@ -693,11 +743,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>  		struct vfio_pfn *vpfn;
>>  
>>  		iova = user_pfn[i] << PAGE_SHIFT;
>> -		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>> -		if (!dma) {
>> -			ret = -EINVAL;
>> +		ret = vfio_find_vaddr(iommu, iova, PAGE_SIZE, &dma);
>> +		if (ret < 0)
>>  			goto pin_unwind;
>> -		}
>> +		else if (ret == WAITED)
>> +			do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);
> 
> We're iterating through an array of pfns to provide translations, once
> we've released the lock it's not just the current one that could be
> invalid.  I'm afraid we need to unwind and try again, but I think it's
> actually worse than that because if we've marked pages within a
> vfio_dma's pfn_list as pinned, then during the locking gap it gets
> unmapped, the unmap path will call the unmap notifier chain to release
> the page that the vendor driver doesn't have yet.  Yikes!

Yikes indeed.  The fix is easy, though.  I will maintain a count in vfio_iommu of 
vfio_dma objects with invalid vaddr's, modified and tested while holding the lock, 
provide a function to wait for the count to become 0, and call that function at the 
start of vfio_iommu_type1_pin_pages and vfio_iommu_replay.  I will use iommu->vaddr_wait 
for wait and wake.

>>  		if ((dma->prot & prot) != prot) {
>>  			ret = -EPERM;
>> @@ -1496,6 +1546,22 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>>  	int ret;
>>  
>> +	/*
>> +	 * Wait for all vaddr to be valid so they can be used in the main loop.
>> +	 * If we do wait, the lock was dropped and re-taken, so start over to
>> +	 * ensure the dma list is consistent.
>> +	 */
>> +again:
>> +	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>> +
>> +		ret = vfio_vaddr_wait(iommu, &dma);
>> +		if (ret < 0)
>> +			return ret;
>> +		else if (ret == WAITED)
>> +			goto again;
>> +	}
> 
> This "do nothing until all the vaddrs are valid" approach could work
> above too, but gosh is that a lot of cache unfriendly work for a rare
> invalidation.  Thanks,

The new wait function described above is fast in the common case, just a check that
the invalid vaddr count is 0.

- Steve

>> +
>>  	/* Arbitrarily pick the first domain in the list for lookups */
>>  	if (!list_empty(&iommu->domain_list))
>>  		d = list_first_entry(&iommu->domain_list,
>> @@ -2522,6 +2588,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>>  	iommu->controlled = true;
>>  	mutex_init(&iommu->lock);
>>  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
>> +	init_waitqueue_head(&iommu->vaddr_wait);
>>  
>>  	return iommu;
>>  }
>> @@ -2972,12 +3039,13 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
>>  	struct vfio_dma *dma;
>>  	bool kthread = current->mm == NULL;
>>  	size_t offset;
>> +	int ret;
>>  
>>  	*copied = 0;
>>  
>> -	dma = vfio_find_dma(iommu, user_iova, 1);
>> -	if (!dma)
>> -		return -EINVAL;
>> +	ret = vfio_find_vaddr(iommu, user_iova, 1, &dma);
>> +	if (ret < 0)
>> +		return ret;
>>  
>>  	if ((write && !(dma->prot & IOMMU_WRITE)) ||
>>  			!(dma->prot & IOMMU_READ))
>> @@ -3055,6 +3123,7 @@ static void vfio_iommu_type1_notify(void *iommu_data, unsigned int event,
>>  	mutex_lock(&iommu->lock);
>>  	iommu->controlled = false;
>>  	mutex_unlock(&iommu->lock);
>> +	wake_up_all(&iommu->vaddr_wait);
>>  }
>>  
>>  static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 9/9] vfio/type1: block on invalid vaddr
  2021-01-27 23:25     ` Steven Sistare
@ 2021-01-28  0:03       ` Alex Williamson
  2021-01-28 17:07         ` Alex Williamson
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2021-01-28  0:03 UTC (permalink / raw)
  To: Steven Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Wed, 27 Jan 2021 18:25:13 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 1/22/2021 5:59 PM, Alex Williamson wrote:
> > On Tue, 19 Jan 2021 09:48:29 -0800
> > Steve Sistare <steven.sistare@oracle.com> wrote:
> >   
> >> Block translation of host virtual address while an iova range has an
> >> invalid vaddr.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  drivers/vfio/vfio_iommu_type1.c | 83 +++++++++++++++++++++++++++++++++++++----
> >>  1 file changed, 76 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 0167996..c97573a 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -31,6 +31,7 @@
> >>  #include <linux/rbtree.h>
> >>  #include <linux/sched/signal.h>
> >>  #include <linux/sched/mm.h>
> >> +#include <linux/kthread.h>
> >>  #include <linux/slab.h>
> >>  #include <linux/uaccess.h>
> >>  #include <linux/vfio.h>
> >> @@ -75,6 +76,7 @@ struct vfio_iommu {
> >>  	bool			dirty_page_tracking;
> >>  	bool			pinned_page_dirty_scope;
> >>  	bool			controlled;
> >> +	wait_queue_head_t	vaddr_wait;
> >>  };
> >>  
> >>  struct vfio_domain {
> >> @@ -145,6 +147,8 @@ struct vfio_regions {
> >>  #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
> >>  #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
> >>  
> >> +#define WAITED 1
> >> +
> >>  static int put_pfn(unsigned long pfn, int prot);
> >>  
> >>  static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
> >> @@ -505,6 +509,52 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> >>  }
> >>  
> >>  /*
> >> + * Wait for vaddr of *dma_p to become valid.  iommu lock is dropped if the task
> >> + * waits, but is re-locked on return.  If the task waits, then return an updated
> >> + * dma struct in *dma_p.  Return 0 on success with no waiting, 1 on success if  
> > 
> > s/1/WAITED/  
> 
> OK, but the WAITED state will not need to be returned in the new scheme below.
> 
> >> + * waited, and -errno on error.
> >> + */
> >> +static int vfio_vaddr_wait(struct vfio_iommu *iommu, struct vfio_dma **dma_p)
> >> +{
> >> +	struct vfio_dma *dma = *dma_p;
> >> +	unsigned long iova = dma->iova;
> >> +	size_t size = dma->size;
> >> +	int ret = 0;
> >> +	DEFINE_WAIT(wait);
> >> +
> >> +	while (!dma->vaddr_valid) {
> >> +		ret = WAITED;
> >> +		prepare_to_wait(&iommu->vaddr_wait, &wait, TASK_KILLABLE);
> >> +		mutex_unlock(&iommu->lock);
> >> +		schedule();
> >> +		mutex_lock(&iommu->lock);
> >> +		finish_wait(&iommu->vaddr_wait, &wait);
> >> +		if (kthread_should_stop() || !iommu->controlled ||
> >> +		    fatal_signal_pending(current)) {
> >> +			return -EFAULT;
> >> +		}
> >> +		*dma_p = dma = vfio_find_dma(iommu, iova, size);
> >> +		if (!dma)
> >> +			return -EINVAL;
> >> +	}
> >> +	return ret;
> >> +}
> >> +
> >> +/*
> >> + * Find dma struct and wait for its vaddr to be valid.  iommu lock is dropped
> >> + * if the task waits, but is re-locked on return.  Return result in *dma_p.
> >> + * Return 0 on success, 1 on success if waited,  and -errno on error.
> >> + */
> >> +static int vfio_find_vaddr(struct vfio_iommu *iommu, dma_addr_t start,
> >> +			   size_t size, struct vfio_dma **dma_p)  
> > 
> > more of a vfio_find_dma_valid()  
> 
> I will slightly modify and rename this with the new scheme I describe below.
> 
> >> +{
> >> +	*dma_p = vfio_find_dma(iommu, start, size);
> >> +	if (!*dma_p)
> >> +		return -EINVAL;
> >> +	return vfio_vaddr_wait(iommu, dma_p);
> >> +}
> >> +
> >> +/*
> >>   * Attempt to pin pages.  We really don't want to track all the pfns and
> >>   * the iommu can only map chunks of consecutive pfns anyway, so get the
> >>   * first page and all consecutive pages with the same locking.
> >> @@ -693,11 +743,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>  		struct vfio_pfn *vpfn;
> >>  
> >>  		iova = user_pfn[i] << PAGE_SHIFT;
> >> -		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> >> -		if (!dma) {
> >> -			ret = -EINVAL;
> >> +		ret = vfio_find_vaddr(iommu, iova, PAGE_SIZE, &dma);
> >> +		if (ret < 0)
> >>  			goto pin_unwind;
> >> -		}
> >> +		else if (ret == WAITED)
> >> +			do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);  
> > 
> > We're iterating through an array of pfns to provide translations, once
> > we've released the lock it's not just the current one that could be
> > invalid.  I'm afraid we need to unwind and try again, but I think it's
> > actually worse than that because if we've marked pages within a
> > vfio_dma's pfn_list as pinned, then during the locking gap it gets
> > unmapped, the unmap path will call the unmap notifier chain to release
> > the page that the vendor driver doesn't have yet.  Yikes!  
> 
> Yikes indeed.  The fix is easy, though.  I will maintain a count in vfio_iommu of 
> vfio_dma objects with invalid vaddr's, modified and tested while holding the lock, 
> provide a function to wait for the count to become 0, and call that function at the 
> start of vfio_iommu_type1_pin_pages and vfio_iommu_replay.  I will use iommu->vaddr_wait 
> for wait and wake.

I prefer the overhead of this, but the resulting behavior seems pretty
non-intuitive.  Any invalidated vaddr blocks all vaddr use cases, which
almost suggests the unmap _VADDR flag should only be allowed with the
_ALL flag, but then the map _VADDR flag can only be per mapping, which
would make accounting and recovering from _VADDR + _ALL pretty
complicated.  Thanks,

Alex

> >>  		if ((dma->prot & prot) != prot) {
> >>  			ret = -EPERM;
> >> @@ -1496,6 +1546,22 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> >>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >>  	int ret;
> >>  
> >> +	/*
> >> +	 * Wait for all vaddr to be valid so they can be used in the main loop.
> >> +	 * If we do wait, the lock was dropped and re-taken, so start over to
> >> +	 * ensure the dma list is consistent.
> >> +	 */
> >> +again:
> >> +	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
> >> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> >> +
> >> +		ret = vfio_vaddr_wait(iommu, &dma);
> >> +		if (ret < 0)
> >> +			return ret;
> >> +		else if (ret == WAITED)
> >> +			goto again;
> >> +	}  
> > 
> > This "do nothing until all the vaddrs are valid" approach could work
> > above too, but gosh is that a lot of cache unfriendly work for a rare
> > invalidation.  Thanks,  
> 
> The new wait function described above is fast in the common case, just a check that
> the invalid vaddr count is 0.
> 
> - Steve
> 
> >> +
> >>  	/* Arbitrarily pick the first domain in the list for lookups */
> >>  	if (!list_empty(&iommu->domain_list))
> >>  		d = list_first_entry(&iommu->domain_list,
> >> @@ -2522,6 +2588,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
> >>  	iommu->controlled = true;
> >>  	mutex_init(&iommu->lock);
> >>  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> >> +	init_waitqueue_head(&iommu->vaddr_wait);
> >>  
> >>  	return iommu;
> >>  }
> >> @@ -2972,12 +3039,13 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
> >>  	struct vfio_dma *dma;
> >>  	bool kthread = current->mm == NULL;
> >>  	size_t offset;
> >> +	int ret;
> >>  
> >>  	*copied = 0;
> >>  
> >> -	dma = vfio_find_dma(iommu, user_iova, 1);
> >> -	if (!dma)
> >> -		return -EINVAL;
> >> +	ret = vfio_find_vaddr(iommu, user_iova, 1, &dma);
> >> +	if (ret < 0)
> >> +		return ret;
> >>  
> >>  	if ((write && !(dma->prot & IOMMU_WRITE)) ||
> >>  			!(dma->prot & IOMMU_READ))
> >> @@ -3055,6 +3123,7 @@ static void vfio_iommu_type1_notify(void *iommu_data, unsigned int event,
> >>  	mutex_lock(&iommu->lock);
> >>  	iommu->controlled = false;
> >>  	mutex_unlock(&iommu->lock);
> >> +	wake_up_all(&iommu->vaddr_wait);
> >>  }
> >>  
> >>  static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {  
> >   
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 9/9] vfio/type1: block on invalid vaddr
  2021-01-28  0:03       ` Alex Williamson
@ 2021-01-28 17:07         ` Alex Williamson
  2021-01-28 17:18           ` Steven Sistare
  0 siblings, 1 reply; 22+ messages in thread
From: Alex Williamson @ 2021-01-28 17:07 UTC (permalink / raw)
  To: Steven Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Wed, 27 Jan 2021 17:03:25 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Wed, 27 Jan 2021 18:25:13 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
> > On 1/22/2021 5:59 PM, Alex Williamson wrote:  
> > > On Tue, 19 Jan 2021 09:48:29 -0800
> > > Steve Sistare <steven.sistare@oracle.com> wrote:
> > >     
> > >> Block translation of host virtual address while an iova range has an
> > >> invalid vaddr.
> > >>
> > >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > >> ---
> > >>  drivers/vfio/vfio_iommu_type1.c | 83 +++++++++++++++++++++++++++++++++++++----
> > >>  1 file changed, 76 insertions(+), 7 deletions(-)
> > >>
> > >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > >> index 0167996..c97573a 100644
> > >> --- a/drivers/vfio/vfio_iommu_type1.c
> > >> +++ b/drivers/vfio/vfio_iommu_type1.c
> > >> @@ -31,6 +31,7 @@
> > >>  #include <linux/rbtree.h>
> > >>  #include <linux/sched/signal.h>
> > >>  #include <linux/sched/mm.h>
> > >> +#include <linux/kthread.h>
> > >>  #include <linux/slab.h>
> > >>  #include <linux/uaccess.h>
> > >>  #include <linux/vfio.h>
> > >> @@ -75,6 +76,7 @@ struct vfio_iommu {
> > >>  	bool			dirty_page_tracking;
> > >>  	bool			pinned_page_dirty_scope;
> > >>  	bool			controlled;
> > >> +	wait_queue_head_t	vaddr_wait;
> > >>  };
> > >>  
> > >>  struct vfio_domain {
> > >> @@ -145,6 +147,8 @@ struct vfio_regions {
> > >>  #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
> > >>  #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
> > >>  
> > >> +#define WAITED 1
> > >> +
> > >>  static int put_pfn(unsigned long pfn, int prot);
> > >>  
> > >>  static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
> > >> @@ -505,6 +509,52 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> > >>  }
> > >>  
> > >>  /*
> > >> + * Wait for vaddr of *dma_p to become valid.  iommu lock is dropped if the task
> > >> + * waits, but is re-locked on return.  If the task waits, then return an updated
> > >> + * dma struct in *dma_p.  Return 0 on success with no waiting, 1 on success if    
> > > 
> > > s/1/WAITED/    
> > 
> > OK, but the WAITED state will not need to be returned in the new scheme below.
> >   
> > >> + * waited, and -errno on error.
> > >> + */
> > >> +static int vfio_vaddr_wait(struct vfio_iommu *iommu, struct vfio_dma **dma_p)
> > >> +{
> > >> +	struct vfio_dma *dma = *dma_p;
> > >> +	unsigned long iova = dma->iova;
> > >> +	size_t size = dma->size;
> > >> +	int ret = 0;
> > >> +	DEFINE_WAIT(wait);
> > >> +
> > >> +	while (!dma->vaddr_valid) {
> > >> +		ret = WAITED;
> > >> +		prepare_to_wait(&iommu->vaddr_wait, &wait, TASK_KILLABLE);
> > >> +		mutex_unlock(&iommu->lock);
> > >> +		schedule();
> > >> +		mutex_lock(&iommu->lock);
> > >> +		finish_wait(&iommu->vaddr_wait, &wait);
> > >> +		if (kthread_should_stop() || !iommu->controlled ||
> > >> +		    fatal_signal_pending(current)) {
> > >> +			return -EFAULT;
> > >> +		}
> > >> +		*dma_p = dma = vfio_find_dma(iommu, iova, size);
> > >> +		if (!dma)
> > >> +			return -EINVAL;
> > >> +	}
> > >> +	return ret;
> > >> +}
> > >> +
> > >> +/*
> > >> + * Find dma struct and wait for its vaddr to be valid.  iommu lock is dropped
> > >> + * if the task waits, but is re-locked on return.  Return result in *dma_p.
> > >> + * Return 0 on success, 1 on success if waited,  and -errno on error.
> > >> + */
> > >> +static int vfio_find_vaddr(struct vfio_iommu *iommu, dma_addr_t start,
> > >> +			   size_t size, struct vfio_dma **dma_p)    
> > > 
> > > more of a vfio_find_dma_valid()    
> > 
> > I will slightly modify and rename this with the new scheme I describe below.
> >   
> > >> +{
> > >> +	*dma_p = vfio_find_dma(iommu, start, size);
> > >> +	if (!*dma_p)
> > >> +		return -EINVAL;
> > >> +	return vfio_vaddr_wait(iommu, dma_p);
> > >> +}
> > >> +
> > >> +/*
> > >>   * Attempt to pin pages.  We really don't want to track all the pfns and
> > >>   * the iommu can only map chunks of consecutive pfns anyway, so get the
> > >>   * first page and all consecutive pages with the same locking.
> > >> @@ -693,11 +743,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> > >>  		struct vfio_pfn *vpfn;
> > >>  
> > >>  		iova = user_pfn[i] << PAGE_SHIFT;
> > >> -		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
> > >> -		if (!dma) {
> > >> -			ret = -EINVAL;
> > >> +		ret = vfio_find_vaddr(iommu, iova, PAGE_SIZE, &dma);
> > >> +		if (ret < 0)
> > >>  			goto pin_unwind;
> > >> -		}
> > >> +		else if (ret == WAITED)
> > >> +			do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);    
> > > 
> > > We're iterating through an array of pfns to provide translations, once
> > > we've released the lock it's not just the current one that could be
> > > invalid.  I'm afraid we need to unwind and try again, but I think it's
> > > actually worse than that because if we've marked pages within a
> > > vfio_dma's pfn_list as pinned, then during the locking gap it gets
> > > unmapped, the unmap path will call the unmap notifier chain to release
> > > the page that the vendor driver doesn't have yet.  Yikes!    
> > 
> > Yikes indeed.  The fix is easy, though.  I will maintain a count in vfio_iommu of 
> > vfio_dma objects with invalid vaddr's, modified and tested while holding the lock, 
> > provide a function to wait for the count to become 0, and call that function at the 
> > start of vfio_iommu_type1_pin_pages and vfio_iommu_replay.  I will use iommu->vaddr_wait 
> > for wait and wake.  
> 
> I prefer the overhead of this, but the resulting behavior seems pretty
> non-intuitive.  Any invalidated vaddr blocks all vaddr use cases, which
> almost suggests the unmap _VADDR flag should only be allowed with the
> _ALL flag, but then the map _VADDR flag can only be per mapping, which
> would make accounting and recovering from _VADDR + _ALL pretty
> complicated.  Thanks,

I wonder if there's a hybrid approach, a counter on the vfio_iommu
which when non-zero causes pin pages to pre-test vaddr on all required
vfio_dma objects, waiting and being woken on counter decrement to check
again.  Thanks,

Alex

> > >>  		if ((dma->prot & prot) != prot) {
> > >>  			ret = -EPERM;
> > >> @@ -1496,6 +1546,22 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
> > >>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > >>  	int ret;
> > >>  
> > >> +	/*
> > >> +	 * Wait for all vaddr to be valid so they can be used in the main loop.
> > >> +	 * If we do wait, the lock was dropped and re-taken, so start over to
> > >> +	 * ensure the dma list is consistent.
> > >> +	 */
> > >> +again:
> > >> +	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
> > >> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
> > >> +
> > >> +		ret = vfio_vaddr_wait(iommu, &dma);
> > >> +		if (ret < 0)
> > >> +			return ret;
> > >> +		else if (ret == WAITED)
> > >> +			goto again;
> > >> +	}    
> > > 
> > > This "do nothing until all the vaddrs are valid" approach could work
> > > above too, but gosh is that a lot of cache unfriendly work for a rare
> > > invalidation.  Thanks,    
> > 
> > The new wait function described above is fast in the common case, just a check that
> > the invalid vaddr count is 0.
> > 
> > - Steve
> >   
> > >> +
> > >>  	/* Arbitrarily pick the first domain in the list for lookups */
> > >>  	if (!list_empty(&iommu->domain_list))
> > >>  		d = list_first_entry(&iommu->domain_list,
> > >> @@ -2522,6 +2588,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
> > >>  	iommu->controlled = true;
> > >>  	mutex_init(&iommu->lock);
> > >>  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
> > >> +	init_waitqueue_head(&iommu->vaddr_wait);
> > >>  
> > >>  	return iommu;
> > >>  }
> > >> @@ -2972,12 +3039,13 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
> > >>  	struct vfio_dma *dma;
> > >>  	bool kthread = current->mm == NULL;
> > >>  	size_t offset;
> > >> +	int ret;
> > >>  
> > >>  	*copied = 0;
> > >>  
> > >> -	dma = vfio_find_dma(iommu, user_iova, 1);
> > >> -	if (!dma)
> > >> -		return -EINVAL;
> > >> +	ret = vfio_find_vaddr(iommu, user_iova, 1, &dma);
> > >> +	if (ret < 0)
> > >> +		return ret;
> > >>  
> > >>  	if ((write && !(dma->prot & IOMMU_WRITE)) ||
> > >>  			!(dma->prot & IOMMU_READ))
> > >> @@ -3055,6 +3123,7 @@ static void vfio_iommu_type1_notify(void *iommu_data, unsigned int event,
> > >>  	mutex_lock(&iommu->lock);
> > >>  	iommu->controlled = false;
> > >>  	mutex_unlock(&iommu->lock);
> > >> +	wake_up_all(&iommu->vaddr_wait);
> > >>  }
> > >>  
> > >>  static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {    
> > >     
> >   
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 9/9] vfio/type1: block on invalid vaddr
  2021-01-28 17:07         ` Alex Williamson
@ 2021-01-28 17:18           ` Steven Sistare
  0 siblings, 0 replies; 22+ messages in thread
From: Steven Sistare @ 2021-01-28 17:18 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/28/2021 12:07 PM, Alex Williamson wrote:
> On Wed, 27 Jan 2021 17:03:25 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> On Wed, 27 Jan 2021 18:25:13 -0500
>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>
>>> On 1/22/2021 5:59 PM, Alex Williamson wrote:  
>>>> On Tue, 19 Jan 2021 09:48:29 -0800
>>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>     
>>>>> Block translation of host virtual address while an iova range has an
>>>>> invalid vaddr.
>>>>>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>> ---
>>>>>  drivers/vfio/vfio_iommu_type1.c | 83 +++++++++++++++++++++++++++++++++++++----
>>>>>  1 file changed, 76 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>> index 0167996..c97573a 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -31,6 +31,7 @@
>>>>>  #include <linux/rbtree.h>
>>>>>  #include <linux/sched/signal.h>
>>>>>  #include <linux/sched/mm.h>
>>>>> +#include <linux/kthread.h>
>>>>>  #include <linux/slab.h>
>>>>>  #include <linux/uaccess.h>
>>>>>  #include <linux/vfio.h>
>>>>> @@ -75,6 +76,7 @@ struct vfio_iommu {
>>>>>  	bool			dirty_page_tracking;
>>>>>  	bool			pinned_page_dirty_scope;
>>>>>  	bool			controlled;
>>>>> +	wait_queue_head_t	vaddr_wait;
>>>>>  };
>>>>>  
>>>>>  struct vfio_domain {
>>>>> @@ -145,6 +147,8 @@ struct vfio_regions {
>>>>>  #define DIRTY_BITMAP_PAGES_MAX	 ((u64)INT_MAX)
>>>>>  #define DIRTY_BITMAP_SIZE_MAX	 DIRTY_BITMAP_BYTES(DIRTY_BITMAP_PAGES_MAX)
>>>>>  
>>>>> +#define WAITED 1
>>>>> +
>>>>>  static int put_pfn(unsigned long pfn, int prot);
>>>>>  
>>>>>  static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
>>>>> @@ -505,6 +509,52 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>>>>  }
>>>>>  
>>>>>  /*
>>>>> + * Wait for vaddr of *dma_p to become valid.  iommu lock is dropped if the task
>>>>> + * waits, but is re-locked on return.  If the task waits, then return an updated
>>>>> + * dma struct in *dma_p.  Return 0 on success with no waiting, 1 on success if    
>>>>
>>>> s/1/WAITED/    
>>>
>>> OK, but the WAITED state will not need to be returned in the new scheme below.
>>>   
>>>>> + * waited, and -errno on error.
>>>>> + */
>>>>> +static int vfio_vaddr_wait(struct vfio_iommu *iommu, struct vfio_dma **dma_p)
>>>>> +{
>>>>> +	struct vfio_dma *dma = *dma_p;
>>>>> +	unsigned long iova = dma->iova;
>>>>> +	size_t size = dma->size;
>>>>> +	int ret = 0;
>>>>> +	DEFINE_WAIT(wait);
>>>>> +
>>>>> +	while (!dma->vaddr_valid) {
>>>>> +		ret = WAITED;
>>>>> +		prepare_to_wait(&iommu->vaddr_wait, &wait, TASK_KILLABLE);
>>>>> +		mutex_unlock(&iommu->lock);
>>>>> +		schedule();
>>>>> +		mutex_lock(&iommu->lock);
>>>>> +		finish_wait(&iommu->vaddr_wait, &wait);
>>>>> +		if (kthread_should_stop() || !iommu->controlled ||
>>>>> +		    fatal_signal_pending(current)) {
>>>>> +			return -EFAULT;
>>>>> +		}
>>>>> +		*dma_p = dma = vfio_find_dma(iommu, iova, size);
>>>>> +		if (!dma)
>>>>> +			return -EINVAL;
>>>>> +	}
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Find dma struct and wait for its vaddr to be valid.  iommu lock is dropped
>>>>> + * if the task waits, but is re-locked on return.  Return result in *dma_p.
>>>>> + * Return 0 on success, 1 on success if waited,  and -errno on error.
>>>>> + */
>>>>> +static int vfio_find_vaddr(struct vfio_iommu *iommu, dma_addr_t start,
>>>>> +			   size_t size, struct vfio_dma **dma_p)    
>>>>
>>>> more of a vfio_find_dma_valid()    
>>>
>>> I will slightly modify and rename this with the new scheme I describe below.
>>>   
>>>>> +{
>>>>> +	*dma_p = vfio_find_dma(iommu, start, size);
>>>>> +	if (!*dma_p)
>>>>> +		return -EINVAL;
>>>>> +	return vfio_vaddr_wait(iommu, dma_p);
>>>>> +}
>>>>> +
>>>>> +/*
>>>>>   * Attempt to pin pages.  We really don't want to track all the pfns and
>>>>>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>>>>>   * first page and all consecutive pages with the same locking.
>>>>> @@ -693,11 +743,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>>>>  		struct vfio_pfn *vpfn;
>>>>>  
>>>>>  		iova = user_pfn[i] << PAGE_SHIFT;
>>>>> -		dma = vfio_find_dma(iommu, iova, PAGE_SIZE);
>>>>> -		if (!dma) {
>>>>> -			ret = -EINVAL;
>>>>> +		ret = vfio_find_vaddr(iommu, iova, PAGE_SIZE, &dma);
>>>>> +		if (ret < 0)
>>>>>  			goto pin_unwind;
>>>>> -		}
>>>>> +		else if (ret == WAITED)
>>>>> +			do_accounting = !IS_IOMMU_CAP_DOMAIN_IN_CONTAINER(iommu);    
>>>>
>>>> We're iterating through an array of pfns to provide translations, once
>>>> we've released the lock it's not just the current one that could be
>>>> invalid.  I'm afraid we need to unwind and try again, but I think it's
>>>> actually worse than that because if we've marked pages within a
>>>> vfio_dma's pfn_list as pinned, then during the locking gap it gets
>>>> unmapped, the unmap path will call the unmap notifier chain to release
>>>> the page that the vendor driver doesn't have yet.  Yikes!    
>>>
>>> Yikes indeed.  The fix is easy, though.  I will maintain a count in vfio_iommu of 
>>> vfio_dma objects with invalid vaddr's, modified and tested while holding the lock, 
>>> provide a function to wait for the count to become 0, and call that function at the 
>>> start of vfio_iommu_type1_pin_pages and vfio_iommu_replay.  I will use iommu->vaddr_wait 
>>> for wait and wake.  
>>
>> I prefer the overhead of this, but the resulting behavior seems pretty
>> non-intuitive.  Any invalidated vaddr blocks all vaddr use cases, which
>> almost suggests the unmap _VADDR flag should only be allowed with the
>> _ALL flag, but then the map _VADDR flag can only be per mapping, which
>> would make accounting and recovering from _VADDR + _ALL pretty
>> complicated.  Thanks,
> 
> I wonder if there's a hybrid approach, a counter on the vfio_iommu
> which when non-zero causes pin pages to pre-test vaddr on all required
> vfio_dma objects, waiting and being woken on counter decrement to check
> again.  Thanks,

Sounds good, thanks.

- Steve

>>>>>  		if ((dma->prot & prot) != prot) {
>>>>>  			ret = -EPERM;
>>>>> @@ -1496,6 +1546,22 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>>>>  	unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>>>>>  	int ret;
>>>>>  
>>>>> +	/*
>>>>> +	 * Wait for all vaddr to be valid so they can be used in the main loop.
>>>>> +	 * If we do wait, the lock was dropped and re-taken, so start over to
>>>>> +	 * ensure the dma list is consistent.
>>>>> +	 */
>>>>> +again:
>>>>> +	for (n = rb_first(&iommu->dma_list); n; n = rb_next(n)) {
>>>>> +		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
>>>>> +
>>>>> +		ret = vfio_vaddr_wait(iommu, &dma);
>>>>> +		if (ret < 0)
>>>>> +			return ret;
>>>>> +		else if (ret == WAITED)
>>>>> +			goto again;
>>>>> +	}    
>>>>
>>>> This "do nothing until all the vaddrs are valid" approach could work
>>>> above too, but gosh is that a lot of cache unfriendly work for a rare
>>>> invalidation.  Thanks,    
>>>
>>> The new wait function described above is fast in the common case, just a check that
>>> the invalid vaddr count is 0.
>>>
>>> - Steve
>>>   
>>>>> +
>>>>>  	/* Arbitrarily pick the first domain in the list for lookups */
>>>>>  	if (!list_empty(&iommu->domain_list))
>>>>>  		d = list_first_entry(&iommu->domain_list,
>>>>> @@ -2522,6 +2588,7 @@ static void *vfio_iommu_type1_open(unsigned long arg)
>>>>>  	iommu->controlled = true;
>>>>>  	mutex_init(&iommu->lock);
>>>>>  	BLOCKING_INIT_NOTIFIER_HEAD(&iommu->notifier);
>>>>> +	init_waitqueue_head(&iommu->vaddr_wait);
>>>>>  
>>>>>  	return iommu;
>>>>>  }
>>>>> @@ -2972,12 +3039,13 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
>>>>>  	struct vfio_dma *dma;
>>>>>  	bool kthread = current->mm == NULL;
>>>>>  	size_t offset;
>>>>> +	int ret;
>>>>>  
>>>>>  	*copied = 0;
>>>>>  
>>>>> -	dma = vfio_find_dma(iommu, user_iova, 1);
>>>>> -	if (!dma)
>>>>> -		return -EINVAL;
>>>>> +	ret = vfio_find_vaddr(iommu, user_iova, 1, &dma);
>>>>> +	if (ret < 0)
>>>>> +		return ret;
>>>>>  
>>>>>  	if ((write && !(dma->prot & IOMMU_WRITE)) ||
>>>>>  			!(dma->prot & IOMMU_READ))
>>>>> @@ -3055,6 +3123,7 @@ static void vfio_iommu_type1_notify(void *iommu_data, unsigned int event,
>>>>>  	mutex_lock(&iommu->lock);
>>>>>  	iommu->controlled = false;
>>>>>  	mutex_unlock(&iommu->lock);
>>>>> +	wake_up_all(&iommu->vaddr_wait);
>>>>>  }
>>>>>  
>>>>>  static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = {    
>>>>     
>>>   
>>
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH V2 0/9] vfio virtual address update
       [not found] ` <55725169-de0d-4019-f96c-ded20dfde0d7@oracle.com>
@ 2021-01-29 17:05   ` Alex Williamson
  0 siblings, 0 replies; 22+ messages in thread
From: Alex Williamson @ 2021-01-29 17:05 UTC (permalink / raw)
  To: Steven Sistare; +Cc: Cornelia Huck, Kirti Wankhede, kvm

On Fri, 29 Jan 2021 10:48:10 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> Hi Alex, thanks for the feedback on V2.  Any more comments before I submit V3? 

Nope, I'm ok with your proposals.  Thanks,

Alex

> On 1/19/2021 12:48 PM, Steve Sistare wrote:
> > Add interfaces that allow the underlying memory object of an iova range
> > to be mapped to a new virtual address in the host process:
> > 
> >   - VFIO_DMA_UNMAP_FLAG_VADDR for VFIO_IOMMU_UNMAP_DMA
> >   - VFIO_DMA_MAP_FLAG_VADDR flag for VFIO_IOMMU_MAP_DMA
> >   - VFIO_UPDATE_VADDR for VFIO_CHECK_EXTENSION
> >   - VFIO_DMA_UNMAP_FLAG_ALL for VFIO_IOMMU_UNMAP_DMA
> >   - VFIO_UNMAP_ALL for VFIO_CHECK_EXTENSION
> > 
> > Unmap-vaddr invalidates the host virtual address in an iova range and blocks
> > vfio translation of host virtual addresses, but DMA to already-mapped pages
> > continues.  Map-vaddr updates the base VA and resumes translation.  The
> > implementation supports iommu type1 and mediated devices.  Unmap-all allows
> > all ranges to be unmapped or invalidated in a single ioctl, which simplifies
> > userland code.
> > 
> > This functionality is necessary for live update, in which a host process
> > such as qemu exec's an updated version of itself, while preserving its
> > guest and vfio devices.  The process blocks vfio VA translation, exec's
> > its new self, mmap's the memory object(s) underlying vfio object, updates
> > the VA, and unblocks translation.  For a working example that uses these
> > new interfaces, see the QEMU patch series "[PATCH V2] Live Update" at
> > https://lore.kernel.org/qemu-devel/1609861330-129855-1-git-send-email-steven.sistare@oracle.com
> > 
> > Patches 1-4 define and implement the flag to unmap all ranges.
> > Patches 5-6 define and implement the flags to update vaddr.
> > Patches 7-9 add blocking to complete the implementation.
> > 
> > Changes from V1:
> >  - define a flag for unmap all instead of special range values
> >  - define the VFIO_UNMAP_ALL extension
> >  - forbid the combination of unmap-all and get-dirty-bitmap
> >  - unwind in unmap on vaddr error
> >  - add a new function to find first dma in a range instead of modifying
> >    an existing function
> >  - change names of update flags
> >  - fix concurrency bugs due to iommu lock being dropped
> >  - call down from from vfio to a new backend interface instead of up from
> >    driver to detect container close
> >  - use wait/wake instead of sleep and polling
> >  - refine the uapi specification
> >  - split patches into vfio vs type1
> > 
> > Steve Sistare (9):
> >   vfio: option to unmap all
> >   vfio/type1: find first dma
> >   vfio/type1: unmap cleanup
> >   vfio/type1: implement unmap all
> >   vfio: interfaces to update vaddr
> >   vfio/type1: implement interfaces to update vaddr
> >   vfio: iommu driver notify callback
> >   vfio/type1: implement notify callback
> >   vfio/type1: block on invalid vaddr
> > 
> >  drivers/vfio/vfio.c             |   5 +
> >  drivers/vfio/vfio_iommu_type1.c | 229 ++++++++++++++++++++++++++++++++++------
> >  include/linux/vfio.h            |   5 +
> >  include/uapi/linux/vfio.h       |  27 +++++
> >  4 files changed, 231 insertions(+), 35 deletions(-)
> >   
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-01-29 17:07 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-19 17:48 [PATCH V2 0/9] vfio virtual address update Steve Sistare
2021-01-19 17:48 ` [PATCH V2 1/9] vfio: option to unmap all Steve Sistare
2021-01-19 17:48 ` [PATCH V2 2/9] vfio/type1: find first dma Steve Sistare
2021-01-19 17:48 ` [PATCH V2 3/9] vfio/type1: unmap cleanup Steve Sistare
2021-01-19 17:48 ` [PATCH V2 4/9] vfio/type1: implement unmap all Steve Sistare
2021-01-22 21:22   ` Alex Williamson
2021-01-27 23:07     ` Steven Sistare
2021-01-19 17:48 ` [PATCH V2 5/9] vfio: interfaces to update vaddr Steve Sistare
2021-01-19 17:48 ` [PATCH V2 6/9] vfio/type1: implement " Steve Sistare
2021-01-22 21:48   ` Alex Williamson
2021-01-27 23:23     ` Steven Sistare
2021-01-19 17:48 ` [PATCH V2 7/9] vfio: iommu driver notify callback Steve Sistare
2021-01-19 17:48 ` [PATCH V2 8/9] vfio/type1: implement " Steve Sistare
2021-01-22 23:00   ` Alex Williamson
2021-01-27 23:24     ` Steven Sistare
2021-01-19 17:48 ` [PATCH V2 9/9] vfio/type1: block on invalid vaddr Steve Sistare
2021-01-22 22:59   ` Alex Williamson
2021-01-27 23:25     ` Steven Sistare
2021-01-28  0:03       ` Alex Williamson
2021-01-28 17:07         ` Alex Williamson
2021-01-28 17:18           ` Steven Sistare
     [not found] ` <55725169-de0d-4019-f96c-ded20dfde0d7@oracle.com>
2021-01-29 17:05   ` [PATCH V2 0/9] vfio virtual address update Alex Williamson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).