All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V1 0/5] vfio virtual address update
@ 2021-01-05 15:36 Steve Sistare
  2021-01-05 15:36 ` [PATCH V1 1/5] vfio: maintain dma_list order Steve Sistare
                   ` (5 more replies)
  0 siblings, 6 replies; 39+ messages in thread
From: Steve Sistare @ 2021-01-05 15:36 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Add interfaces that allow the underlying memory object of an iova range
to be mapped to a new virtual address in the host process:

  - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
  - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
  - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION

The suspend interface blocks vfio translation of host virtual addresses in
a range, but DMA to already-mapped pages continues.  The resume interface
records the new base VA and resumes translation.  The implementation
supports iommu type1 and mediated devices.

This functionality is necessary for live update, in which a host process
such as qemu exec's an updated version of itself, while preserving its
guest and vfio devices.  The process suspends vfio VA translation, exec's
its new self, mmap's the memory object(s) underlying vfio object, and
resumes VA translation.  For a working example that uses these new
interfaces, see the QEMU patch series "[PATCH V2] Live Update".

Patch 1 modifies the iova rbtree to allow iteration over ranges with gaps,
  without deleting each entry.  This is required by patch 4.
Patch 2 adds an option to unmap all ranges, which simplifies userland code.
Patch 3 adds an interface to detect if an iommu_group has a valid container,
  which patch 5 uses to release a blocked thread if a container is closed.
Patch 4 implements the new ioctl's.
Patch 5 adds blocking to complete the implementation .

Steve Sistare (5):
  vfio: maintain dma_list order
  vfio: option to unmap all
  vfio: detect closed container
  vfio: VA suspend interface
  vfio: block during VA suspend

 drivers/vfio/vfio.c             |  12 ++++
 drivers/vfio/vfio_iommu_type1.c | 122 ++++++++++++++++++++++++++++++++++------
 include/linux/vfio.h            |   1 +
 include/uapi/linux/vfio.h       |  19 ++++++-
 4 files changed, 135 insertions(+), 19 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH V1 1/5] vfio: maintain dma_list order
  2021-01-05 15:36 [PATCH V1 0/5] vfio virtual address update Steve Sistare
@ 2021-01-05 15:36 ` Steve Sistare
  2021-01-05 18:48     ` kernel test robot
  2021-01-06  0:02   ` Alex Williamson
  2021-01-05 15:36 ` [PATCH V1 2/5] vfio: option to unmap all Steve Sistare
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 39+ messages in thread
From: Steve Sistare @ 2021-01-05 15:36 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Keep entries properly sorted in the dma_list rb_tree so that iterating
over multiple entries across a range with gaps works, without requiring
one to delete each visited entry as in vfio_dma_do_unmap.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5fbf0c1..02228d0 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -157,20 +157,24 @@ static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
 static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
 				      dma_addr_t start, size_t size)
 {
+	struct vfio_dma *res = 0;
 	struct rb_node *node = iommu->dma_list.rb_node;
 
 	while (node) {
 		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
 
-		if (start + size <= dma->iova)
+		if (start < dma->iova + dma->size) {
+			res = dma;
+			if (start >= dma->iova)
+				break;
 			node = node->rb_left;
-		else if (start >= dma->iova + dma->size)
+		} else {
 			node = node->rb_right;
-		else
-			return dma;
+		}
 	}
-
-	return NULL;
+	if (res && size && res->iova >= start + size)
+		res = 0;
+	return res;
 }
 
 static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
@@ -182,7 +186,7 @@ static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
 		parent = *link;
 		dma = rb_entry(parent, struct vfio_dma, node);
 
-		if (new->iova + new->size <= dma->iova)
+		if (new->iova < dma->iova)
 			link = &(*link)->rb_left;
 		else
 			link = &(*link)->rb_right;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V1 2/5] vfio: option to unmap all
  2021-01-05 15:36 [PATCH V1 0/5] vfio virtual address update Steve Sistare
  2021-01-05 15:36 ` [PATCH V1 1/5] vfio: maintain dma_list order Steve Sistare
@ 2021-01-05 15:36 ` Steve Sistare
  2021-01-08 19:35   ` Alex Williamson
  2021-01-05 15:36 ` [PATCH V1 3/5] vfio: detect closed container Steve Sistare
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2021-01-05 15:36 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

For VFIO_IOMMU_UNMAP_DMA, delete all mappings if iova=0 and size=0.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 11 ++++++++---
 include/uapi/linux/vfio.h       |  3 ++-
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 02228d0..3dc501d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1079,6 +1079,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	size_t unmapped = 0, pgsize;
 	int ret = 0, retries = 0;
 	unsigned long pgshift;
+	dma_addr_t iova;
+	unsigned long size;
 
 	mutex_lock(&iommu->lock);
 
@@ -1090,7 +1092,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 		goto unlock;
 	}
 
-	if (!unmap->size || unmap->size & (pgsize - 1)) {
+	if ((!unmap->size && unmap->iova) || unmap->size & (pgsize - 1)) {
 		ret = -EINVAL;
 		goto unlock;
 	}
@@ -1154,8 +1156,11 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 		}
 	}
 
-	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
-		if (!iommu->v2 && unmap->iova > dma->iova)
+	iova = unmap->iova;
+	size = unmap->size ? unmap->size : SIZE_MAX;
+
+	while ((dma = vfio_find_dma(iommu, iova, size))) {
+		if (!iommu->v2 && iova > dma->iova)
 			break;
 		/*
 		 * Task with same address space who mapped this iova range is
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9204705..896e527 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1073,7 +1073,8 @@ struct vfio_bitmap {
  * Caller sets argsz.  The actual unmapped size is returned in the size
  * field.  No guarantee is made to the user that arbitrary unmaps of iova
  * or size different from those used in the original mapping call will
- * succeed.
+ * succeed.  If iova=0 and size=0, all addresses are unmapped.
+ *
  * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap
  * before unmapping IO virtual addresses. When this flag is set, the user must
  * provide a struct vfio_bitmap in data[]. User must provide zero-allocated
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V1 3/5] vfio: detect closed container
  2021-01-05 15:36 [PATCH V1 0/5] vfio virtual address update Steve Sistare
  2021-01-05 15:36 ` [PATCH V1 1/5] vfio: maintain dma_list order Steve Sistare
  2021-01-05 15:36 ` [PATCH V1 2/5] vfio: option to unmap all Steve Sistare
@ 2021-01-05 15:36 ` Steve Sistare
  2021-01-08 19:39   ` Alex Williamson
  2021-01-05 15:36 ` [PATCH V1 4/5] vfio: VA suspend interface Steve Sistare
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2021-01-05 15:36 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Add a function that detects if an iommu_group has a valid container.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio.c  | 12 ++++++++++++
 include/linux/vfio.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 262ab0e..f89ab80 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -61,6 +61,7 @@ struct vfio_container {
 	struct vfio_iommu_driver	*iommu_driver;
 	void				*iommu_data;
 	bool				noiommu;
+	bool				closed;
 };
 
 struct vfio_unbound_dev {
@@ -1223,6 +1224,7 @@ static int vfio_fops_release(struct inode *inode, struct file *filep)
 
 	filep->private_data = NULL;
 
+	container->closed = true;
 	vfio_container_put(container);
 
 	return 0;
@@ -2216,6 +2218,16 @@ void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(vfio_group_set_kvm);
 
+bool vfio_iommu_group_contained(struct iommu_group *iommu_group)
+{
+	struct vfio_group *group = vfio_group_get_from_iommu(iommu_group);
+	bool ret = group && group->container && !group->container->closed;
+
+	vfio_group_put(group);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_iommu_group_contained);
+
 static int vfio_register_group_notifier(struct vfio_group *group,
 					unsigned long *events,
 					struct notifier_block *nb)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 38d3c6a..b2724e7 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -148,6 +148,7 @@ extern int vfio_unregister_notifier(struct device *dev,
 
 struct kvm;
 extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
+extern bool vfio_iommu_group_contained(struct iommu_group *group);
 
 /*
  * Sub-module helpers
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V1 4/5] vfio: VA suspend interface
  2021-01-05 15:36 [PATCH V1 0/5] vfio virtual address update Steve Sistare
                   ` (2 preceding siblings ...)
  2021-01-05 15:36 ` [PATCH V1 3/5] vfio: detect closed container Steve Sistare
@ 2021-01-05 15:36 ` Steve Sistare
  2021-01-08 21:15   ` Alex Williamson
  2021-01-05 15:36 ` [PATCH V1 5/5] vfio: block during VA suspend Steve Sistare
  2021-01-12 21:52 ` [PATCH V1 0/5] vfio virtual address update Steven Sistare
  5 siblings, 1 reply; 39+ messages in thread
From: Steve Sistare @ 2021-01-05 15:36 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Add interfaces that allow the underlying memory object of an iova
range to be mapped to a new host virtual address in the host process:

  - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
  - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
  - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION

The suspend interface blocks vfio translation of host virtual
addresses in a range, but DMA to already-mapped pages continues.
The resume interface records the new base VA and resumes translation.
See comments in uapi/linux/vfio.h for more details.

This is a partial implementation.  Blocking is added in the next patch.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++++------
 include/uapi/linux/vfio.h       | 16 ++++++++++++++
 2 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 3dc501d..2c164a6 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -92,6 +92,7 @@ struct vfio_dma {
 	int			prot;		/* IOMMU_READ/WRITE */
 	bool			iommu_mapped;
 	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
+	bool			suspended;
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
@@ -1080,7 +1081,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 	int ret = 0, retries = 0;
 	unsigned long pgshift;
 	dma_addr_t iova;
-	unsigned long size;
+	unsigned long size, consumed;
 
 	mutex_lock(&iommu->lock);
 
@@ -1169,6 +1170,21 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 		if (dma->task->mm != current->mm)
 			break;
 
+		if (unmap->flags & VFIO_DMA_UNMAP_FLAG_SUSPEND) {
+			if (dma->suspended) {
+				ret = -EINVAL;
+				goto unlock;
+			}
+			dma->suspended = true;
+			consumed = dma->iova + dma->size - iova;
+			if (consumed >= size)
+				break;
+			iova += consumed;
+			size -= consumed;
+			unmapped += dma->size;
+			continue;
+		}
+
 		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
 			struct vfio_iommu_type1_dma_unmap nb_unmap;
 
@@ -1307,6 +1323,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
 static int vfio_dma_do_map(struct vfio_iommu *iommu,
 			   struct vfio_iommu_type1_dma_map *map)
 {
+	bool resume = map->flags & VFIO_DMA_MAP_FLAG_RESUME;
 	dma_addr_t iova = map->iova;
 	unsigned long vaddr = map->vaddr;
 	size_t size = map->size;
@@ -1324,13 +1341,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
 		prot |= IOMMU_READ;
 
+	if ((prot && resume) || (!prot && !resume))
+		return -EINVAL;
+
 	mutex_lock(&iommu->lock);
 
 	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
 
 	WARN_ON((pgsize - 1) & PAGE_MASK);
 
-	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
+	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
 		ret = -EINVAL;
 		goto out_unlock;
 	}
@@ -1341,7 +1361,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
 		goto out_unlock;
 	}
 
-	if (vfio_find_dma(iommu, iova, size)) {
+	dma = vfio_find_dma(iommu, iova, size);
+	if (resume) {
+		if (!dma) {
+			ret = -ENOENT;
+		} else if (!dma->suspended || dma->iova != iova ||
+			   dma->size != size) {
+			ret = -EINVAL;
+		} else {
+			dma->vaddr = vaddr;
+			dma->suspended = false;
+		}
+		goto out_unlock;
+	} else if (dma) {
 		ret = -EEXIST;
 		goto out_unlock;
 	}
@@ -2532,6 +2564,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
 	case VFIO_TYPE1_IOMMU:
 	case VFIO_TYPE1v2_IOMMU:
 	case VFIO_TYPE1_NESTING_IOMMU:
+	case VFIO_SUSPEND:
 		return 1;
 	case VFIO_DMA_CC_IOMMU:
 		if (!iommu)
@@ -2686,7 +2719,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
 {
 	struct vfio_iommu_type1_dma_map map;
 	unsigned long minsz;
-	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
+			VFIO_DMA_MAP_FLAG_RESUME;
 
 	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
 
@@ -2704,6 +2738,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
 {
 	struct vfio_iommu_type1_dma_unmap unmap;
 	struct vfio_bitmap bitmap = { 0 };
+	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
+			VFIO_DMA_UNMAP_FLAG_SUSPEND;
 	unsigned long minsz;
 	int ret;
 
@@ -2712,8 +2748,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
 	if (copy_from_user(&unmap, (void __user *)arg, minsz))
 		return -EFAULT;
 
-	if (unmap.argsz < minsz ||
-	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
+	if (unmap.argsz < minsz || unmap.flags & ~mask || unmap.flags == mask)
 		return -EINVAL;
 
 	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 896e527..fcf7b56 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -46,6 +46,9 @@
  */
 #define VFIO_NOIOMMU_IOMMU		8
 
+/* Supports VFIO DMA suspend and resume */
+#define VFIO_SUSPEND			9
+
 /*
  * The IOCTL interface is designed for extensibility by embedding the
  * structure length (argsz) and flags into structures passed between
@@ -1046,12 +1049,19 @@ struct vfio_iommu_type1_info_cap_migration {
  *
  * Map process virtual addresses to IO virtual addresses using the
  * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
+ *
+ * If flags & VFIO_DMA_MAP_FLAG_RESUME, record the new base vaddr for iova, and
+ * resume translation of host virtual addresses in the iova range.  The new
+ * vaddr must point to the same memory object as the old vaddr, but this is not
+ * verified.  iova and size must match those in the original MAP_DMA call.
+ * Protection is not changed, and the READ & WRITE flags must be 0.
  */
 struct vfio_iommu_type1_dma_map {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
 #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
+#define VFIO_DMA_MAP_FLAG_RESUME (1 << 2)
 	__u64	vaddr;				/* Process virtual address */
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
@@ -1084,11 +1094,17 @@ struct vfio_bitmap {
  * indicates that the page at that offset from iova is dirty. A Bitmap of the
  * pages in the range of unmapped size is returned in the user-provided
  * vfio_bitmap.data.
+ *
+ * If flags & VFIO_DMA_UNMAP_FLAG_SUSPEND, do not unmap, but suspend vfio
+ * translation of host virtual addresses in the iova range.  During suspension,
+ * kernel threads that attempt to translate will block.  DMA to already-mapped
+ * pages continues.
  */
 struct vfio_iommu_type1_dma_unmap {
 	__u32	argsz;
 	__u32	flags;
 #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
+#define VFIO_DMA_UNMAP_FLAG_SUSPEND	     (1 << 1)
 	__u64	iova;				/* IO virtual address */
 	__u64	size;				/* Size of mapping (bytes) */
 	__u8    data[];
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH V1 5/5] vfio: block during VA suspend
  2021-01-05 15:36 [PATCH V1 0/5] vfio virtual address update Steve Sistare
                   ` (3 preceding siblings ...)
  2021-01-05 15:36 ` [PATCH V1 4/5] vfio: VA suspend interface Steve Sistare
@ 2021-01-05 15:36 ` Steve Sistare
  2021-01-05 18:08     ` kernel test robot
                     ` (3 more replies)
  2021-01-12 21:52 ` [PATCH V1 0/5] vfio virtual address update Steven Sistare
  5 siblings, 4 replies; 39+ messages in thread
From: Steve Sistare @ 2021-01-05 15:36 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede, Steve Sistare

Block translation of host virtual address while an iova range is suspended.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 drivers/vfio/vfio_iommu_type1.c | 48 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2c164a6..8035b9a 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -31,6 +31,8 @@
 #include <linux/rbtree.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/mm.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
@@ -484,6 +486,34 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 	return ret;
 }
 
+static bool vfio_iommu_contained(struct vfio_iommu *iommu)
+{
+	struct vfio_domain *domain = iommu->external_domain;
+	struct vfio_group *group;
+
+	if (!domain)
+		domain = list_first_entry(&iommu->domain_list,
+					  struct vfio_domain, next);
+
+	group = list_first_entry(&domain->group_list, struct vfio_group, next);
+	return vfio_iommu_group_contained(group->iommu_group);
+}
+
+
+bool vfio_vaddr_valid(struct vfio_iommu *iommu, struct vfio_dma *dma)
+{
+	while (dma->suspended) {
+		mutex_unlock(&iommu->lock);
+		msleep_interruptible(10);
+		mutex_lock(&iommu->lock);
+		if (kthread_should_stop() || !vfio_iommu_contained(iommu) ||
+		    fatal_signal_pending(current)) {
+			return false;
+		}
+	}
+	return true;
+}
+
 /*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
@@ -690,6 +720,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 			continue;
 		}
 
+		if (!vfio_vaddr_valid(iommu, dma)) {
+			ret = -EFAULT;
+			goto pin_unwind;
+		}
+
 		remote_vaddr = dma->vaddr + (iova - dma->iova);
 		ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i],
 					     do_accounting);
@@ -1514,12 +1549,16 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 					i += PAGE_SIZE;
 				}
 			} else {
-				unsigned long pfn;
-				unsigned long vaddr = dma->vaddr +
-						     (iova - dma->iova);
+				unsigned long pfn, vaddr;
 				size_t n = dma->iova + dma->size - iova;
 				long npage;
 
+				if (!vfio_vaddr_valid(iommu, dma)) {
+					ret = -EFAULT;
+					goto unwind;
+				}
+				vaddr = dma->vaddr + (iova - dma->iova);
+
 				npage = vfio_pin_pages_remote(dma, vaddr,
 							      n >> PAGE_SHIFT,
 							      &pfn, limit);
@@ -2965,6 +3004,9 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
 	if (count > dma->size - offset)
 		count = dma->size - offset;
 
+	if (!vfio_vaddr_valid(iommu, dma))
+		goto out;
+
 	vaddr = dma->vaddr + offset;
 
 	if (write) {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 5/5] vfio: block during VA suspend
  2021-01-05 15:36 ` [PATCH V1 5/5] vfio: block during VA suspend Steve Sistare
@ 2021-01-05 18:08     ` kernel test robot
  2021-01-05 20:03     ` kernel test robot
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-01-05 18:08 UTC (permalink / raw)
  To: Steve Sistare, kvm
  Cc: kbuild-all, Alex Williamson, Cornelia Huck, Kirti Wankhede,
	Steve Sistare

[-- Attachment #1: Type: text/plain, Size: 2241 bytes --]

Hi Steve,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on vfio/next]
[also build test WARNING on v5.11-rc2 next-20210104]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Steve-Sistare/vfio-virtual-address-update/20210106-000752
base:   https://github.com/awilliam/linux-vfio.git next
config: s390-randconfig-r015-20210105 (attached as .config)
compiler: s390-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/f680b8156755f4465e94574d5421747561d23749
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Steve-Sistare/vfio-virtual-address-update/20210106-000752
        git checkout f680b8156755f4465e94574d5421747561d23749
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=s390 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/vfio/vfio_iommu_type1.c:503:6: warning: no previous prototype for 'vfio_vaddr_valid' [-Wmissing-prototypes]
     503 | bool vfio_vaddr_valid(struct vfio_iommu *iommu, struct vfio_dma *dma)
         |      ^~~~~~~~~~~~~~~~


vim +/vfio_vaddr_valid +503 drivers/vfio/vfio_iommu_type1.c

   501	
   502	
 > 503	bool vfio_vaddr_valid(struct vfio_iommu *iommu, struct vfio_dma *dma)
   504	{
   505		while (dma->suspended) {
   506			mutex_unlock(&iommu->lock);
   507			msleep_interruptible(10);
   508			mutex_lock(&iommu->lock);
   509			if (kthread_should_stop() || !vfio_iommu_contained(iommu) ||
   510			    fatal_signal_pending(current)) {
   511				return false;
   512			}
   513		}
   514		return true;
   515	}
   516	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 28839 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 5/5] vfio: block during VA suspend
@ 2021-01-05 18:08     ` kernel test robot
  0 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-01-05 18:08 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2299 bytes --]

Hi Steve,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on vfio/next]
[also build test WARNING on v5.11-rc2 next-20210104]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Steve-Sistare/vfio-virtual-address-update/20210106-000752
base:   https://github.com/awilliam/linux-vfio.git next
config: s390-randconfig-r015-20210105 (attached as .config)
compiler: s390-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/f680b8156755f4465e94574d5421747561d23749
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Steve-Sistare/vfio-virtual-address-update/20210106-000752
        git checkout f680b8156755f4465e94574d5421747561d23749
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=s390 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/vfio/vfio_iommu_type1.c:503:6: warning: no previous prototype for 'vfio_vaddr_valid' [-Wmissing-prototypes]
     503 | bool vfio_vaddr_valid(struct vfio_iommu *iommu, struct vfio_dma *dma)
         |      ^~~~~~~~~~~~~~~~


vim +/vfio_vaddr_valid +503 drivers/vfio/vfio_iommu_type1.c

   501	
   502	
 > 503	bool vfio_vaddr_valid(struct vfio_iommu *iommu, struct vfio_dma *dma)
   504	{
   505		while (dma->suspended) {
   506			mutex_unlock(&iommu->lock);
   507			msleep_interruptible(10);
   508			mutex_lock(&iommu->lock);
   509			if (kthread_should_stop() || !vfio_iommu_contained(iommu) ||
   510			    fatal_signal_pending(current)) {
   511				return false;
   512			}
   513		}
   514		return true;
   515	}
   516	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 28839 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 1/5] vfio: maintain dma_list order
  2021-01-05 15:36 ` [PATCH V1 1/5] vfio: maintain dma_list order Steve Sistare
@ 2021-01-05 18:48     ` kernel test robot
  2021-01-06  0:02   ` Alex Williamson
  1 sibling, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-01-05 18:48 UTC (permalink / raw)
  To: Steve Sistare, kvm
  Cc: kbuild-all, Alex Williamson, Cornelia Huck, Kirti Wankhede,
	Steve Sistare

[-- Attachment #1: Type: text/plain, Size: 2705 bytes --]

Hi Steve,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on vfio/next]
[also build test WARNING on v5.11-rc2 next-20210104]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Steve-Sistare/vfio-virtual-address-update/20210106-000752
base:   https://github.com/awilliam/linux-vfio.git next
config: x86_64-randconfig-s022-20210105 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-208-g46a52ca4-dirty
        # https://github.com/0day-ci/linux/commit/c7f4454de15f02ea6be7591e8ab4520e21da646e
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Steve-Sistare/vfio-virtual-address-update/20210106-000752
        git checkout c7f4454de15f02ea6be7591e8ab4520e21da646e
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


"sparse warnings: (new ones prefixed by >>)"
>> drivers/vfio/vfio_iommu_type1.c:160:32: sparse: sparse: Using plain integer as NULL pointer
   drivers/vfio/vfio_iommu_type1.c:176:23: sparse: sparse: Using plain integer as NULL pointer

vim +160 drivers/vfio/vfio_iommu_type1.c

   147	
   148	static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
   149						       struct iommu_group *iommu_group);
   150	
   151	static void update_pinned_page_dirty_scope(struct vfio_iommu *iommu);
   152	/*
   153	 * This code handles mapping and unmapping of user data buffers
   154	 * into DMA'ble space using the IOMMU
   155	 */
   156	
   157	static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
   158					      dma_addr_t start, size_t size)
   159	{
 > 160		struct vfio_dma *res = 0;
   161		struct rb_node *node = iommu->dma_list.rb_node;
   162	
   163		while (node) {
   164			struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
   165	
   166			if (start < dma->iova + dma->size) {
   167				res = dma;
   168				if (start >= dma->iova)
   169					break;
   170				node = node->rb_left;
   171			} else {
   172				node = node->rb_right;
   173			}
   174		}
   175		if (res && size && res->iova >= start + size)
   176			res = 0;
   177		return res;
   178	}
   179	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 37517 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 1/5] vfio: maintain dma_list order
@ 2021-01-05 18:48     ` kernel test robot
  0 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-01-05 18:48 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2778 bytes --]

Hi Steve,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on vfio/next]
[also build test WARNING on v5.11-rc2 next-20210104]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Steve-Sistare/vfio-virtual-address-update/20210106-000752
base:   https://github.com/awilliam/linux-vfio.git next
config: x86_64-randconfig-s022-20210105 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-208-g46a52ca4-dirty
        # https://github.com/0day-ci/linux/commit/c7f4454de15f02ea6be7591e8ab4520e21da646e
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Steve-Sistare/vfio-virtual-address-update/20210106-000752
        git checkout c7f4454de15f02ea6be7591e8ab4520e21da646e
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


"sparse warnings: (new ones prefixed by >>)"
>> drivers/vfio/vfio_iommu_type1.c:160:32: sparse: sparse: Using plain integer as NULL pointer
   drivers/vfio/vfio_iommu_type1.c:176:23: sparse: sparse: Using plain integer as NULL pointer

vim +160 drivers/vfio/vfio_iommu_type1.c

   147	
   148	static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
   149						       struct iommu_group *iommu_group);
   150	
   151	static void update_pinned_page_dirty_scope(struct vfio_iommu *iommu);
   152	/*
   153	 * This code handles mapping and unmapping of user data buffers
   154	 * into DMA'ble space using the IOMMU
   155	 */
   156	
   157	static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
   158					      dma_addr_t start, size_t size)
   159	{
 > 160		struct vfio_dma *res = 0;
   161		struct rb_node *node = iommu->dma_list.rb_node;
   162	
   163		while (node) {
   164			struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
   165	
   166			if (start < dma->iova + dma->size) {
   167				res = dma;
   168				if (start >= dma->iova)
   169					break;
   170				node = node->rb_left;
   171			} else {
   172				node = node->rb_right;
   173			}
   174		}
   175		if (res && size && res->iova >= start + size)
   176			res = 0;
   177		return res;
   178	}
   179	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 37517 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 5/5] vfio: block during VA suspend
  2021-01-05 15:36 ` [PATCH V1 5/5] vfio: block during VA suspend Steve Sistare
@ 2021-01-05 20:03     ` kernel test robot
  2021-01-05 20:03     ` kernel test robot
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-01-05 20:03 UTC (permalink / raw)
  To: Steve Sistare, kvm
  Cc: kbuild-all, Alex Williamson, Cornelia Huck, Kirti Wankhede,
	Steve Sistare

[-- Attachment #1: Type: text/plain, Size: 1800 bytes --]

Hi Steve,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on vfio/next]
[also build test WARNING on v5.11-rc2 next-20210104]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Steve-Sistare/vfio-virtual-address-update/20210106-000752
base:   https://github.com/awilliam/linux-vfio.git next
config: x86_64-randconfig-s022-20210105 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-208-g46a52ca4-dirty
        # https://github.com/0day-ci/linux/commit/f680b8156755f4465e94574d5421747561d23749
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Steve-Sistare/vfio-virtual-address-update/20210106-000752
        git checkout f680b8156755f4465e94574d5421747561d23749
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


"sparse warnings: (new ones prefixed by >>)"
   drivers/vfio/vfio_iommu_type1.c:163:32: sparse: sparse: Using plain integer as NULL pointer
   drivers/vfio/vfio_iommu_type1.c:179:23: sparse: sparse: Using plain integer as NULL pointer
>> drivers/vfio/vfio_iommu_type1.c:503:6: sparse: sparse: symbol 'vfio_vaddr_valid' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 37517 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 5/5] vfio: block during VA suspend
@ 2021-01-05 20:03     ` kernel test robot
  0 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-01-05 20:03 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 1840 bytes --]

Hi Steve,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on vfio/next]
[also build test WARNING on v5.11-rc2 next-20210104]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Steve-Sistare/vfio-virtual-address-update/20210106-000752
base:   https://github.com/awilliam/linux-vfio.git next
config: x86_64-randconfig-s022-20210105 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-208-g46a52ca4-dirty
        # https://github.com/0day-ci/linux/commit/f680b8156755f4465e94574d5421747561d23749
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Steve-Sistare/vfio-virtual-address-update/20210106-000752
        git checkout f680b8156755f4465e94574d5421747561d23749
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


"sparse warnings: (new ones prefixed by >>)"
   drivers/vfio/vfio_iommu_type1.c:163:32: sparse: sparse: Using plain integer as NULL pointer
   drivers/vfio/vfio_iommu_type1.c:179:23: sparse: sparse: Using plain integer as NULL pointer
>> drivers/vfio/vfio_iommu_type1.c:503:6: sparse: sparse: symbol 'vfio_vaddr_valid' was not declared. Should it be static?

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 37517 bytes --]

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC PATCH] vfio: vfio_vaddr_valid() can be static
  2021-01-05 15:36 ` [PATCH V1 5/5] vfio: block during VA suspend Steve Sistare
@ 2021-01-05 20:03     ` kernel test robot
  2021-01-05 20:03     ` kernel test robot
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-01-05 20:03 UTC (permalink / raw)
  To: Steve Sistare, kvm
  Cc: kbuild-all, Alex Williamson, Cornelia Huck, Kirti Wankhede,
	Steve Sistare


Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
---
 vfio_iommu_type1.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index cc9b61d8b57bab..697e661f8295f8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -500,7 +500,7 @@ static bool vfio_iommu_contained(struct vfio_iommu *iommu)
 }
 
 
-bool vfio_vaddr_valid(struct vfio_iommu *iommu, struct vfio_dma *dma)
+static bool vfio_vaddr_valid(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
 	while (dma->suspended) {
 		mutex_unlock(&iommu->lock);

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [RFC PATCH] vfio: vfio_vaddr_valid() can be static
@ 2021-01-05 20:03     ` kernel test robot
  0 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2021-01-05 20:03 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 692 bytes --]


Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
---
 vfio_iommu_type1.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index cc9b61d8b57bab..697e661f8295f8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -500,7 +500,7 @@ static bool vfio_iommu_contained(struct vfio_iommu *iommu)
 }
 
 
-bool vfio_vaddr_valid(struct vfio_iommu *iommu, struct vfio_dma *dma)
+static bool vfio_vaddr_valid(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
 	while (dma->suspended) {
 		mutex_unlock(&iommu->lock);

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 1/5] vfio: maintain dma_list order
  2021-01-05 15:36 ` [PATCH V1 1/5] vfio: maintain dma_list order Steve Sistare
  2021-01-05 18:48     ` kernel test robot
@ 2021-01-06  0:02   ` Alex Williamson
  2021-01-06 14:50     ` Steven Sistare
  1 sibling, 1 reply; 39+ messages in thread
From: Alex Williamson @ 2021-01-06  0:02 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

Hi Steven,

On Tue,  5 Jan 2021 07:36:49 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> Keep entries properly sorted in the dma_list rb_tree

Nothing here changes the order of entries in the tree, they're already
sorted.  The second chunk is the only thing that touches the tree
construction, but that appears to be just a micro optimization that
we've already used vfio_find_dma() to verify that a new entry doesn't
overlap any existing entries, therefore if the start of the new entry
is less than the test entry, the end must also be less.  The tree is
not changed afaict.

> so that iterating
> over multiple entries across a range with gaps works, without requiring
> one to delete each visited entry as in vfio_dma_do_unmap.

As above, I don't see that the tree is changed, so this is just a
manipulation of our search function, changing it from a "find any
vfio_dma within this range" to a "find the vfio_dma with the lowest
iova with this range".  But find-any and find-first are computationally
different, so I don't think we should blindly replace one with the
other.  Wouldn't it make more sense to add a vfio_find_first_dma()
function in patch 4/ to handle this case?  Thanks,

Alex

> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 5fbf0c1..02228d0 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -157,20 +157,24 @@ static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
>  static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
>  				      dma_addr_t start, size_t size)
>  {
> +	struct vfio_dma *res = 0;
>  	struct rb_node *node = iommu->dma_list.rb_node;
>  
>  	while (node) {
>  		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
>  
> -		if (start + size <= dma->iova)
> +		if (start < dma->iova + dma->size) {
> +			res = dma;
> +			if (start >= dma->iova)
> +				break;
>  			node = node->rb_left;
> -		else if (start >= dma->iova + dma->size)
> +		} else {
>  			node = node->rb_right;
> -		else
> -			return dma;
> +		}
>  	}
> -
> -	return NULL;
> +	if (res && size && res->iova >= start + size)
> +		res = 0;
> +	return res;
>  }
>  
>  static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
> @@ -182,7 +186,7 @@ static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
>  		parent = *link;
>  		dma = rb_entry(parent, struct vfio_dma, node);
>  
> -		if (new->iova + new->size <= dma->iova)
> +		if (new->iova < dma->iova)
>  			link = &(*link)->rb_left;
>  		else
>  			link = &(*link)->rb_right;


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 1/5] vfio: maintain dma_list order
  2021-01-06  0:02   ` Alex Williamson
@ 2021-01-06 14:50     ` Steven Sistare
  0 siblings, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2021-01-06 14:50 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/5/2021 7:02 PM, Alex Williamson wrote:
> Hi Steven,
> 
> On Tue,  5 Jan 2021 07:36:49 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Keep entries properly sorted in the dma_list rb_tree
> 
> Nothing here changes the order of entries in the tree, they're already
> sorted.  The second chunk is the only thing that touches the tree
> construction, but that appears to be just a micro optimization that
> we've already used vfio_find_dma() to verify that a new entry doesn't
> overlap any existing entries, therefore if the start of the new entry
> is less than the test entry, the end must also be less.  The tree is
> not changed afaict.

Agreed.  Bad explanation on my part.

>> so that iterating
>> over multiple entries across a range with gaps works, without requiring
>> one to delete each visited entry as in vfio_dma_do_unmap.
> 
> As above, I don't see that the tree is changed, so this is just a
> manipulation of our search function, changing it from a "find any
> vfio_dma within this range" to a "find the vfio_dma with the lowest
> iova with this range".  But find-any and find-first are computationally
> different, so I don't think we should blindly replace one with the
> other.  Wouldn't it make more sense to add a vfio_find_first_dma()
> function in patch 4/ to handle this case?  Thanks,

Sure, will do.

- Steve

> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++-------
>>  1 file changed, 11 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 5fbf0c1..02228d0 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -157,20 +157,24 @@ static struct vfio_group *vfio_iommu_find_iommu_group(struct vfio_iommu *iommu,
>>  static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
>>  				      dma_addr_t start, size_t size)
>>  {
>> +	struct vfio_dma *res = 0;
>>  	struct rb_node *node = iommu->dma_list.rb_node;
>>  
>>  	while (node) {
>>  		struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
>>  
>> -		if (start + size <= dma->iova)
>> +		if (start < dma->iova + dma->size) {
>> +			res = dma;
>> +			if (start >= dma->iova)
>> +				break;
>>  			node = node->rb_left;
>> -		else if (start >= dma->iova + dma->size)
>> +		} else {
>>  			node = node->rb_right;
>> -		else
>> -			return dma;
>> +		}
>>  	}
>> -
>> -	return NULL;
>> +	if (res && size && res->iova >= start + size)
>> +		res = 0;
>> +	return res;
>>  }
>>  
>>  static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
>> @@ -182,7 +186,7 @@ static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
>>  		parent = *link;
>>  		dma = rb_entry(parent, struct vfio_dma, node);
>>  
>> -		if (new->iova + new->size <= dma->iova)
>> +		if (new->iova < dma->iova)
>>  			link = &(*link)->rb_left;
>>  		else
>>  			link = &(*link)->rb_right;
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 5/5] vfio: block during VA suspend
  2021-01-05 20:03     ` kernel test robot
  (?)
@ 2021-01-07 15:17     ` Steven Sistare
  -1 siblings, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2021-01-07 15:17 UTC (permalink / raw)
  To: kvm; +Cc: Alex Williamson, Cornelia Huck, Kirti Wankhede

I will fix the warnings in V2, and make W=1 C=1 for all my submissions!

- Steve

On 1/5/2021 3:03 PM, kernel test robot wrote:
> Hi Steve,
> 
> Thank you for the patch! Perhaps something to improve:
> 
> [auto build test WARNING on vfio/next]
> [also build test WARNING on v5.11-rc2 next-20210104]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
> 
> url:    https://github.com/0day-ci/linux/commits/Steve-Sistare/vfio-virtual-address-update/20210106-000752
> base:   https://github.com/awilliam/linux-vfio.git next
> config: x86_64-randconfig-s022-20210105 (attached as .config)
> compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
> reproduce:
>         # apt-get install sparse
>         # sparse version: v0.6.3-208-g46a52ca4-dirty
>         # https://github.com/0day-ci/linux/commit/f680b8156755f4465e94574d5421747561d23749
>         git remote add linux-review https://github.com/0day-ci/linux
>         git fetch --no-tags linux-review Steve-Sistare/vfio-virtual-address-update/20210106-000752
>         git checkout f680b8156755f4465e94574d5421747561d23749
>         # save the attached .config to linux build tree
>         make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=x86_64 
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> 
> 
> "sparse warnings: (new ones prefixed by >>)"
>    drivers/vfio/vfio_iommu_type1.c:163:32: sparse: sparse: Using plain integer as NULL pointer
>    drivers/vfio/vfio_iommu_type1.c:179:23: sparse: sparse: Using plain integer as NULL pointer
>>> drivers/vfio/vfio_iommu_type1.c:503:6: sparse: sparse: symbol 'vfio_vaddr_valid' was not declared. Should it be static?
> 
> Please review and possibly fold the followup patch.
> 
> ---
> 0-DAY CI Kernel Test Service, Intel Corporation
> https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 2/5] vfio: option to unmap all
  2021-01-05 15:36 ` [PATCH V1 2/5] vfio: option to unmap all Steve Sistare
@ 2021-01-08 19:35   ` Alex Williamson
  2021-01-11 21:09     ` Steven Sistare
  0 siblings, 1 reply; 39+ messages in thread
From: Alex Williamson @ 2021-01-08 19:35 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

Hi Steve,

On Tue,  5 Jan 2021 07:36:50 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> For VFIO_IOMMU_UNMAP_DMA, delete all mappings if iova=0 and size=0.

Only the latter is invalid, iova=0 is not special, so does it make
sense to use this combination to invoke something special?  It seems
like it opens the door that any size less than the minimum mapping
granularity means something special.

Why not use a flag to trigger an unmap-all?

Does userspace have any means to know this is supported other than to
test it before creating any mappings?

What's the intended interaction with retrieving the dirty bitmap during
an unmap-all?

> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 11 ++++++++---
>  include/uapi/linux/vfio.h       |  3 ++-
>  2 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 02228d0..3dc501d 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1079,6 +1079,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  	size_t unmapped = 0, pgsize;
>  	int ret = 0, retries = 0;
>  	unsigned long pgshift;
> +	dma_addr_t iova;
> +	unsigned long size;
>  
>  	mutex_lock(&iommu->lock);
>  
> @@ -1090,7 +1092,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  		goto unlock;
>  	}
>  
> -	if (!unmap->size || unmap->size & (pgsize - 1)) {
> +	if ((!unmap->size && unmap->iova) || unmap->size & (pgsize - 1)) {
>  		ret = -EINVAL;
>  		goto unlock;
>  	}
> @@ -1154,8 +1156,11 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,

It looks like the code just above this would have an issue if there are
dma mappings at iova=0.

>  		}
>  	}
>  
> -	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
> -		if (!iommu->v2 && unmap->iova > dma->iova)
> +	iova = unmap->iova;
> +	size = unmap->size ? unmap->size : SIZE_MAX;

AFAICT the only difference of this versus the user calling the unmap
with iova=0 size=SIZE_MAX is that SIZE_MAX will throw an -EINVAL due to
page size alignment.  If we assume there are no IOMMUs with 1 byte page
size, the special combination could instead be {0, SIZE_MAX}.  Or the
caller could just track a high water mark for their mappings and use
the interface that exists.  Thanks,

Alex

> +
> +	while ((dma = vfio_find_dma(iommu, iova, size))) {
> +		if (!iommu->v2 && iova > dma->iova)
>  			break;
>  		/*
>  		 * Task with same address space who mapped this iova range is
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9204705..896e527 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1073,7 +1073,8 @@ struct vfio_bitmap {
>   * Caller sets argsz.  The actual unmapped size is returned in the size
>   * field.  No guarantee is made to the user that arbitrary unmaps of iova
>   * or size different from those used in the original mapping call will
> - * succeed.
> + * succeed.  If iova=0 and size=0, all addresses are unmapped.
> + *
>   * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap
>   * before unmapping IO virtual addresses. When this flag is set, the user must
>   * provide a struct vfio_bitmap in data[]. User must provide zero-allocated


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 3/5] vfio: detect closed container
  2021-01-05 15:36 ` [PATCH V1 3/5] vfio: detect closed container Steve Sistare
@ 2021-01-08 19:39   ` Alex Williamson
  2021-01-11 21:12     ` Steven Sistare
  0 siblings, 1 reply; 39+ messages in thread
From: Alex Williamson @ 2021-01-08 19:39 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Tue,  5 Jan 2021 07:36:51 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> Add a function that detects if an iommu_group has a valid container.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/vfio.c  | 12 ++++++++++++
>  include/linux/vfio.h |  1 +
>  2 files changed, 13 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 262ab0e..f89ab80 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -61,6 +61,7 @@ struct vfio_container {
>  	struct vfio_iommu_driver	*iommu_driver;
>  	void				*iommu_data;
>  	bool				noiommu;
> +	bool				closed;
>  };
>  
>  struct vfio_unbound_dev {
> @@ -1223,6 +1224,7 @@ static int vfio_fops_release(struct inode *inode, struct file *filep)
>  
>  	filep->private_data = NULL;
>  
> +	container->closed = true;
>  	vfio_container_put(container);
>  
>  	return 0;
> @@ -2216,6 +2218,16 @@ void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
>  }
>  EXPORT_SYMBOL_GPL(vfio_group_set_kvm);
>  
> +bool vfio_iommu_group_contained(struct iommu_group *iommu_group)
> +{
> +	struct vfio_group *group = vfio_group_get_from_iommu(iommu_group);
> +	bool ret = group && group->container && !group->container->closed;
> +
> +	vfio_group_put(group);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_iommu_group_contained);

This seems like a pointless interface, the result is immediately stale.
Anything that relies on the container staying open needs to add itself
as a user.  We already have some interfaces for that, ex.
vfio_group_get_external_user().  Thanks,

Alex

> +
>  static int vfio_register_group_notifier(struct vfio_group *group,
>  					unsigned long *events,
>  					struct notifier_block *nb)
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 38d3c6a..b2724e7 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -148,6 +148,7 @@ extern int vfio_unregister_notifier(struct device *dev,
>  
>  struct kvm;
>  extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
> +extern bool vfio_iommu_group_contained(struct iommu_group *group);
>  
>  /*
>   * Sub-module helpers


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 4/5] vfio: VA suspend interface
  2021-01-05 15:36 ` [PATCH V1 4/5] vfio: VA suspend interface Steve Sistare
@ 2021-01-08 21:15   ` Alex Williamson
  2021-01-11 21:15     ` Steven Sistare
  0 siblings, 1 reply; 39+ messages in thread
From: Alex Williamson @ 2021-01-08 21:15 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Tue,  5 Jan 2021 07:36:52 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> Add interfaces that allow the underlying memory object of an iova
> range to be mapped to a new host virtual address in the host process:
> 
>   - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
>   - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
>   - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION

Suspend and Resume can imply many things other than what's done here.
Should these be something more akin to INVALIDATE_VADDR and
REPLACE_VADDR?

> 
> The suspend interface blocks vfio translation of host virtual
> addresses in a range, but DMA to already-mapped pages continues.
> The resume interface records the new base VA and resumes translation.
> See comments in uapi/linux/vfio.h for more details.
> 
> This is a partial implementation.  Blocking is added in the next patch.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++++------
>  include/uapi/linux/vfio.h       | 16 ++++++++++++++
>  2 files changed, 57 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 3dc501d..2c164a6 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -92,6 +92,7 @@ struct vfio_dma {
>  	int			prot;		/* IOMMU_READ/WRITE */
>  	bool			iommu_mapped;
>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
> +	bool			suspended;

Is there a value we could use for vfio_dma.vaddr that would always be
considered invalid, ex. ULONG_MAX?  We'd need to decide if we want to
allow users to create mappings (mdev-only) using an initial invalid
vaddr.

>  	struct task_struct	*task;
>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>  	unsigned long		*bitmap;
> @@ -1080,7 +1081,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  	int ret = 0, retries = 0;
>  	unsigned long pgshift;
>  	dma_addr_t iova;
> -	unsigned long size;
> +	unsigned long size, consumed;

This could be scoped into the branch below.

>  
>  	mutex_lock(&iommu->lock);
>  
> @@ -1169,6 +1170,21 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>  		if (dma->task->mm != current->mm)
>  			break;
>  
> +		if (unmap->flags & VFIO_DMA_UNMAP_FLAG_SUSPEND) {
> +			if (dma->suspended) {
> +				ret = -EINVAL;
> +				goto unlock;
> +			}

This leaves us in a state where we marked some entries but not others.
We should either unwind or... what's the actual harm in skipping these?

> +			dma->suspended = true;
> +			consumed = dma->iova + dma->size - iova;
> +			if (consumed >= size)
> +				break;
> +			iova += consumed;
> +			size -= consumed;
> +			unmapped += dma->size;
> +			continue;
> +		}

This short-cuts the dirty bitmap flag, so we need to decide if it's
legal to call them together or we need to prevent it... Oh, I see
you've excluded them earlier below.

> +
>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
>  
> @@ -1307,6 +1323,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  			   struct vfio_iommu_type1_dma_map *map)
>  {
> +	bool resume = map->flags & VFIO_DMA_MAP_FLAG_RESUME;
>  	dma_addr_t iova = map->iova;
>  	unsigned long vaddr = map->vaddr;
>  	size_t size = map->size;
> @@ -1324,13 +1341,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
>  		prot |= IOMMU_READ;
>  
> +	if ((prot && resume) || (!prot && !resume))
> +		return -EINVAL;
> +
>  	mutex_lock(&iommu->lock);
>  
>  	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
>  
>  	WARN_ON((pgsize - 1) & PAGE_MASK);
>  
> -	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
> +	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
>  		ret = -EINVAL;
>  		goto out_unlock;
>  	}
> @@ -1341,7 +1361,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  		goto out_unlock;
>  	}
>  
> -	if (vfio_find_dma(iommu, iova, size)) {
> +	dma = vfio_find_dma(iommu, iova, size);
> +	if (resume) {
> +		if (!dma) {
> +			ret = -ENOENT;
> +		} else if (!dma->suspended || dma->iova != iova ||
> +			   dma->size != size) {

Why is it necessary that the vfio_dma be suspended before being
resumed?  Couldn't a user simply use this to change the vaddr?  Does
that promote abusive use?

> +			ret = -EINVAL;
> +		} else {
> +			dma->vaddr = vaddr;

Seems like there's a huge opportunity for a user to create coherency
issues here... it's their data though I guess.

> +			dma->suspended = false;
> +		}
> +		goto out_unlock;
> +	} else if (dma) {
>  		ret = -EEXIST;
>  		goto out_unlock;
>  	}
> @@ -2532,6 +2564,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>  	case VFIO_TYPE1_IOMMU:
>  	case VFIO_TYPE1v2_IOMMU:
>  	case VFIO_TYPE1_NESTING_IOMMU:
> +	case VFIO_SUSPEND:
>  		return 1;
>  	case VFIO_DMA_CC_IOMMU:
>  		if (!iommu)
> @@ -2686,7 +2719,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
>  {
>  	struct vfio_iommu_type1_dma_map map;
>  	unsigned long minsz;
> -	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> +	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
> +			VFIO_DMA_MAP_FLAG_RESUME;
>  
>  	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>  
> @@ -2704,6 +2738,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>  {
>  	struct vfio_iommu_type1_dma_unmap unmap;
>  	struct vfio_bitmap bitmap = { 0 };
> +	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
> +			VFIO_DMA_UNMAP_FLAG_SUSPEND;
>  	unsigned long minsz;
>  	int ret;
>  
> @@ -2712,8 +2748,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>  	if (copy_from_user(&unmap, (void __user *)arg, minsz))
>  		return -EFAULT;
>  
> -	if (unmap.argsz < minsz ||
> -	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
> +	if (unmap.argsz < minsz || unmap.flags & ~mask || unmap.flags == mask)

Maybe a short comment here to note that dirty-bimap and
suspend/invalidate are mutually exclusive.  Probably should be
mentioned in the uapi too.

>  		return -EINVAL;
>  
>  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 896e527..fcf7b56 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -46,6 +46,9 @@
>   */
>  #define VFIO_NOIOMMU_IOMMU		8
>  
> +/* Supports VFIO DMA suspend and resume */
> +#define VFIO_SUSPEND			9
> +
>  /*
>   * The IOCTL interface is designed for extensibility by embedding the
>   * structure length (argsz) and flags into structures passed between
> @@ -1046,12 +1049,19 @@ struct vfio_iommu_type1_info_cap_migration {
>   *
>   * Map process virtual addresses to IO virtual addresses using the
>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
> + *
> + * If flags & VFIO_DMA_MAP_FLAG_RESUME, record the new base vaddr for iova, and
> + * resume translation of host virtual addresses in the iova range.  The new
> + * vaddr must point to the same memory object as the old vaddr, but this is not
> + * verified.

It's hard to use "must" terminology here if we're not going to check.
Maybe the phrasing should be something more along the lines of "should
point to the same memory object or the user risks coherency issues
within their virtual address space".

>  iova and size must match those in the original MAP_DMA call.
> + * Protection is not changed, and the READ & WRITE flags must be 0.

This doesn't mention that the entry must be previously
suspended/invalidated (if we choose to keep those semantics).  Thanks,

Alex

>   */
>  struct vfio_iommu_type1_dma_map {
>  	__u32	argsz;
>  	__u32	flags;
>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
> +#define VFIO_DMA_MAP_FLAG_RESUME (1 << 2)
>  	__u64	vaddr;				/* Process virtual address */
>  	__u64	iova;				/* IO virtual address */
>  	__u64	size;				/* Size of mapping (bytes) */
> @@ -1084,11 +1094,17 @@ struct vfio_bitmap {
>   * indicates that the page at that offset from iova is dirty. A Bitmap of the
>   * pages in the range of unmapped size is returned in the user-provided
>   * vfio_bitmap.data.
> + *
> + * If flags & VFIO_DMA_UNMAP_FLAG_SUSPEND, do not unmap, but suspend vfio
> + * translation of host virtual addresses in the iova range.  During suspension,
> + * kernel threads that attempt to translate will block.  DMA to already-mapped
> + * pages continues.
>   */
>  struct vfio_iommu_type1_dma_unmap {
>  	__u32	argsz;
>  	__u32	flags;
>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> +#define VFIO_DMA_UNMAP_FLAG_SUSPEND	     (1 << 1)
>  	__u64	iova;				/* IO virtual address */
>  	__u64	size;				/* Size of mapping (bytes) */
>  	__u8    data[];


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 5/5] vfio: block during VA suspend
  2021-01-05 15:36 ` [PATCH V1 5/5] vfio: block during VA suspend Steve Sistare
                     ` (2 preceding siblings ...)
  2021-01-05 20:03     ` kernel test robot
@ 2021-01-08 21:32   ` Alex Williamson
  2021-01-11 21:16     ` Steven Sistare
  3 siblings, 1 reply; 39+ messages in thread
From: Alex Williamson @ 2021-01-08 21:32 UTC (permalink / raw)
  To: Steve Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Tue,  5 Jan 2021 07:36:53 -0800
Steve Sistare <steven.sistare@oracle.com> wrote:

> Block translation of host virtual address while an iova range is suspended.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 48 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 45 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 2c164a6..8035b9a 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -31,6 +31,8 @@
>  #include <linux/rbtree.h>
>  #include <linux/sched/signal.h>
>  #include <linux/sched/mm.h>
> +#include <linux/delay.h>
> +#include <linux/kthread.h>
>  #include <linux/slab.h>
>  #include <linux/uaccess.h>
>  #include <linux/vfio.h>
> @@ -484,6 +486,34 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>  	return ret;
>  }
>  
> +static bool vfio_iommu_contained(struct vfio_iommu *iommu)
> +{
> +	struct vfio_domain *domain = iommu->external_domain;
> +	struct vfio_group *group;
> +
> +	if (!domain)
> +		domain = list_first_entry(&iommu->domain_list,
> +					  struct vfio_domain, next);
> +
> +	group = list_first_entry(&domain->group_list, struct vfio_group, next);
> +	return vfio_iommu_group_contained(group->iommu_group);
> +}

This seems really backwards for a vfio iommu backend to ask vfio-core
whether its internal object is active.

> +
> +
> +bool vfio_vaddr_valid(struct vfio_iommu *iommu, struct vfio_dma *dma)
> +{
> +	while (dma->suspended) {
> +		mutex_unlock(&iommu->lock);
> +		msleep_interruptible(10);
> +		mutex_lock(&iommu->lock);
> +		if (kthread_should_stop() || !vfio_iommu_contained(iommu) ||
> +		    fatal_signal_pending(current)) {
> +			return false;
> +		}
> +	}
> +	return true;
> +}
> +
>  /*
>   * Attempt to pin pages.  We really don't want to track all the pfns and
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
> @@ -690,6 +720,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>  			continue;
>  		}
>  
> +		if (!vfio_vaddr_valid(iommu, dma)) {
> +			ret = -EFAULT;
> +			goto pin_unwind;
> +		}

This could have dropped iommu->lock, we have no basis to believe our
vfio_dma object is still valid.  Also this is called with an elevated
reference to the container, so we theoretically should not need to test
whether the iommu is still contained.

> +
>  		remote_vaddr = dma->vaddr + (iova - dma->iova);
>  		ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i],
>  					     do_accounting);
> @@ -1514,12 +1549,16 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>  					i += PAGE_SIZE;
>  				}
>  			} else {
> -				unsigned long pfn;
> -				unsigned long vaddr = dma->vaddr +
> -						     (iova - dma->iova);
> +				unsigned long pfn, vaddr;
>  				size_t n = dma->iova + dma->size - iova;
>  				long npage;
>  
> +				if (!vfio_vaddr_valid(iommu, dma)) {
> +					ret = -EFAULT;
> +					goto unwind;
> +				}

This one is even scarier, do we have any basis to assume either object
is still valid after iommu->lock is released?  We can only get here if
we're attaching a group to the iommu with suspended vfio_dmas.  Do we
need to allow that?  It seems we could -EAGAIN and let the user figure
out that they can't attach a group while mappings have invalid vaddrs.

> +				vaddr = dma->vaddr + (iova - dma->iova);
> +
>  				npage = vfio_pin_pages_remote(dma, vaddr,
>  							      n >> PAGE_SHIFT,
>  							      &pfn, limit);
> @@ -2965,6 +3004,9 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
>  	if (count > dma->size - offset)
>  		count = dma->size - offset;
>  
> +	if (!vfio_vaddr_valid(iommu, dma))

Same as the pinning path, this should hold a container user reference
but we cannot assume our vfio_dma object is valid if we've released the
lock.  Thanks,

Alex

> +		goto out;
> +
>  	vaddr = dma->vaddr + offset;
>  
>  	if (write) {


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 2/5] vfio: option to unmap all
  2021-01-08 19:35   ` Alex Williamson
@ 2021-01-11 21:09     ` Steven Sistare
  2021-01-13 19:41       ` Alex Williamson
  0 siblings, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2021-01-11 21:09 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/8/2021 2:35 PM, Alex Williamson wrote:
> Hi Steve,
> 
> On Tue,  5 Jan 2021 07:36:50 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> For VFIO_IOMMU_UNMAP_DMA, delete all mappings if iova=0 and size=0.
> 
> Only the latter is invalid, iova=0 is not special, so does it make
> sense to use this combination to invoke something special?  It seems
> like it opens the door that any size less than the minimum mapping
> granularity means something special.
> 
> Why not use a flag to trigger an unmap-all?

Hi Alex, that would be fine.

> Does userspace have any means to know this is supported other than to
> test it before creating any mappings?

Not currently.  We could overload VFIO_SUSPEND, or define a new extension code.
 
> What's the intended interaction with retrieving the dirty bitmap during
> an unmap-all?

Undefined and broken if there are gaps between segments :(  Good catch, thanks.  
I will disallow the combination of unmap-all and get-dirty-bitmap.

>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 11 ++++++++---
>>  include/uapi/linux/vfio.h       |  3 ++-
>>  2 files changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 02228d0..3dc501d 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -1079,6 +1079,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  	size_t unmapped = 0, pgsize;
>>  	int ret = 0, retries = 0;
>>  	unsigned long pgshift;
>> +	dma_addr_t iova;
>> +	unsigned long size;
>>  
>>  	mutex_lock(&iommu->lock);
>>  
>> @@ -1090,7 +1092,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  		goto unlock;
>>  	}
>>  
>> -	if (!unmap->size || unmap->size & (pgsize - 1)) {
>> +	if ((!unmap->size && unmap->iova) || unmap->size & (pgsize - 1)) {
>>  		ret = -EINVAL;
>>  		goto unlock;
>>  	}
>> @@ -1154,8 +1156,11 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> 
> It looks like the code just above this would have an issue if there are
> dma mappings at iova=0.

Are you referring to this code?

        if (iommu->v2) {
                dma = vfio_find_dma(iommu, unmap->iova, 1);
                if (dma && dma->iova != unmap->iova) {
                        ret = -EINVAL;

Both unmap->iova and dma->iova would be 0, so I don't see the problem.

>>  		}
>>  	}
>>  
>> -	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
>> -		if (!iommu->v2 && unmap->iova > dma->iova)
>> +	iova = unmap->iova;
>> +	size = unmap->size ? unmap->size : SIZE_MAX;
> 
> AFAICT the only difference of this versus the user calling the unmap
> with iova=0 size=SIZE_MAX is that SIZE_MAX will throw an -EINVAL due to
> page size alignment.  If we assume there are no IOMMUs with 1 byte page
> size, the special combination could instead be {0, SIZE_MAX}.  

Fine, but we would still need to document it specifically so the user knows that 
the unaligned SIZE_MAX does not return EINVAL.

> Or the
> caller could just track a high water mark for their mappings and use
> the interface that exists.  Thanks,

I am trying to avoid the need to modify existing code, for legacy qemu live update.
Either a new flag or {0, SIZE_MAX} is suitable.  Which do you prefer?

- Steve
 
>> +
>> +	while ((dma = vfio_find_dma(iommu, iova, size))) {
>> +		if (!iommu->v2 && iova > dma->iova)
>>  			break;
>>  		/*
>>  		 * Task with same address space who mapped this iova range is
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 9204705..896e527 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -1073,7 +1073,8 @@ struct vfio_bitmap {
>>   * Caller sets argsz.  The actual unmapped size is returned in the size
>>   * field.  No guarantee is made to the user that arbitrary unmaps of iova
>>   * or size different from those used in the original mapping call will
>> - * succeed.
>> + * succeed.  If iova=0 and size=0, all addresses are unmapped.
>> + *
>>   * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap
>>   * before unmapping IO virtual addresses. When this flag is set, the user must
>>   * provide a struct vfio_bitmap in data[]. User must provide zero-allocated
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 3/5] vfio: detect closed container
  2021-01-08 19:39   ` Alex Williamson
@ 2021-01-11 21:12     ` Steven Sistare
  2021-01-13 19:26       ` Alex Williamson
  0 siblings, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2021-01-11 21:12 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/8/2021 2:39 PM, Alex Williamson wrote:
> On Tue,  5 Jan 2021 07:36:51 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Add a function that detects if an iommu_group has a valid container.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vfio/vfio.c  | 12 ++++++++++++
>>  include/linux/vfio.h |  1 +
>>  2 files changed, 13 insertions(+)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index 262ab0e..f89ab80 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -61,6 +61,7 @@ struct vfio_container {
>>  	struct vfio_iommu_driver	*iommu_driver;
>>  	void				*iommu_data;
>>  	bool				noiommu;
>> +	bool				closed;
>>  };
>>  
>>  struct vfio_unbound_dev {
>> @@ -1223,6 +1224,7 @@ static int vfio_fops_release(struct inode *inode, struct file *filep)
>>  
>>  	filep->private_data = NULL;
>>  
>> +	container->closed = true;
>>  	vfio_container_put(container);
>>  
>>  	return 0;
>> @@ -2216,6 +2218,16 @@ void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_group_set_kvm);
>>  
>> +bool vfio_iommu_group_contained(struct iommu_group *iommu_group)
>> +{
>> +	struct vfio_group *group = vfio_group_get_from_iommu(iommu_group);
>> +	bool ret = group && group->container && !group->container->closed;
>> +
>> +	vfio_group_put(group);
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(vfio_iommu_group_contained);
> 
> This seems like a pointless interface, the result is immediately stale.
> Anything that relies on the container staying open needs to add itself
> as a user.  We already have some interfaces for that, ex.
> vfio_group_get_external_user().  Thanks,

The significant part here is container->closed, which is only set if userland closes the
container fd, which is a one-way trip -- the fd for this instance can never be opened 
again.  The container object may still have other references, and not be destroyed yet.
In patch 5, kernel code that waits for the RESUME ioctl calls this accessor to test if 
the ioctl will never arrive.

- Steve
 
>> +
>>  static int vfio_register_group_notifier(struct vfio_group *group,
>>  					unsigned long *events,
>>  					struct notifier_block *nb)
>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>> index 38d3c6a..b2724e7 100644
>> --- a/include/linux/vfio.h
>> +++ b/include/linux/vfio.h
>> @@ -148,6 +148,7 @@ extern int vfio_unregister_notifier(struct device *dev,
>>  
>>  struct kvm;
>>  extern void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm);
>> +extern bool vfio_iommu_group_contained(struct iommu_group *group);
>>  
>>  /*
>>   * Sub-module helpers
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 4/5] vfio: VA suspend interface
  2021-01-08 21:15   ` Alex Williamson
@ 2021-01-11 21:15     ` Steven Sistare
  2021-01-12 15:47       ` Jason Zeng
                         ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Steven Sistare @ 2021-01-11 21:15 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/8/2021 4:15 PM, Alex Williamson wrote:
> On Tue,  5 Jan 2021 07:36:52 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Add interfaces that allow the underlying memory object of an iova
>> range to be mapped to a new host virtual address in the host process:
>>
>>   - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
>>   - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
>>   - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION
> 
> Suspend and Resume can imply many things other than what's done here.
> Should these be something more akin to INVALIDATE_VADDR and
> REPLACE_VADDR?

Agreed.  I suspected we would discuss the names.  Some possibilities:

INVALIDATE_VADDR  REPLACE_VADDR
INV_VADDR         SET_VADDR
CLEAR_VADDR       SET_VADDR
SUSPEND_VADDR     RESUME_VADDR

>> The suspend interface blocks vfio translation of host virtual
>> addresses in a range, but DMA to already-mapped pages continues.
>> The resume interface records the new base VA and resumes translation.
>> See comments in uapi/linux/vfio.h for more details.
>>
>> This is a partial implementation.  Blocking is added in the next patch.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++++------
>>  include/uapi/linux/vfio.h       | 16 ++++++++++++++
>>  2 files changed, 57 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 3dc501d..2c164a6 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -92,6 +92,7 @@ struct vfio_dma {
>>  	int			prot;		/* IOMMU_READ/WRITE */
>>  	bool			iommu_mapped;
>>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
>> +	bool			suspended;
> 
> Is there a value we could use for vfio_dma.vaddr that would always be
> considered invalid, ex. ULONG_MAX?  

Yes, that could replace the suspend flag.  That, plus changing the language from suspend
to invalidate, will probably yield equally understandable code.  I'll try it.

> We'd need to decide if we want to
> allow users to create mappings (mdev-only) using an initial invalid
> vaddr.

Maybe.  Not sure yet.

>>  	struct task_struct	*task;
>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>  	unsigned long		*bitmap;
>> @@ -1080,7 +1081,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  	int ret = 0, retries = 0;
>>  	unsigned long pgshift;
>>  	dma_addr_t iova;
>> -	unsigned long size;
>> +	unsigned long size, consumed;
> 
> This could be scoped into the branch below.

OK.

>>  	mutex_lock(&iommu->lock);
>>  
>> @@ -1169,6 +1170,21 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>  		if (dma->task->mm != current->mm)
>>  			break;
>>  
>> +		if (unmap->flags & VFIO_DMA_UNMAP_FLAG_SUSPEND) {
>> +			if (dma->suspended) {
>> +				ret = -EINVAL;
>> +				goto unlock;
>> +			}
> 
> This leaves us in a state where we marked some entries but not others.
> We should either unwind or... what's the actual harm in skipping these?

We could skip them with no ill effect.  However, it likely means the app is confused
and potentially broken, and it would be courteous to inform them so.  I found such bugs
in qemu as I was developing this feature.

IMO unwinding does not help the app, and adds unnecessary code.  It can still leave some
ranges suspended and some not.  The safest recovery is for the app to exit, and tell the 
developer to fix the redundant suspend call.

>> +			dma->suspended = true;
>> +			consumed = dma->iova + dma->size - iova;
>> +			if (consumed >= size)
>> +				break;
>> +			iova += consumed;
>> +			size -= consumed;
>> +			unmapped += dma->size;
>> +			continue;
>> +		}
> 
> This short-cuts the dirty bitmap flag, so we need to decide if it's
> legal to call them together or we need to prevent it... Oh, I see
> you've excluded them earlier below.
> 
>> +
>>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
>>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
>>  
>> @@ -1307,6 +1323,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
>>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  			   struct vfio_iommu_type1_dma_map *map)
>>  {
>> +	bool resume = map->flags & VFIO_DMA_MAP_FLAG_RESUME;
>>  	dma_addr_t iova = map->iova;
>>  	unsigned long vaddr = map->vaddr;
>>  	size_t size = map->size;
>> @@ -1324,13 +1341,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
>>  		prot |= IOMMU_READ;
>>  
>> +	if ((prot && resume) || (!prot && !resume))
>> +		return -EINVAL;
>> +
>>  	mutex_lock(&iommu->lock);
>>  
>>  	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
>>  
>>  	WARN_ON((pgsize - 1) & PAGE_MASK);
>>  
>> -	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
>> +	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
>>  		ret = -EINVAL;
>>  		goto out_unlock;
>>  	}
>> @@ -1341,7 +1361,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>  		goto out_unlock;
>>  	}
>>  
>> -	if (vfio_find_dma(iommu, iova, size)) {
>> +	dma = vfio_find_dma(iommu, iova, size);
>> +	if (resume) {
>> +		if (!dma) {
>> +			ret = -ENOENT;
>> +		} else if (!dma->suspended || dma->iova != iova ||
>> +			   dma->size != size) {
> 
> Why is it necessary that the vfio_dma be suspended before being
> resumed?  Couldn't a user simply use this to change the vaddr?  Does
> that promote abusive use?

This would almost always be incorrect.  If the vaddr changes, then the old vaddr was already
invalidated, and there is a window where it is not OK for kernel code to use the old vaddr.
This could only be safe if the memory object is mapped at both the old vaddr and the new
vaddr concurrently, which is an unlikely use case.

>> +			ret = -EINVAL;
>> +		} else {
>> +			dma->vaddr = vaddr;
> 
> Seems like there's a huge opportunity for a user to create coherency
> issues here... it's their data though I guess.

Yes.  That's what the language in the uapi about mapping the same memory object is about.

>> +			dma->suspended = false;
>> +		}
>> +		goto out_unlock;
>> +	} else if (dma) {
>>  		ret = -EEXIST;
>>  		goto out_unlock;
>>  	}
>> @@ -2532,6 +2564,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>>  	case VFIO_TYPE1_IOMMU:
>>  	case VFIO_TYPE1v2_IOMMU:
>>  	case VFIO_TYPE1_NESTING_IOMMU:
>> +	case VFIO_SUSPEND:
>>  		return 1;
>>  	case VFIO_DMA_CC_IOMMU:
>>  		if (!iommu)
>> @@ -2686,7 +2719,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
>>  {
>>  	struct vfio_iommu_type1_dma_map map;
>>  	unsigned long minsz;
>> -	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>> +	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
>> +			VFIO_DMA_MAP_FLAG_RESUME;
>>  
>>  	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>>  
>> @@ -2704,6 +2738,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>>  {
>>  	struct vfio_iommu_type1_dma_unmap unmap;
>>  	struct vfio_bitmap bitmap = { 0 };
>> +	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
>> +			VFIO_DMA_UNMAP_FLAG_SUSPEND;
>>  	unsigned long minsz;
>>  	int ret;
>>  
>> @@ -2712,8 +2748,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>>  	if (copy_from_user(&unmap, (void __user *)arg, minsz))
>>  		return -EFAULT;
>>  
>> -	if (unmap.argsz < minsz ||
>> -	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
>> +	if (unmap.argsz < minsz || unmap.flags & ~mask || unmap.flags == mask)
> 
> Maybe a short comment here to note that dirty-bimap and
> suspend/invalidate are mutually exclusive.  Probably should be
> mentioned in the uapi too.

Will do, for both.

>>  		return -EINVAL;
>>  
>>  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 896e527..fcf7b56 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -46,6 +46,9 @@
>>   */
>>  #define VFIO_NOIOMMU_IOMMU		8
>>  
>> +/* Supports VFIO DMA suspend and resume */
>> +#define VFIO_SUSPEND			9
>> +
>>  /*
>>   * The IOCTL interface is designed for extensibility by embedding the
>>   * structure length (argsz) and flags into structures passed between
>> @@ -1046,12 +1049,19 @@ struct vfio_iommu_type1_info_cap_migration {
>>   *
>>   * Map process virtual addresses to IO virtual addresses using the
>>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
>> + *
>> + * If flags & VFIO_DMA_MAP_FLAG_RESUME, record the new base vaddr for iova, and
>> + * resume translation of host virtual addresses in the iova range.  The new
>> + * vaddr must point to the same memory object as the old vaddr, but this is not
>> + * verified.
> 
> It's hard to use "must" terminology here if we're not going to check.
> Maybe the phrasing should be something more along the lines of "should
> point to the same memory object or the user risks coherency issues
> within their virtual address space".

I used "must" because it is always incorrect if the object is not the same.  How about:
  The new vaddr must point to the same memory object as the old vaddr, but this is not
  verified.  Violation of this constraint may result in memory corruption within the
  host process and/or guest.

>>  iova and size must match those in the original MAP_DMA call.
>> + * Protection is not changed, and the READ & WRITE flags must be 0.
> 
> This doesn't mention that the entry must be previously
> suspended/invalidated (if we choose to keep those semantics).  Thanks,

Will add, thanks.

- Steve 
>>   */
>>  struct vfio_iommu_type1_dma_map {
>>  	__u32	argsz;
>>  	__u32	flags;
>>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
>>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
>> +#define VFIO_DMA_MAP_FLAG_RESUME (1 << 2)
>>  	__u64	vaddr;				/* Process virtual address */
>>  	__u64	iova;				/* IO virtual address */
>>  	__u64	size;				/* Size of mapping (bytes) */
>> @@ -1084,11 +1094,17 @@ struct vfio_bitmap {
>>   * indicates that the page at that offset from iova is dirty. A Bitmap of the
>>   * pages in the range of unmapped size is returned in the user-provided
>>   * vfio_bitmap.data.
>> + *
>> + * If flags & VFIO_DMA_UNMAP_FLAG_SUSPEND, do not unmap, but suspend vfio
>> + * translation of host virtual addresses in the iova range.  During suspension,
>> + * kernel threads that attempt to translate will block.  DMA to already-mapped
>> + * pages continues.
>>   */
>>  struct vfio_iommu_type1_dma_unmap {
>>  	__u32	argsz;
>>  	__u32	flags;
>>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
>> +#define VFIO_DMA_UNMAP_FLAG_SUSPEND	     (1 << 1)
>>  	__u64	iova;				/* IO virtual address */
>>  	__u64	size;				/* Size of mapping (bytes) */
>>  	__u8    data[];
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 5/5] vfio: block during VA suspend
  2021-01-08 21:32   ` [PATCH V1 5/5] vfio: block during VA suspend Alex Williamson
@ 2021-01-11 21:16     ` Steven Sistare
  0 siblings, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2021-01-11 21:16 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/8/2021 4:32 PM, Alex Williamson wrote:
> On Tue,  5 Jan 2021 07:36:53 -0800
> Steve Sistare <steven.sistare@oracle.com> wrote:
> 
>> Block translation of host virtual address while an iova range is suspended.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  drivers/vfio/vfio_iommu_type1.c | 48 ++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 45 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index 2c164a6..8035b9a 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -31,6 +31,8 @@
>>  #include <linux/rbtree.h>
>>  #include <linux/sched/signal.h>
>>  #include <linux/sched/mm.h>
>> +#include <linux/delay.h>
>> +#include <linux/kthread.h>
>>  #include <linux/slab.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/vfio.h>
>> @@ -484,6 +486,34 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>>  	return ret;
>>  }
>>  
>> +static bool vfio_iommu_contained(struct vfio_iommu *iommu)
>> +{
>> +	struct vfio_domain *domain = iommu->external_domain;
>> +	struct vfio_group *group;
>> +
>> +	if (!domain)
>> +		domain = list_first_entry(&iommu->domain_list,
>> +					  struct vfio_domain, next);
>> +
>> +	group = list_first_entry(&domain->group_list, struct vfio_group, next);
>> +	return vfio_iommu_group_contained(group->iommu_group);
>> +}
> 
> This seems really backwards for a vfio iommu backend to ask vfio-core
> whether its internal object is active.

Yes.  I considered the architecturally purer approach of defining a new back end
interface and calling it from the core when the change to uncontained occurs,
but it seemed like overkill.
 
>> +
>> +
>> +bool vfio_vaddr_valid(struct vfio_iommu *iommu, struct vfio_dma *dma)
>> +{
>> +	while (dma->suspended) {
>> +		mutex_unlock(&iommu->lock);
>> +		msleep_interruptible(10);
>> +		mutex_lock(&iommu->lock);
>> +		if (kthread_should_stop() || !vfio_iommu_contained(iommu) ||
>> +		    fatal_signal_pending(current)) {
>> +			return false;
>> +		}
>> +	}
>> +	return true;
>> +}
>> +
>>  /*
>>   * Attempt to pin pages.  We really don't want to track all the pfns and
>>   * the iommu can only map chunks of consecutive pfns anyway, so get the
>> @@ -690,6 +720,11 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
>>  			continue;
>>  		}
>>  
>> +		if (!vfio_vaddr_valid(iommu, dma)) {
>> +			ret = -EFAULT;
>> +			goto pin_unwind;
>> +		}
> 
> This could have dropped iommu->lock, we have no basis to believe our
> vfio_dma object is still valid.  

Ouch, I am embarrassed. To fix, I will pass iova to vfio_vaddr_valid() and call
vfio_find_dma after re-acquiring the lock.

> Also this is called with an elevated
> reference to the container, so we theoretically should not need to test
> whether the iommu is still contained.

We still have a reference to the container, but userland may have closed the container fd
and hence will never call the ioctl to replace the vaddr, so we must bail.
 
>> +
>>  		remote_vaddr = dma->vaddr + (iova - dma->iova);
>>  		ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i],
>>  					     do_accounting);
>> @@ -1514,12 +1549,16 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
>>  					i += PAGE_SIZE;
>>  				}
>>  			} else {
>> -				unsigned long pfn;
>> -				unsigned long vaddr = dma->vaddr +
>> -						     (iova - dma->iova);
>> +				unsigned long pfn, vaddr;
>>  				size_t n = dma->iova + dma->size - iova;
>>  				long npage;
>>  
>> +				if (!vfio_vaddr_valid(iommu, dma)) {
>> +					ret = -EFAULT;
>> +					goto unwind;
>> +				}
> 
> This one is even scarier, do we have any basis to assume either object
> is still valid after iommu->lock is released?  

Another ouch. The iommu object still exists, but its dma_list may have changed.  
We cannot even unwind and punt with EAGAIN.

I can fix this by adding a preamble loop that holds iommu->lock and verifies that
all dma_list entries are not suspended.  If the loop hits a suspended entry, it
drops the lock and sleeps (like vfio_vaddr_valid), and retries from the beginning
of the list.  On reaching the end of the list, the lock remains held and prevents
entries from being suspended.

Instead of retrying, I could return EAGAIN, but that would require unwinding of 
earlier groups in __vfio_container_attach_groups.

> We can only get here if
> we're attaching a group to the iommu with suspended vfio_dmas.  Do we
> need to allow that?  It seems we could -EAGAIN and let the user figure
> out that they can't attach a group while mappings have invalid vaddrs.

I would like to allow it to minimize userland code changes.

>> +				vaddr = dma->vaddr + (iova - dma->iova);
>> +
>>  				npage = vfio_pin_pages_remote(dma, vaddr,
>>  							      n >> PAGE_SHIFT,
>>  							      &pfn, limit);
>> @@ -2965,6 +3004,9 @@ static int vfio_iommu_type1_dma_rw_chunk(struct vfio_iommu *iommu,
>>  	if (count > dma->size - offset)
>>  		count = dma->size - offset;
>>  
>> +	if (!vfio_vaddr_valid(iommu, dma))
> 
> Same as the pinning path, this should hold a container user reference
> but we cannot assume our vfio_dma object is valid if we've released the
> lock.  Thanks,

Yup.  Same fix needed.

- Steve

>> +		goto out;
>> +
>>  	vaddr = dma->vaddr + offset;
>>  
>>  	if (write) {
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 4/5] vfio: VA suspend interface
  2021-01-11 21:15     ` Steven Sistare
@ 2021-01-12 15:47       ` Jason Zeng
  2021-01-12 22:09         ` Alex Williamson
  2021-01-12 22:47       ` Alex Williamson
  2021-01-19 20:11       ` Steven Sistare
  2 siblings, 1 reply; 39+ messages in thread
From: Jason Zeng @ 2021-01-12 15:47 UTC (permalink / raw)
  To: Steven Sistare; +Cc: Alex Williamson, kvm, Cornelia Huck, Kirti Wankhede

On Mon, Jan 11, 2021 at 04:15:02PM -0500, Steven Sistare wrote:
> On 1/8/2021 4:15 PM, Alex Williamson wrote:
> > On Tue,  5 Jan 2021 07:36:52 -0800
> > Steve Sistare <steven.sistare@oracle.com> wrote:
> > 
> >> Add interfaces that allow the underlying memory object of an iova
> >> range to be mapped to a new host virtual address in the host process:
> >>
> >>   - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
> >>   - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
> >>   - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION
> > 
> > Suspend and Resume can imply many things other than what's done here.
> > Should these be something more akin to INVALIDATE_VADDR and
> > REPLACE_VADDR?
> 
> Agreed.  I suspected we would discuss the names.  Some possibilities:
> 
> INVALIDATE_VADDR  REPLACE_VADDR
> INV_VADDR         SET_VADDR
> CLEAR_VADDR       SET_VADDR
> SUSPEND_VADDR     RESUME_VADDR
> 

What about SET_KEEPALIVE/CLEAR_KEEPALIVE? Vaddr can be updated as part
of CLEAR_KEEPALIVE.

Actually we are keeping the DMA mappings alive, instead of suspending
them, when new Qemu are exec'ing, because the hardware can still do
DMA during this period.

This name will also be friendly to VMM Fast Restart [1], which aims to
kexec reboot host kernel while VMs are paused in memory and resumed
seamlessly after the new kernel is ready.

The VMM Fast Restart scenario is as such: Before kexec-reboot,
SET_KEEPALIVE is issued to set keepalive flag for VFIO DMA mapping
entries, so that their underlying IOMMU DMA mappings will not be
destroyed when Qemu quits before kexec-reboot. After kexec-reboot,
the VFIO DMA mappings are restored together with its keepalive flag
in the new kernel so that new Qemu can do CLEAR_KEEPALIVE operation.

Thanks,
Jason

[1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf,
    POC code at github repo:
    https://github.com/intel/vmm-fast-restart-linux.git vmm-fast-restart



> >> The suspend interface blocks vfio translation of host virtual
> >> addresses in a range, but DMA to already-mapped pages continues.
> >> The resume interface records the new base VA and resumes translation.
> >> See comments in uapi/linux/vfio.h for more details.
> >>
> >> This is a partial implementation.  Blocking is added in the next patch.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++++------
> >>  include/uapi/linux/vfio.h       | 16 ++++++++++++++
> >>  2 files changed, 57 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 3dc501d..2c164a6 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -92,6 +92,7 @@ struct vfio_dma {
> >>  	int			prot;		/* IOMMU_READ/WRITE */
> >>  	bool			iommu_mapped;
> >>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
> >> +	bool			suspended;
> > 
> > Is there a value we could use for vfio_dma.vaddr that would always be
> > considered invalid, ex. ULONG_MAX?  
> 
> Yes, that could replace the suspend flag.  That, plus changing the language from suspend
> to invalidate, will probably yield equally understandable code.  I'll try it.
> 
> > We'd need to decide if we want to
> > allow users to create mappings (mdev-only) using an initial invalid
> > vaddr.
> 
> Maybe.  Not sure yet.
> 
> >>  	struct task_struct	*task;
> >>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> >>  	unsigned long		*bitmap;
> >> @@ -1080,7 +1081,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>  	int ret = 0, retries = 0;
> >>  	unsigned long pgshift;
> >>  	dma_addr_t iova;
> >> -	unsigned long size;
> >> +	unsigned long size, consumed;
> > 
> > This could be scoped into the branch below.
> 
> OK.
> 
> >>  	mutex_lock(&iommu->lock);
> >>  
> >> @@ -1169,6 +1170,21 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>  		if (dma->task->mm != current->mm)
> >>  			break;
> >>  
> >> +		if (unmap->flags & VFIO_DMA_UNMAP_FLAG_SUSPEND) {
> >> +			if (dma->suspended) {
> >> +				ret = -EINVAL;
> >> +				goto unlock;
> >> +			}
> > 
> > This leaves us in a state where we marked some entries but not others.
> > We should either unwind or... what's the actual harm in skipping these?
> 
> We could skip them with no ill effect.  However, it likely means the app is confused
> and potentially broken, and it would be courteous to inform them so.  I found such bugs
> in qemu as I was developing this feature.
> 
> IMO unwinding does not help the app, and adds unnecessary code.  It can still leave some
> ranges suspended and some not.  The safest recovery is for the app to exit, and tell the 
> developer to fix the redundant suspend call.
> 
> >> +			dma->suspended = true;
> >> +			consumed = dma->iova + dma->size - iova;
> >> +			if (consumed >= size)
> >> +				break;
> >> +			iova += consumed;
> >> +			size -= consumed;
> >> +			unmapped += dma->size;
> >> +			continue;
> >> +		}
> > 
> > This short-cuts the dirty bitmap flag, so we need to decide if it's
> > legal to call them together or we need to prevent it... Oh, I see
> > you've excluded them earlier below.
> > 
> >> +
> >>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
> >>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
> >>  
> >> @@ -1307,6 +1323,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
> >>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>  			   struct vfio_iommu_type1_dma_map *map)
> >>  {
> >> +	bool resume = map->flags & VFIO_DMA_MAP_FLAG_RESUME;
> >>  	dma_addr_t iova = map->iova;
> >>  	unsigned long vaddr = map->vaddr;
> >>  	size_t size = map->size;
> >> @@ -1324,13 +1341,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
> >>  		prot |= IOMMU_READ;
> >>  
> >> +	if ((prot && resume) || (!prot && !resume))
> >> +		return -EINVAL;
> >> +
> >>  	mutex_lock(&iommu->lock);
> >>  
> >>  	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
> >>  
> >>  	WARN_ON((pgsize - 1) & PAGE_MASK);
> >>  
> >> -	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
> >> +	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
> >>  		ret = -EINVAL;
> >>  		goto out_unlock;
> >>  	}
> >> @@ -1341,7 +1361,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>  		goto out_unlock;
> >>  	}
> >>  
> >> -	if (vfio_find_dma(iommu, iova, size)) {
> >> +	dma = vfio_find_dma(iommu, iova, size);
> >> +	if (resume) {
> >> +		if (!dma) {
> >> +			ret = -ENOENT;
> >> +		} else if (!dma->suspended || dma->iova != iova ||
> >> +			   dma->size != size) {
> > 
> > Why is it necessary that the vfio_dma be suspended before being
> > resumed?  Couldn't a user simply use this to change the vaddr?  Does
> > that promote abusive use?
> 
> This would almost always be incorrect.  If the vaddr changes, then the old vaddr was already
> invalidated, and there is a window where it is not OK for kernel code to use the old vaddr.
> This could only be safe if the memory object is mapped at both the old vaddr and the new
> vaddr concurrently, which is an unlikely use case.
> 
> >> +			ret = -EINVAL;
> >> +		} else {
> >> +			dma->vaddr = vaddr;
> > 
> > Seems like there's a huge opportunity for a user to create coherency
> > issues here... it's their data though I guess.
> 
> Yes.  That's what the language in the uapi about mapping the same memory object is about.
> 
> >> +			dma->suspended = false;
> >> +		}
> >> +		goto out_unlock;
> >> +	} else if (dma) {
> >>  		ret = -EEXIST;
> >>  		goto out_unlock;
> >>  	}
> >> @@ -2532,6 +2564,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
> >>  	case VFIO_TYPE1_IOMMU:
> >>  	case VFIO_TYPE1v2_IOMMU:
> >>  	case VFIO_TYPE1_NESTING_IOMMU:
> >> +	case VFIO_SUSPEND:
> >>  		return 1;
> >>  	case VFIO_DMA_CC_IOMMU:
> >>  		if (!iommu)
> >> @@ -2686,7 +2719,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
> >>  {
> >>  	struct vfio_iommu_type1_dma_map map;
> >>  	unsigned long minsz;
> >> -	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> >> +	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
> >> +			VFIO_DMA_MAP_FLAG_RESUME;
> >>  
> >>  	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
> >>  
> >> @@ -2704,6 +2738,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
> >>  {
> >>  	struct vfio_iommu_type1_dma_unmap unmap;
> >>  	struct vfio_bitmap bitmap = { 0 };
> >> +	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
> >> +			VFIO_DMA_UNMAP_FLAG_SUSPEND;
> >>  	unsigned long minsz;
> >>  	int ret;
> >>  
> >> @@ -2712,8 +2748,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
> >>  	if (copy_from_user(&unmap, (void __user *)arg, minsz))
> >>  		return -EFAULT;
> >>  
> >> -	if (unmap.argsz < minsz ||
> >> -	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
> >> +	if (unmap.argsz < minsz || unmap.flags & ~mask || unmap.flags == mask)
> > 
> > Maybe a short comment here to note that dirty-bimap and
> > suspend/invalidate are mutually exclusive.  Probably should be
> > mentioned in the uapi too.
> 
> Will do, for both.
> 
> >>  		return -EINVAL;
> >>  
> >>  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 896e527..fcf7b56 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -46,6 +46,9 @@
> >>   */
> >>  #define VFIO_NOIOMMU_IOMMU		8
> >>  
> >> +/* Supports VFIO DMA suspend and resume */
> >> +#define VFIO_SUSPEND			9
> >> +
> >>  /*
> >>   * The IOCTL interface is designed for extensibility by embedding the
> >>   * structure length (argsz) and flags into structures passed between
> >> @@ -1046,12 +1049,19 @@ struct vfio_iommu_type1_info_cap_migration {
> >>   *
> >>   * Map process virtual addresses to IO virtual addresses using the
> >>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
> >> + *
> >> + * If flags & VFIO_DMA_MAP_FLAG_RESUME, record the new base vaddr for iova, and
> >> + * resume translation of host virtual addresses in the iova range.  The new
> >> + * vaddr must point to the same memory object as the old vaddr, but this is not
> >> + * verified.
> > 
> > It's hard to use "must" terminology here if we're not going to check.
> > Maybe the phrasing should be something more along the lines of "should
> > point to the same memory object or the user risks coherency issues
> > within their virtual address space".
> 
> I used "must" because it is always incorrect if the object is not the same.  How about:
>   The new vaddr must point to the same memory object as the old vaddr, but this is not
>   verified.  Violation of this constraint may result in memory corruption within the
>   host process and/or guest.
> 
> >>  iova and size must match those in the original MAP_DMA call.
> >> + * Protection is not changed, and the READ & WRITE flags must be 0.
> > 
> > This doesn't mention that the entry must be previously
> > suspended/invalidated (if we choose to keep those semantics).  Thanks,
> 
> Will add, thanks.
> 
> - Steve 
> >>   */
> >>  struct vfio_iommu_type1_dma_map {
> >>  	__u32	argsz;
> >>  	__u32	flags;
> >>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
> >>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
> >> +#define VFIO_DMA_MAP_FLAG_RESUME (1 << 2)
> >>  	__u64	vaddr;				/* Process virtual address */
> >>  	__u64	iova;				/* IO virtual address */
> >>  	__u64	size;				/* Size of mapping (bytes) */
> >> @@ -1084,11 +1094,17 @@ struct vfio_bitmap {
> >>   * indicates that the page at that offset from iova is dirty. A Bitmap of the
> >>   * pages in the range of unmapped size is returned in the user-provided
> >>   * vfio_bitmap.data.
> >> + *
> >> + * If flags & VFIO_DMA_UNMAP_FLAG_SUSPEND, do not unmap, but suspend vfio
> >> + * translation of host virtual addresses in the iova range.  During suspension,
> >> + * kernel threads that attempt to translate will block.  DMA to already-mapped
> >> + * pages continues.
> >>   */
> >>  struct vfio_iommu_type1_dma_unmap {
> >>  	__u32	argsz;
> >>  	__u32	flags;
> >>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> >> +#define VFIO_DMA_UNMAP_FLAG_SUSPEND	     (1 << 1)
> >>  	__u64	iova;				/* IO virtual address */
> >>  	__u64	size;				/* Size of mapping (bytes) */
> >>  	__u8    data[];
> > 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 0/5] vfio virtual address update
  2021-01-05 15:36 [PATCH V1 0/5] vfio virtual address update Steve Sistare
                   ` (4 preceding siblings ...)
  2021-01-05 15:36 ` [PATCH V1 5/5] vfio: block during VA suspend Steve Sistare
@ 2021-01-12 21:52 ` Steven Sistare
  5 siblings, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2021-01-12 21:52 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Cornelia Huck, Kirti Wankhede, kvm

On 1/5/2021 10:36 AM, Steve Sistare wrote:
> Add interfaces that allow the underlying memory object of an iova range
> to be mapped to a new virtual address in the host process:
> 
>   - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
>   - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
>   - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION
> 
> The suspend interface blocks vfio translation of host virtual addresses in
> a range, but DMA to already-mapped pages continues.  The resume interface
> records the new base VA and resumes translation.  The implementation
> supports iommu type1 and mediated devices.
> 
> This functionality is necessary for live update, in which a host process
> such as qemu exec's an updated version of itself, while preserving its
> guest and vfio devices.  The process suspends vfio VA translation, exec's
> its new self, mmap's the memory object(s) underlying vfio object, and
> resumes VA translation.  For a working example that uses these new
> interfaces, see the QEMU patch series "[PATCH V2] Live Update".
> 
> Patch 1 modifies the iova rbtree to allow iteration over ranges with gaps,
>   without deleting each entry.  This is required by patch 4.
> Patch 2 adds an option to unmap all ranges, which simplifies userland code.
> Patch 3 adds an interface to detect if an iommu_group has a valid container,
>   which patch 5 uses to release a blocked thread if a container is closed.
> Patch 4 implements the new ioctl's.
> Patch 5 adds blocking to complete the implementation .
> 
> Steve Sistare (5):
>   vfio: maintain dma_list order
>   vfio: option to unmap all
>   vfio: detect closed container
>   vfio: VA suspend interface
>   vfio: block during VA suspend
> 
>  drivers/vfio/vfio.c             |  12 ++++
>  drivers/vfio/vfio_iommu_type1.c | 122 ++++++++++++++++++++++++++++++++++------
>  include/linux/vfio.h            |   1 +
>  include/uapi/linux/vfio.h       |  19 ++++++-
>  4 files changed, 135 insertions(+), 19 deletions(-)
> 

Hi Alex,
  I can send a patch V2 for review if you weigh in on the following:

* preferred interface for unmap-all (patch 2)
* name of the suspend and resume flags (patch 4)
* is vfio_iommu_contained() acceptable, or is a new backend interface needed? (patch 5)

- Steve


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 4/5] vfio: VA suspend interface
  2021-01-12 15:47       ` Jason Zeng
@ 2021-01-12 22:09         ` Alex Williamson
  2021-01-13  4:37           ` Jason Zeng
  0 siblings, 1 reply; 39+ messages in thread
From: Alex Williamson @ 2021-01-12 22:09 UTC (permalink / raw)
  To: Jason Zeng; +Cc: Steven Sistare, kvm, Cornelia Huck, Kirti Wankhede

On Tue, 12 Jan 2021 23:47:35 +0800
Jason Zeng <jason.zeng@intel.com> wrote:

> On Mon, Jan 11, 2021 at 04:15:02PM -0500, Steven Sistare wrote:
> > On 1/8/2021 4:15 PM, Alex Williamson wrote:  
> > > On Tue,  5 Jan 2021 07:36:52 -0800
> > > Steve Sistare <steven.sistare@oracle.com> wrote:
> > >   
> > >> Add interfaces that allow the underlying memory object of an iova
> > >> range to be mapped to a new host virtual address in the host process:
> > >>
> > >>   - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
> > >>   - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
> > >>   - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION  
> > > 
> > > Suspend and Resume can imply many things other than what's done here.
> > > Should these be something more akin to INVALIDATE_VADDR and
> > > REPLACE_VADDR?  
> > 
> > Agreed.  I suspected we would discuss the names.  Some possibilities:
> > 
> > INVALIDATE_VADDR  REPLACE_VADDR
> > INV_VADDR         SET_VADDR
> > CLEAR_VADDR       SET_VADDR
> > SUSPEND_VADDR     RESUME_VADDR
> >   
> 
> What about SET_KEEPALIVE/CLEAR_KEEPALIVE? Vaddr can be updated as part
> of CLEAR_KEEPALIVE.

I prefer names that describe what they do, not an arbitrary use case.

> Actually we are keeping the DMA mappings alive, instead of suspending
> them, when new Qemu are exec'ing, because the hardware can still do
> DMA during this period.

Why would they not stay alive when the vfio file descriptors remain
open and nothing has been unmapped?
 
> This name will also be friendly to VMM Fast Restart [1], which aims to
> kexec reboot host kernel while VMs are paused in memory and resumed
> seamlessly after the new kernel is ready.
> 
> The VMM Fast Restart scenario is as such: Before kexec-reboot,
> SET_KEEPALIVE is issued to set keepalive flag for VFIO DMA mapping
> entries, so that their underlying IOMMU DMA mappings will not be
> destroyed when Qemu quits before kexec-reboot. After kexec-reboot,
> the VFIO DMA mappings are restored together with its keepalive flag
> in the new kernel so that new Qemu can do CLEAR_KEEPALIVE operation.

That entails far more than simply invalidating vaddr.  We need to have
a very specific uapi, it's difficult to expand the scope of a flag
when/if such features are ever proposed upstream.  Thanks,

Alex

> > >> The suspend interface blocks vfio translation of host virtual
> > >> addresses in a range, but DMA to already-mapped pages continues.
> > >> The resume interface records the new base VA and resumes translation.
> > >> See comments in uapi/linux/vfio.h for more details.
> > >>
> > >> This is a partial implementation.  Blocking is added in the next patch.
> > >>
> > >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > >> ---
> > >>  drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++++------
> > >>  include/uapi/linux/vfio.h       | 16 ++++++++++++++
> > >>  2 files changed, 57 insertions(+), 6 deletions(-)
> > >>
> > >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > >> index 3dc501d..2c164a6 100644
> > >> --- a/drivers/vfio/vfio_iommu_type1.c
> > >> +++ b/drivers/vfio/vfio_iommu_type1.c
> > >> @@ -92,6 +92,7 @@ struct vfio_dma {
> > >>  	int			prot;		/* IOMMU_READ/WRITE */
> > >>  	bool			iommu_mapped;
> > >>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
> > >> +	bool			suspended;  
> > > 
> > > Is there a value we could use for vfio_dma.vaddr that would always be
> > > considered invalid, ex. ULONG_MAX?    
> > 
> > Yes, that could replace the suspend flag.  That, plus changing the language from suspend
> > to invalidate, will probably yield equally understandable code.  I'll try it.
> >   
> > > We'd need to decide if we want to
> > > allow users to create mappings (mdev-only) using an initial invalid
> > > vaddr.  
> > 
> > Maybe.  Not sure yet.
> >   
> > >>  	struct task_struct	*task;
> > >>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> > >>  	unsigned long		*bitmap;
> > >> @@ -1080,7 +1081,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > >>  	int ret = 0, retries = 0;
> > >>  	unsigned long pgshift;
> > >>  	dma_addr_t iova;
> > >> -	unsigned long size;
> > >> +	unsigned long size, consumed;  
> > > 
> > > This could be scoped into the branch below.  
> > 
> > OK.
> >   
> > >>  	mutex_lock(&iommu->lock);
> > >>  
> > >> @@ -1169,6 +1170,21 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > >>  		if (dma->task->mm != current->mm)
> > >>  			break;
> > >>  
> > >> +		if (unmap->flags & VFIO_DMA_UNMAP_FLAG_SUSPEND) {
> > >> +			if (dma->suspended) {
> > >> +				ret = -EINVAL;
> > >> +				goto unlock;
> > >> +			}  
> > > 
> > > This leaves us in a state where we marked some entries but not others.
> > > We should either unwind or... what's the actual harm in skipping these?  
> > 
> > We could skip them with no ill effect.  However, it likely means the app is confused
> > and potentially broken, and it would be courteous to inform them so.  I found such bugs
> > in qemu as I was developing this feature.
> > 
> > IMO unwinding does not help the app, and adds unnecessary code.  It can still leave some
> > ranges suspended and some not.  The safest recovery is for the app to exit, and tell the 
> > developer to fix the redundant suspend call.
> >   
> > >> +			dma->suspended = true;
> > >> +			consumed = dma->iova + dma->size - iova;
> > >> +			if (consumed >= size)
> > >> +				break;
> > >> +			iova += consumed;
> > >> +			size -= consumed;
> > >> +			unmapped += dma->size;
> > >> +			continue;
> > >> +		}  
> > > 
> > > This short-cuts the dirty bitmap flag, so we need to decide if it's
> > > legal to call them together or we need to prevent it... Oh, I see
> > > you've excluded them earlier below.
> > >   
> > >> +
> > >>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
> > >>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
> > >>  
> > >> @@ -1307,6 +1323,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
> > >>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > >>  			   struct vfio_iommu_type1_dma_map *map)
> > >>  {
> > >> +	bool resume = map->flags & VFIO_DMA_MAP_FLAG_RESUME;
> > >>  	dma_addr_t iova = map->iova;
> > >>  	unsigned long vaddr = map->vaddr;
> > >>  	size_t size = map->size;
> > >> @@ -1324,13 +1341,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > >>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
> > >>  		prot |= IOMMU_READ;
> > >>  
> > >> +	if ((prot && resume) || (!prot && !resume))
> > >> +		return -EINVAL;
> > >> +
> > >>  	mutex_lock(&iommu->lock);
> > >>  
> > >>  	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
> > >>  
> > >>  	WARN_ON((pgsize - 1) & PAGE_MASK);
> > >>  
> > >> -	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
> > >> +	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
> > >>  		ret = -EINVAL;
> > >>  		goto out_unlock;
> > >>  	}
> > >> @@ -1341,7 +1361,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > >>  		goto out_unlock;
> > >>  	}
> > >>  
> > >> -	if (vfio_find_dma(iommu, iova, size)) {
> > >> +	dma = vfio_find_dma(iommu, iova, size);
> > >> +	if (resume) {
> > >> +		if (!dma) {
> > >> +			ret = -ENOENT;
> > >> +		} else if (!dma->suspended || dma->iova != iova ||
> > >> +			   dma->size != size) {  
> > > 
> > > Why is it necessary that the vfio_dma be suspended before being
> > > resumed?  Couldn't a user simply use this to change the vaddr?  Does
> > > that promote abusive use?  
> > 
> > This would almost always be incorrect.  If the vaddr changes, then the old vaddr was already
> > invalidated, and there is a window where it is not OK for kernel code to use the old vaddr.
> > This could only be safe if the memory object is mapped at both the old vaddr and the new
> > vaddr concurrently, which is an unlikely use case.
> >   
> > >> +			ret = -EINVAL;
> > >> +		} else {
> > >> +			dma->vaddr = vaddr;  
> > > 
> > > Seems like there's a huge opportunity for a user to create coherency
> > > issues here... it's their data though I guess.  
> > 
> > Yes.  That's what the language in the uapi about mapping the same memory object is about.
> >   
> > >> +			dma->suspended = false;
> > >> +		}
> > >> +		goto out_unlock;
> > >> +	} else if (dma) {
> > >>  		ret = -EEXIST;
> > >>  		goto out_unlock;
> > >>  	}
> > >> @@ -2532,6 +2564,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
> > >>  	case VFIO_TYPE1_IOMMU:
> > >>  	case VFIO_TYPE1v2_IOMMU:
> > >>  	case VFIO_TYPE1_NESTING_IOMMU:
> > >> +	case VFIO_SUSPEND:
> > >>  		return 1;
> > >>  	case VFIO_DMA_CC_IOMMU:
> > >>  		if (!iommu)
> > >> @@ -2686,7 +2719,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
> > >>  {
> > >>  	struct vfio_iommu_type1_dma_map map;
> > >>  	unsigned long minsz;
> > >> -	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> > >> +	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
> > >> +			VFIO_DMA_MAP_FLAG_RESUME;
> > >>  
> > >>  	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
> > >>  
> > >> @@ -2704,6 +2738,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
> > >>  {
> > >>  	struct vfio_iommu_type1_dma_unmap unmap;
> > >>  	struct vfio_bitmap bitmap = { 0 };
> > >> +	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
> > >> +			VFIO_DMA_UNMAP_FLAG_SUSPEND;
> > >>  	unsigned long minsz;
> > >>  	int ret;
> > >>  
> > >> @@ -2712,8 +2748,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
> > >>  	if (copy_from_user(&unmap, (void __user *)arg, minsz))
> > >>  		return -EFAULT;
> > >>  
> > >> -	if (unmap.argsz < minsz ||
> > >> -	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
> > >> +	if (unmap.argsz < minsz || unmap.flags & ~mask || unmap.flags == mask)  
> > > 
> > > Maybe a short comment here to note that dirty-bimap and
> > > suspend/invalidate are mutually exclusive.  Probably should be
> > > mentioned in the uapi too.  
> > 
> > Will do, for both.
> >   
> > >>  		return -EINVAL;
> > >>  
> > >>  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
> > >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > >> index 896e527..fcf7b56 100644
> > >> --- a/include/uapi/linux/vfio.h
> > >> +++ b/include/uapi/linux/vfio.h
> > >> @@ -46,6 +46,9 @@
> > >>   */
> > >>  #define VFIO_NOIOMMU_IOMMU		8
> > >>  
> > >> +/* Supports VFIO DMA suspend and resume */
> > >> +#define VFIO_SUSPEND			9
> > >> +
> > >>  /*
> > >>   * The IOCTL interface is designed for extensibility by embedding the
> > >>   * structure length (argsz) and flags into structures passed between
> > >> @@ -1046,12 +1049,19 @@ struct vfio_iommu_type1_info_cap_migration {
> > >>   *
> > >>   * Map process virtual addresses to IO virtual addresses using the
> > >>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
> > >> + *
> > >> + * If flags & VFIO_DMA_MAP_FLAG_RESUME, record the new base vaddr for iova, and
> > >> + * resume translation of host virtual addresses in the iova range.  The new
> > >> + * vaddr must point to the same memory object as the old vaddr, but this is not
> > >> + * verified.  
> > > 
> > > It's hard to use "must" terminology here if we're not going to check.
> > > Maybe the phrasing should be something more along the lines of "should
> > > point to the same memory object or the user risks coherency issues
> > > within their virtual address space".  
> > 
> > I used "must" because it is always incorrect if the object is not the same.  How about:
> >   The new vaddr must point to the same memory object as the old vaddr, but this is not
> >   verified.  Violation of this constraint may result in memory corruption within the
> >   host process and/or guest.
> >   
> > >>  iova and size must match those in the original MAP_DMA call.
> > >> + * Protection is not changed, and the READ & WRITE flags must be 0.  
> > > 
> > > This doesn't mention that the entry must be previously
> > > suspended/invalidated (if we choose to keep those semantics).  Thanks,  
> > 
> > Will add, thanks.
> > 
> > - Steve   
> > >>   */
> > >>  struct vfio_iommu_type1_dma_map {
> > >>  	__u32	argsz;
> > >>  	__u32	flags;
> > >>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
> > >>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
> > >> +#define VFIO_DMA_MAP_FLAG_RESUME (1 << 2)
> > >>  	__u64	vaddr;				/* Process virtual address */
> > >>  	__u64	iova;				/* IO virtual address */
> > >>  	__u64	size;				/* Size of mapping (bytes) */
> > >> @@ -1084,11 +1094,17 @@ struct vfio_bitmap {
> > >>   * indicates that the page at that offset from iova is dirty. A Bitmap of the
> > >>   * pages in the range of unmapped size is returned in the user-provided
> > >>   * vfio_bitmap.data.
> > >> + *
> > >> + * If flags & VFIO_DMA_UNMAP_FLAG_SUSPEND, do not unmap, but suspend vfio
> > >> + * translation of host virtual addresses in the iova range.  During suspension,
> > >> + * kernel threads that attempt to translate will block.  DMA to already-mapped
> > >> + * pages continues.
> > >>   */
> > >>  struct vfio_iommu_type1_dma_unmap {
> > >>  	__u32	argsz;
> > >>  	__u32	flags;
> > >>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> > >> +#define VFIO_DMA_UNMAP_FLAG_SUSPEND	     (1 << 1)
> > >>  	__u64	iova;				/* IO virtual address */
> > >>  	__u64	size;				/* Size of mapping (bytes) */
> > >>  	__u8    data[];  
> > >   
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 4/5] vfio: VA suspend interface
  2021-01-11 21:15     ` Steven Sistare
  2021-01-12 15:47       ` Jason Zeng
@ 2021-01-12 22:47       ` Alex Williamson
  2021-01-13  4:10         ` Alex Williamson
  2021-01-13 18:01         ` Steven Sistare
  2021-01-19 20:11       ` Steven Sistare
  2 siblings, 2 replies; 39+ messages in thread
From: Alex Williamson @ 2021-01-12 22:47 UTC (permalink / raw)
  To: Steven Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Mon, 11 Jan 2021 16:15:02 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 1/8/2021 4:15 PM, Alex Williamson wrote:
> > On Tue,  5 Jan 2021 07:36:52 -0800
> > Steve Sistare <steven.sistare@oracle.com> wrote:
> >   
> >> Add interfaces that allow the underlying memory object of an iova
> >> range to be mapped to a new host virtual address in the host process:
> >>
> >>   - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
> >>   - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
> >>   - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION  
> > 
> > Suspend and Resume can imply many things other than what's done here.
> > Should these be something more akin to INVALIDATE_VADDR and
> > REPLACE_VADDR?  
> 
> Agreed.  I suspected we would discuss the names.  Some possibilities:
> 
> INVALIDATE_VADDR  REPLACE_VADDR
> INV_VADDR         SET_VADDR
> CLEAR_VADDR       SET_VADDR
> SUSPEND_VADDR     RESUME_VADDR
> 
> >> The suspend interface blocks vfio translation of host virtual
> >> addresses in a range, but DMA to already-mapped pages continues.
> >> The resume interface records the new base VA and resumes translation.
> >> See comments in uapi/linux/vfio.h for more details.
> >>
> >> This is a partial implementation.  Blocking is added in the next patch.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++++------
> >>  include/uapi/linux/vfio.h       | 16 ++++++++++++++
> >>  2 files changed, 57 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 3dc501d..2c164a6 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -92,6 +92,7 @@ struct vfio_dma {
> >>  	int			prot;		/* IOMMU_READ/WRITE */
> >>  	bool			iommu_mapped;
> >>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
> >> +	bool			suspended;  
> > 
> > Is there a value we could use for vfio_dma.vaddr that would always be
> > considered invalid, ex. ULONG_MAX?    
> 
> Yes, that could replace the suspend flag.  That, plus changing the language from suspend
> to invalidate, will probably yield equally understandable code.  I'll try it.

Thinking about this further, if we defined a VFIO_IOMMU_TYPE1_INV_VADDR
as part of the uapi, could we implement this with only a single flag on
the DMA_MAP ioctl?  For example the user would call DMA_MAP with a flag
to set the vaddr, first to the invalid valid, then to a new value.  It's
always seemed a bit awkward to use DMA_UNMAP to invalidate the vaddr
when the mapping is not actually unmapped.  That might lean towards an
UPDATE or REPLACE flag.

> > We'd need to decide if we want to
> > allow users to create mappings (mdev-only) using an initial invalid
> > vaddr.  
> 
> Maybe.  Not sure yet.

If we used the above, it almost seems strange not to allow it, but at
the same time we don't really want to have different rules for
different devices types.  An initially valid vaddr doesn't seem
unreasonable... though we don't test it until the vendor driver tries
to pin or rw pages w/o IOMMU backing.
 
> >>  	struct task_struct	*task;
> >>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> >>  	unsigned long		*bitmap;
> >> @@ -1080,7 +1081,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>  	int ret = 0, retries = 0;
> >>  	unsigned long pgshift;
> >>  	dma_addr_t iova;
> >> -	unsigned long size;
> >> +	unsigned long size, consumed;  
> > 
> > This could be scoped into the branch below.  
> 
> OK.
> 
> >>  	mutex_lock(&iommu->lock);
> >>  
> >> @@ -1169,6 +1170,21 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>  		if (dma->task->mm != current->mm)
> >>  			break;
> >>  
> >> +		if (unmap->flags & VFIO_DMA_UNMAP_FLAG_SUSPEND) {
> >> +			if (dma->suspended) {
> >> +				ret = -EINVAL;
> >> +				goto unlock;
> >> +			}  
> > 
> > This leaves us in a state where we marked some entries but not others.
> > We should either unwind or... what's the actual harm in skipping these?  
> 
> We could skip them with no ill effect.  However, it likely means the app is confused
> and potentially broken, and it would be courteous to inform them so.  I found such bugs
> in qemu as I was developing this feature.
> 
> IMO unwinding does not help the app, and adds unnecessary code.  It can still leave some
> ranges suspended and some not.  The safest recovery is for the app to exit, and tell the 
> developer to fix the redundant suspend call.

That sounds like an entirely practical rationalization, but our
standard practice is to maintain a consistent state.  If an ioctl fails
is should effectively be as if the ioctl was never called, where
possible.  Userspace can be broken, and potentially so broken that their
best choice is to abort, but we should maintain consistent, predictable
behavior.

> >> +			dma->suspended = true;
> >> +			consumed = dma->iova + dma->size - iova;
> >> +			if (consumed >= size)
> >> +				break;
> >> +			iova += consumed;
> >> +			size -= consumed;
> >> +			unmapped += dma->size;
> >> +			continue;
> >> +		}  
> > 
> > This short-cuts the dirty bitmap flag, so we need to decide if it's
> > legal to call them together or we need to prevent it... Oh, I see
> > you've excluded them earlier below.
> >   
> >> +
> >>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
> >>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
> >>  
> >> @@ -1307,6 +1323,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
> >>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>  			   struct vfio_iommu_type1_dma_map *map)
> >>  {
> >> +	bool resume = map->flags & VFIO_DMA_MAP_FLAG_RESUME;
> >>  	dma_addr_t iova = map->iova;
> >>  	unsigned long vaddr = map->vaddr;
> >>  	size_t size = map->size;
> >> @@ -1324,13 +1341,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
> >>  		prot |= IOMMU_READ;
> >>  
> >> +	if ((prot && resume) || (!prot && !resume))
> >> +		return -EINVAL;
> >> +
> >>  	mutex_lock(&iommu->lock);
> >>  
> >>  	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
> >>  
> >>  	WARN_ON((pgsize - 1) & PAGE_MASK);
> >>  
> >> -	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
> >> +	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
> >>  		ret = -EINVAL;
> >>  		goto out_unlock;
> >>  	}
> >> @@ -1341,7 +1361,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>  		goto out_unlock;
> >>  	}
> >>  
> >> -	if (vfio_find_dma(iommu, iova, size)) {
> >> +	dma = vfio_find_dma(iommu, iova, size);
> >> +	if (resume) {
> >> +		if (!dma) {
> >> +			ret = -ENOENT;
> >> +		} else if (!dma->suspended || dma->iova != iova ||
> >> +			   dma->size != size) {  
> > 
> > Why is it necessary that the vfio_dma be suspended before being
> > resumed?  Couldn't a user simply use this to change the vaddr?  Does
> > that promote abusive use?  
> 
> This would almost always be incorrect.  If the vaddr changes, then the old vaddr was already
> invalidated, and there is a window where it is not OK for kernel code to use the old vaddr.
> This could only be safe if the memory object is mapped at both the old vaddr and the new
> vaddr concurrently, which is an unlikely use case.

Ok, it's not like the use can't make it instantaneously invalid and then
replace it.

> >> +			ret = -EINVAL;
> >> +		} else {
> >> +			dma->vaddr = vaddr;  
> > 
> > Seems like there's a huge opportunity for a user to create coherency
> > issues here... it's their data though I guess.  
> 
> Yes.  That's what the language in the uapi about mapping the same memory object is about.
> 
> >> +			dma->suspended = false;
> >> +		}
> >> +		goto out_unlock;
> >> +	} else if (dma) {
> >>  		ret = -EEXIST;
> >>  		goto out_unlock;
> >>  	}
> >> @@ -2532,6 +2564,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
> >>  	case VFIO_TYPE1_IOMMU:
> >>  	case VFIO_TYPE1v2_IOMMU:
> >>  	case VFIO_TYPE1_NESTING_IOMMU:
> >> +	case VFIO_SUSPEND:
> >>  		return 1;
> >>  	case VFIO_DMA_CC_IOMMU:
> >>  		if (!iommu)
> >> @@ -2686,7 +2719,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
> >>  {
> >>  	struct vfio_iommu_type1_dma_map map;
> >>  	unsigned long minsz;
> >> -	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> >> +	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
> >> +			VFIO_DMA_MAP_FLAG_RESUME;
> >>  
> >>  	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
> >>  
> >> @@ -2704,6 +2738,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
> >>  {
> >>  	struct vfio_iommu_type1_dma_unmap unmap;
> >>  	struct vfio_bitmap bitmap = { 0 };
> >> +	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
> >> +			VFIO_DMA_UNMAP_FLAG_SUSPEND;
> >>  	unsigned long minsz;
> >>  	int ret;
> >>  
> >> @@ -2712,8 +2748,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
> >>  	if (copy_from_user(&unmap, (void __user *)arg, minsz))
> >>  		return -EFAULT;
> >>  
> >> -	if (unmap.argsz < minsz ||
> >> -	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
> >> +	if (unmap.argsz < minsz || unmap.flags & ~mask || unmap.flags == mask)  
> > 
> > Maybe a short comment here to note that dirty-bimap and
> > suspend/invalidate are mutually exclusive.  Probably should be
> > mentioned in the uapi too.  
> 
> Will do, for both.
> 
> >>  		return -EINVAL;
> >>  
> >>  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 896e527..fcf7b56 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -46,6 +46,9 @@
> >>   */
> >>  #define VFIO_NOIOMMU_IOMMU		8
> >>  
> >> +/* Supports VFIO DMA suspend and resume */
> >> +#define VFIO_SUSPEND			9
> >> +
> >>  /*
> >>   * The IOCTL interface is designed for extensibility by embedding the
> >>   * structure length (argsz) and flags into structures passed between
> >> @@ -1046,12 +1049,19 @@ struct vfio_iommu_type1_info_cap_migration {
> >>   *
> >>   * Map process virtual addresses to IO virtual addresses using the
> >>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
> >> + *
> >> + * If flags & VFIO_DMA_MAP_FLAG_RESUME, record the new base vaddr for iova, and
> >> + * resume translation of host virtual addresses in the iova range.  The new
> >> + * vaddr must point to the same memory object as the old vaddr, but this is not
> >> + * verified.  
> > 
> > It's hard to use "must" terminology here if we're not going to check.
> > Maybe the phrasing should be something more along the lines of "should
> > point to the same memory object or the user risks coherency issues
> > within their virtual address space".  
> 
> I used "must" because it is always incorrect if the object is not the same.  How about:
>   The new vaddr must point to the same memory object as the old vaddr, but this is not
>   verified.  Violation of this constraint may result in memory corruption within the
>   host process and/or guest.

Since the "must" is not relative to the API but to the resulting
behavior, perhaps something like:

  In order to maintain memory consistency within the user application,
  the updated vaddr must address the same memory object as originally
  mapped, failure to do so will result in user memory corruption and/or
  device misbehavior.

Thanks,
Alex

> >>  iova and size must match those in the original MAP_DMA call.
> >> + * Protection is not changed, and the READ & WRITE flags must be 0.  
> > 
> > This doesn't mention that the entry must be previously
> > suspended/invalidated (if we choose to keep those semantics).  Thanks,  
> 
> Will add, thanks.
> 
> - Steve 
> >>   */
> >>  struct vfio_iommu_type1_dma_map {
> >>  	__u32	argsz;
> >>  	__u32	flags;
> >>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
> >>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
> >> +#define VFIO_DMA_MAP_FLAG_RESUME (1 << 2)
> >>  	__u64	vaddr;				/* Process virtual address */
> >>  	__u64	iova;				/* IO virtual address */
> >>  	__u64	size;				/* Size of mapping (bytes) */
> >> @@ -1084,11 +1094,17 @@ struct vfio_bitmap {
> >>   * indicates that the page at that offset from iova is dirty. A Bitmap of the
> >>   * pages in the range of unmapped size is returned in the user-provided
> >>   * vfio_bitmap.data.
> >> + *
> >> + * If flags & VFIO_DMA_UNMAP_FLAG_SUSPEND, do not unmap, but suspend vfio
> >> + * translation of host virtual addresses in the iova range.  During suspension,
> >> + * kernel threads that attempt to translate will block.  DMA to already-mapped
> >> + * pages continues.
> >>   */
> >>  struct vfio_iommu_type1_dma_unmap {
> >>  	__u32	argsz;
> >>  	__u32	flags;
> >>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> >> +#define VFIO_DMA_UNMAP_FLAG_SUSPEND	     (1 << 1)
> >>  	__u64	iova;				/* IO virtual address */
> >>  	__u64	size;				/* Size of mapping (bytes) */
> >>  	__u8    data[];  
> >   
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 4/5] vfio: VA suspend interface
  2021-01-12 22:47       ` Alex Williamson
@ 2021-01-13  4:10         ` Alex Williamson
  2021-01-13 18:02           ` Steven Sistare
  2021-01-13 18:01         ` Steven Sistare
  1 sibling, 1 reply; 39+ messages in thread
From: Alex Williamson @ 2021-01-13  4:10 UTC (permalink / raw)
  To: Steven Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Tue, 12 Jan 2021 15:47:56 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 11 Jan 2021 16:15:02 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
> > On 1/8/2021 4:15 PM, Alex Williamson wrote:  
> > > On Tue,  5 Jan 2021 07:36:52 -0800
> > > Steve Sistare <steven.sistare@oracle.com> wrote:
> > >     
> > >> Add interfaces that allow the underlying memory object of an iova
> > >> range to be mapped to a new host virtual address in the host process:
> > >>
> > >>   - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
> > >>   - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
> > >>   - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION    
> > > 
> > > Suspend and Resume can imply many things other than what's done here.
> > > Should these be something more akin to INVALIDATE_VADDR and
> > > REPLACE_VADDR?    
> > 
> > Agreed.  I suspected we would discuss the names.  Some possibilities:
> > 
> > INVALIDATE_VADDR  REPLACE_VADDR
> > INV_VADDR         SET_VADDR
> > CLEAR_VADDR       SET_VADDR
> > SUSPEND_VADDR     RESUME_VADDR
> >   
> > >> The suspend interface blocks vfio translation of host virtual
> > >> addresses in a range, but DMA to already-mapped pages continues.
> > >> The resume interface records the new base VA and resumes translation.
> > >> See comments in uapi/linux/vfio.h for more details.
> > >>
> > >> This is a partial implementation.  Blocking is added in the next patch.
> > >>
> > >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > >> ---
> > >>  drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++++------
> > >>  include/uapi/linux/vfio.h       | 16 ++++++++++++++
> > >>  2 files changed, 57 insertions(+), 6 deletions(-)
> > >>
> > >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > >> index 3dc501d..2c164a6 100644
> > >> --- a/drivers/vfio/vfio_iommu_type1.c
> > >> +++ b/drivers/vfio/vfio_iommu_type1.c
> > >> @@ -92,6 +92,7 @@ struct vfio_dma {
> > >>  	int			prot;		/* IOMMU_READ/WRITE */
> > >>  	bool			iommu_mapped;
> > >>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
> > >> +	bool			suspended;    
> > > 
> > > Is there a value we could use for vfio_dma.vaddr that would always be
> > > considered invalid, ex. ULONG_MAX?      
> > 
> > Yes, that could replace the suspend flag.  That, plus changing the language from suspend
> > to invalidate, will probably yield equally understandable code.  I'll try it.  
> 
> Thinking about this further, if we defined a VFIO_IOMMU_TYPE1_INV_VADDR
> as part of the uapi, could we implement this with only a single flag on
> the DMA_MAP ioctl?  For example the user would call DMA_MAP with a flag
> to set the vaddr, first to the invalid valid, then to a new value.  It's
> always seemed a bit awkward to use DMA_UNMAP to invalidate the vaddr
> when the mapping is not actually unmapped.  That might lean towards an
> UPDATE or REPLACE flag.

I realized you really want to make use of the DMA_UNMAP ioctl in order
to use ranges, maybe we can make the mental model more coherent with an
unmap flag like VFIO_DMA_UNMAP_FLAG_VADDR_ONLY, ie. we're only asking
to unmap the vaddr.  The DMA_MAP side might take a similar VADDR_ONLY
flag to reset the vaddr.  That also retains your desired semantics that
we can't "resume" a vaddr that wasn't previously "suspended", we can't
map a vaddr that wasn't previously unmapped.

For the unmap-all problem, userspace already needs to work around this,
see for instance QEMU:

1b296c3def4b vfio: Don't issue full 2^64 unmap

So I wonder really how critical it is and whether it really would be
sufficient for userspace to track a high water mark for mappings.
Otherwise, I think I'm leaning towards a DMA_UNMAP flag like
VFIO_DMA_UNMAP_FLAG_UNMAP_ALL that would disregard the iova and size
fields to apply to all mappings.  Designating special values for iova or
size trigger extended behavior feels a bit hackish.  Thanks,

Alex
 
> > > We'd need to decide if we want to
> > > allow users to create mappings (mdev-only) using an initial invalid
> > > vaddr.    
> > 
> > Maybe.  Not sure yet.  
> 
> If we used the above, it almost seems strange not to allow it, but at
> the same time we don't really want to have different rules for
> different devices types.  An initially valid vaddr doesn't seem
> unreasonable... though we don't test it until the vendor driver tries
> to pin or rw pages w/o IOMMU backing.
>  
> > >>  	struct task_struct	*task;
> > >>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> > >>  	unsigned long		*bitmap;
> > >> @@ -1080,7 +1081,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > >>  	int ret = 0, retries = 0;
> > >>  	unsigned long pgshift;
> > >>  	dma_addr_t iova;
> > >> -	unsigned long size;
> > >> +	unsigned long size, consumed;    
> > > 
> > > This could be scoped into the branch below.    
> > 
> > OK.
> >   
> > >>  	mutex_lock(&iommu->lock);
> > >>  
> > >> @@ -1169,6 +1170,21 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > >>  		if (dma->task->mm != current->mm)
> > >>  			break;
> > >>  
> > >> +		if (unmap->flags & VFIO_DMA_UNMAP_FLAG_SUSPEND) {
> > >> +			if (dma->suspended) {
> > >> +				ret = -EINVAL;
> > >> +				goto unlock;
> > >> +			}    
> > > 
> > > This leaves us in a state where we marked some entries but not others.
> > > We should either unwind or... what's the actual harm in skipping these?    
> > 
> > We could skip them with no ill effect.  However, it likely means the app is confused
> > and potentially broken, and it would be courteous to inform them so.  I found such bugs
> > in qemu as I was developing this feature.
> > 
> > IMO unwinding does not help the app, and adds unnecessary code.  It can still leave some
> > ranges suspended and some not.  The safest recovery is for the app to exit, and tell the 
> > developer to fix the redundant suspend call.  
> 
> That sounds like an entirely practical rationalization, but our
> standard practice is to maintain a consistent state.  If an ioctl fails
> is should effectively be as if the ioctl was never called, where
> possible.  Userspace can be broken, and potentially so broken that their
> best choice is to abort, but we should maintain consistent, predictable
> behavior.
> 
> > >> +			dma->suspended = true;
> > >> +			consumed = dma->iova + dma->size - iova;
> > >> +			if (consumed >= size)
> > >> +				break;
> > >> +			iova += consumed;
> > >> +			size -= consumed;
> > >> +			unmapped += dma->size;
> > >> +			continue;
> > >> +		}    
> > > 
> > > This short-cuts the dirty bitmap flag, so we need to decide if it's
> > > legal to call them together or we need to prevent it... Oh, I see
> > > you've excluded them earlier below.
> > >     
> > >> +
> > >>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
> > >>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
> > >>  
> > >> @@ -1307,6 +1323,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
> > >>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > >>  			   struct vfio_iommu_type1_dma_map *map)
> > >>  {
> > >> +	bool resume = map->flags & VFIO_DMA_MAP_FLAG_RESUME;
> > >>  	dma_addr_t iova = map->iova;
> > >>  	unsigned long vaddr = map->vaddr;
> > >>  	size_t size = map->size;
> > >> @@ -1324,13 +1341,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > >>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
> > >>  		prot |= IOMMU_READ;
> > >>  
> > >> +	if ((prot && resume) || (!prot && !resume))
> > >> +		return -EINVAL;
> > >> +
> > >>  	mutex_lock(&iommu->lock);
> > >>  
> > >>  	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
> > >>  
> > >>  	WARN_ON((pgsize - 1) & PAGE_MASK);
> > >>  
> > >> -	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
> > >> +	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
> > >>  		ret = -EINVAL;
> > >>  		goto out_unlock;
> > >>  	}
> > >> @@ -1341,7 +1361,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > >>  		goto out_unlock;
> > >>  	}
> > >>  
> > >> -	if (vfio_find_dma(iommu, iova, size)) {
> > >> +	dma = vfio_find_dma(iommu, iova, size);
> > >> +	if (resume) {
> > >> +		if (!dma) {
> > >> +			ret = -ENOENT;
> > >> +		} else if (!dma->suspended || dma->iova != iova ||
> > >> +			   dma->size != size) {    
> > > 
> > > Why is it necessary that the vfio_dma be suspended before being
> > > resumed?  Couldn't a user simply use this to change the vaddr?  Does
> > > that promote abusive use?    
> > 
> > This would almost always be incorrect.  If the vaddr changes, then the old vaddr was already
> > invalidated, and there is a window where it is not OK for kernel code to use the old vaddr.
> > This could only be safe if the memory object is mapped at both the old vaddr and the new
> > vaddr concurrently, which is an unlikely use case.  
> 
> Ok, it's not like the use can't make it instantaneously invalid and then
> replace it.
> 
> > >> +			ret = -EINVAL;
> > >> +		} else {
> > >> +			dma->vaddr = vaddr;    
> > > 
> > > Seems like there's a huge opportunity for a user to create coherency
> > > issues here... it's their data though I guess.    
> > 
> > Yes.  That's what the language in the uapi about mapping the same memory object is about.
> >   
> > >> +			dma->suspended = false;
> > >> +		}
> > >> +		goto out_unlock;
> > >> +	} else if (dma) {
> > >>  		ret = -EEXIST;
> > >>  		goto out_unlock;
> > >>  	}
> > >> @@ -2532,6 +2564,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
> > >>  	case VFIO_TYPE1_IOMMU:
> > >>  	case VFIO_TYPE1v2_IOMMU:
> > >>  	case VFIO_TYPE1_NESTING_IOMMU:
> > >> +	case VFIO_SUSPEND:
> > >>  		return 1;
> > >>  	case VFIO_DMA_CC_IOMMU:
> > >>  		if (!iommu)
> > >> @@ -2686,7 +2719,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
> > >>  {
> > >>  	struct vfio_iommu_type1_dma_map map;
> > >>  	unsigned long minsz;
> > >> -	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> > >> +	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
> > >> +			VFIO_DMA_MAP_FLAG_RESUME;
> > >>  
> > >>  	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
> > >>  
> > >> @@ -2704,6 +2738,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
> > >>  {
> > >>  	struct vfio_iommu_type1_dma_unmap unmap;
> > >>  	struct vfio_bitmap bitmap = { 0 };
> > >> +	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
> > >> +			VFIO_DMA_UNMAP_FLAG_SUSPEND;
> > >>  	unsigned long minsz;
> > >>  	int ret;
> > >>  
> > >> @@ -2712,8 +2748,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
> > >>  	if (copy_from_user(&unmap, (void __user *)arg, minsz))
> > >>  		return -EFAULT;
> > >>  
> > >> -	if (unmap.argsz < minsz ||
> > >> -	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
> > >> +	if (unmap.argsz < minsz || unmap.flags & ~mask || unmap.flags == mask)    
> > > 
> > > Maybe a short comment here to note that dirty-bimap and
> > > suspend/invalidate are mutually exclusive.  Probably should be
> > > mentioned in the uapi too.    
> > 
> > Will do, for both.
> >   
> > >>  		return -EINVAL;
> > >>  
> > >>  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
> > >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > >> index 896e527..fcf7b56 100644
> > >> --- a/include/uapi/linux/vfio.h
> > >> +++ b/include/uapi/linux/vfio.h
> > >> @@ -46,6 +46,9 @@
> > >>   */
> > >>  #define VFIO_NOIOMMU_IOMMU		8
> > >>  
> > >> +/* Supports VFIO DMA suspend and resume */
> > >> +#define VFIO_SUSPEND			9
> > >> +
> > >>  /*
> > >>   * The IOCTL interface is designed for extensibility by embedding the
> > >>   * structure length (argsz) and flags into structures passed between
> > >> @@ -1046,12 +1049,19 @@ struct vfio_iommu_type1_info_cap_migration {
> > >>   *
> > >>   * Map process virtual addresses to IO virtual addresses using the
> > >>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
> > >> + *
> > >> + * If flags & VFIO_DMA_MAP_FLAG_RESUME, record the new base vaddr for iova, and
> > >> + * resume translation of host virtual addresses in the iova range.  The new
> > >> + * vaddr must point to the same memory object as the old vaddr, but this is not
> > >> + * verified.    
> > > 
> > > It's hard to use "must" terminology here if we're not going to check.
> > > Maybe the phrasing should be something more along the lines of "should
> > > point to the same memory object or the user risks coherency issues
> > > within their virtual address space".    
> > 
> > I used "must" because it is always incorrect if the object is not the same.  How about:
> >   The new vaddr must point to the same memory object as the old vaddr, but this is not
> >   verified.  Violation of this constraint may result in memory corruption within the
> >   host process and/or guest.  
> 
> Since the "must" is not relative to the API but to the resulting
> behavior, perhaps something like:
> 
>   In order to maintain memory consistency within the user application,
>   the updated vaddr must address the same memory object as originally
>   mapped, failure to do so will result in user memory corruption and/or
>   device misbehavior.
> 
> Thanks,
> Alex
> 
> > >>  iova and size must match those in the original MAP_DMA call.
> > >> + * Protection is not changed, and the READ & WRITE flags must be 0.    
> > > 
> > > This doesn't mention that the entry must be previously
> > > suspended/invalidated (if we choose to keep those semantics).  Thanks,    
> > 
> > Will add, thanks.
> > 
> > - Steve   
> > >>   */
> > >>  struct vfio_iommu_type1_dma_map {
> > >>  	__u32	argsz;
> > >>  	__u32	flags;
> > >>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
> > >>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
> > >> +#define VFIO_DMA_MAP_FLAG_RESUME (1 << 2)
> > >>  	__u64	vaddr;				/* Process virtual address */
> > >>  	__u64	iova;				/* IO virtual address */
> > >>  	__u64	size;				/* Size of mapping (bytes) */
> > >> @@ -1084,11 +1094,17 @@ struct vfio_bitmap {
> > >>   * indicates that the page at that offset from iova is dirty. A Bitmap of the
> > >>   * pages in the range of unmapped size is returned in the user-provided
> > >>   * vfio_bitmap.data.
> > >> + *
> > >> + * If flags & VFIO_DMA_UNMAP_FLAG_SUSPEND, do not unmap, but suspend vfio
> > >> + * translation of host virtual addresses in the iova range.  During suspension,
> > >> + * kernel threads that attempt to translate will block.  DMA to already-mapped
> > >> + * pages continues.
> > >>   */
> > >>  struct vfio_iommu_type1_dma_unmap {
> > >>  	__u32	argsz;
> > >>  	__u32	flags;
> > >>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> > >> +#define VFIO_DMA_UNMAP_FLAG_SUSPEND	     (1 << 1)
> > >>  	__u64	iova;				/* IO virtual address */
> > >>  	__u64	size;				/* Size of mapping (bytes) */
> > >>  	__u8    data[];    
> > >     
> >   
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 4/5] vfio: VA suspend interface
  2021-01-12 22:09         ` Alex Williamson
@ 2021-01-13  4:37           ` Jason Zeng
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Zeng @ 2021-01-13  4:37 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Steven Sistare, kvm, Cornelia Huck, Kirti Wankhede

On Tue, Jan 12, 2021 at 03:09:14PM -0700, Alex Williamson wrote:
> On Tue, 12 Jan 2021 23:47:35 +0800
> Jason Zeng <jason.zeng@intel.com> wrote:
> 
> > On Mon, Jan 11, 2021 at 04:15:02PM -0500, Steven Sistare wrote:
> > > On 1/8/2021 4:15 PM, Alex Williamson wrote:  
> > > > On Tue,  5 Jan 2021 07:36:52 -0800
> > > > Steve Sistare <steven.sistare@oracle.com> wrote:
> > > >   
> > > >> Add interfaces that allow the underlying memory object of an iova
> > > >> range to be mapped to a new host virtual address in the host process:
> > > >>
> > > >>   - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
> > > >>   - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
> > > >>   - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION  
> > > > 
> > > > Suspend and Resume can imply many things other than what's done here.
> > > > Should these be something more akin to INVALIDATE_VADDR and
> > > > REPLACE_VADDR?  
> > > 
> > > Agreed.  I suspected we would discuss the names.  Some possibilities:
> > > 
> > > INVALIDATE_VADDR  REPLACE_VADDR
> > > INV_VADDR         SET_VADDR
> > > CLEAR_VADDR       SET_VADDR
> > > SUSPEND_VADDR     RESUME_VADDR
> > >   
> > 
> > What about SET_KEEPALIVE/CLEAR_KEEPALIVE? Vaddr can be updated as part
> > of CLEAR_KEEPALIVE.
> 
> I prefer names that describe what they do, not an arbitrary use case.
> 

Yes, agree.

> > Actually we are keeping the DMA mappings alive, instead of suspending
> > them, when new Qemu are exec'ing, because the hardware can still do
> > DMA during this period.
> 
> Why would they not stay alive when the vfio file descriptors remain
> open and nothing has been unmapped?
>  

I was thinking usually we teardown vfio fds in normal cases when we
switch the device ownership from one qemu to another qemu, which means
device is not alive during the switching period. Here we keep these
fds remaining open so they can be called keepalive. :)

> > This name will also be friendly to VMM Fast Restart [1], which aims to
> > kexec reboot host kernel while VMs are paused in memory and resumed
> > seamlessly after the new kernel is ready.
> > 
> > The VMM Fast Restart scenario is as such: Before kexec-reboot,
> > SET_KEEPALIVE is issued to set keepalive flag for VFIO DMA mapping
> > entries, so that their underlying IOMMU DMA mappings will not be
> > destroyed when Qemu quits before kexec-reboot. After kexec-reboot,
> > the VFIO DMA mappings are restored together with its keepalive flag
> > in the new kernel so that new Qemu can do CLEAR_KEEPALIVE operation.
> 
> That entails far more than simply invalidating vaddr.  We need to have
> a very specific uapi, it's difficult to expand the scope of a flag
> when/if such features are ever proposed upstream.  Thanks,
> 

Agree, thanks!

Jason

> Alex
> 
> > > >> The suspend interface blocks vfio translation of host virtual
> > > >> addresses in a range, but DMA to already-mapped pages continues.
> > > >> The resume interface records the new base VA and resumes translation.
> > > >> See comments in uapi/linux/vfio.h for more details.
> > > >>
> > > >> This is a partial implementation.  Blocking is added in the next patch.
> > > >>
> > > >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > > >> ---
> > > >>  drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++++------
> > > >>  include/uapi/linux/vfio.h       | 16 ++++++++++++++
> > > >>  2 files changed, 57 insertions(+), 6 deletions(-)
> > > >>
> > > >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > >> index 3dc501d..2c164a6 100644
> > > >> --- a/drivers/vfio/vfio_iommu_type1.c
> > > >> +++ b/drivers/vfio/vfio_iommu_type1.c
> > > >> @@ -92,6 +92,7 @@ struct vfio_dma {
> > > >>  	int			prot;		/* IOMMU_READ/WRITE */
> > > >>  	bool			iommu_mapped;
> > > >>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
> > > >> +	bool			suspended;  
> > > > 
> > > > Is there a value we could use for vfio_dma.vaddr that would always be
> > > > considered invalid, ex. ULONG_MAX?    
> > > 
> > > Yes, that could replace the suspend flag.  That, plus changing the language from suspend
> > > to invalidate, will probably yield equally understandable code.  I'll try it.
> > >   
> > > > We'd need to decide if we want to
> > > > allow users to create mappings (mdev-only) using an initial invalid
> > > > vaddr.  
> > > 
> > > Maybe.  Not sure yet.
> > >   
> > > >>  	struct task_struct	*task;
> > > >>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> > > >>  	unsigned long		*bitmap;
> > > >> @@ -1080,7 +1081,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > > >>  	int ret = 0, retries = 0;
> > > >>  	unsigned long pgshift;
> > > >>  	dma_addr_t iova;
> > > >> -	unsigned long size;
> > > >> +	unsigned long size, consumed;  
> > > > 
> > > > This could be scoped into the branch below.  
> > > 
> > > OK.
> > >   
> > > >>  	mutex_lock(&iommu->lock);
> > > >>  
> > > >> @@ -1169,6 +1170,21 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > > >>  		if (dma->task->mm != current->mm)
> > > >>  			break;
> > > >>  
> > > >> +		if (unmap->flags & VFIO_DMA_UNMAP_FLAG_SUSPEND) {
> > > >> +			if (dma->suspended) {
> > > >> +				ret = -EINVAL;
> > > >> +				goto unlock;
> > > >> +			}  
> > > > 
> > > > This leaves us in a state where we marked some entries but not others.
> > > > We should either unwind or... what's the actual harm in skipping these?  
> > > 
> > > We could skip them with no ill effect.  However, it likely means the app is confused
> > > and potentially broken, and it would be courteous to inform them so.  I found such bugs
> > > in qemu as I was developing this feature.
> > > 
> > > IMO unwinding does not help the app, and adds unnecessary code.  It can still leave some
> > > ranges suspended and some not.  The safest recovery is for the app to exit, and tell the 
> > > developer to fix the redundant suspend call.
> > >   
> > > >> +			dma->suspended = true;
> > > >> +			consumed = dma->iova + dma->size - iova;
> > > >> +			if (consumed >= size)
> > > >> +				break;
> > > >> +			iova += consumed;
> > > >> +			size -= consumed;
> > > >> +			unmapped += dma->size;
> > > >> +			continue;
> > > >> +		}  
> > > > 
> > > > This short-cuts the dirty bitmap flag, so we need to decide if it's
> > > > legal to call them together or we need to prevent it... Oh, I see
> > > > you've excluded them earlier below.
> > > >   
> > > >> +
> > > >>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
> > > >>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
> > > >>  
> > > >> @@ -1307,6 +1323,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
> > > >>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > > >>  			   struct vfio_iommu_type1_dma_map *map)
> > > >>  {
> > > >> +	bool resume = map->flags & VFIO_DMA_MAP_FLAG_RESUME;
> > > >>  	dma_addr_t iova = map->iova;
> > > >>  	unsigned long vaddr = map->vaddr;
> > > >>  	size_t size = map->size;
> > > >> @@ -1324,13 +1341,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > > >>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
> > > >>  		prot |= IOMMU_READ;
> > > >>  
> > > >> +	if ((prot && resume) || (!prot && !resume))
> > > >> +		return -EINVAL;
> > > >> +
> > > >>  	mutex_lock(&iommu->lock);
> > > >>  
> > > >>  	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
> > > >>  
> > > >>  	WARN_ON((pgsize - 1) & PAGE_MASK);
> > > >>  
> > > >> -	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
> > > >> +	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
> > > >>  		ret = -EINVAL;
> > > >>  		goto out_unlock;
> > > >>  	}
> > > >> @@ -1341,7 +1361,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > > >>  		goto out_unlock;
> > > >>  	}
> > > >>  
> > > >> -	if (vfio_find_dma(iommu, iova, size)) {
> > > >> +	dma = vfio_find_dma(iommu, iova, size);
> > > >> +	if (resume) {
> > > >> +		if (!dma) {
> > > >> +			ret = -ENOENT;
> > > >> +		} else if (!dma->suspended || dma->iova != iova ||
> > > >> +			   dma->size != size) {  
> > > > 
> > > > Why is it necessary that the vfio_dma be suspended before being
> > > > resumed?  Couldn't a user simply use this to change the vaddr?  Does
> > > > that promote abusive use?  
> > > 
> > > This would almost always be incorrect.  If the vaddr changes, then the old vaddr was already
> > > invalidated, and there is a window where it is not OK for kernel code to use the old vaddr.
> > > This could only be safe if the memory object is mapped at both the old vaddr and the new
> > > vaddr concurrently, which is an unlikely use case.
> > >   
> > > >> +			ret = -EINVAL;
> > > >> +		} else {
> > > >> +			dma->vaddr = vaddr;  
> > > > 
> > > > Seems like there's a huge opportunity for a user to create coherency
> > > > issues here... it's their data though I guess.  
> > > 
> > > Yes.  That's what the language in the uapi about mapping the same memory object is about.
> > >   
> > > >> +			dma->suspended = false;
> > > >> +		}
> > > >> +		goto out_unlock;
> > > >> +	} else if (dma) {
> > > >>  		ret = -EEXIST;
> > > >>  		goto out_unlock;
> > > >>  	}
> > > >> @@ -2532,6 +2564,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
> > > >>  	case VFIO_TYPE1_IOMMU:
> > > >>  	case VFIO_TYPE1v2_IOMMU:
> > > >>  	case VFIO_TYPE1_NESTING_IOMMU:
> > > >> +	case VFIO_SUSPEND:
> > > >>  		return 1;
> > > >>  	case VFIO_DMA_CC_IOMMU:
> > > >>  		if (!iommu)
> > > >> @@ -2686,7 +2719,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
> > > >>  {
> > > >>  	struct vfio_iommu_type1_dma_map map;
> > > >>  	unsigned long minsz;
> > > >> -	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> > > >> +	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
> > > >> +			VFIO_DMA_MAP_FLAG_RESUME;
> > > >>  
> > > >>  	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
> > > >>  
> > > >> @@ -2704,6 +2738,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
> > > >>  {
> > > >>  	struct vfio_iommu_type1_dma_unmap unmap;
> > > >>  	struct vfio_bitmap bitmap = { 0 };
> > > >> +	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
> > > >> +			VFIO_DMA_UNMAP_FLAG_SUSPEND;
> > > >>  	unsigned long minsz;
> > > >>  	int ret;
> > > >>  
> > > >> @@ -2712,8 +2748,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
> > > >>  	if (copy_from_user(&unmap, (void __user *)arg, minsz))
> > > >>  		return -EFAULT;
> > > >>  
> > > >> -	if (unmap.argsz < minsz ||
> > > >> -	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
> > > >> +	if (unmap.argsz < minsz || unmap.flags & ~mask || unmap.flags == mask)  
> > > > 
> > > > Maybe a short comment here to note that dirty-bimap and
> > > > suspend/invalidate are mutually exclusive.  Probably should be
> > > > mentioned in the uapi too.  
> > > 
> > > Will do, for both.
> > >   
> > > >>  		return -EINVAL;
> > > >>  
> > > >>  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
> > > >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > > >> index 896e527..fcf7b56 100644
> > > >> --- a/include/uapi/linux/vfio.h
> > > >> +++ b/include/uapi/linux/vfio.h
> > > >> @@ -46,6 +46,9 @@
> > > >>   */
> > > >>  #define VFIO_NOIOMMU_IOMMU		8
> > > >>  
> > > >> +/* Supports VFIO DMA suspend and resume */
> > > >> +#define VFIO_SUSPEND			9
> > > >> +
> > > >>  /*
> > > >>   * The IOCTL interface is designed for extensibility by embedding the
> > > >>   * structure length (argsz) and flags into structures passed between
> > > >> @@ -1046,12 +1049,19 @@ struct vfio_iommu_type1_info_cap_migration {
> > > >>   *
> > > >>   * Map process virtual addresses to IO virtual addresses using the
> > > >>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
> > > >> + *
> > > >> + * If flags & VFIO_DMA_MAP_FLAG_RESUME, record the new base vaddr for iova, and
> > > >> + * resume translation of host virtual addresses in the iova range.  The new
> > > >> + * vaddr must point to the same memory object as the old vaddr, but this is not
> > > >> + * verified.  
> > > > 
> > > > It's hard to use "must" terminology here if we're not going to check.
> > > > Maybe the phrasing should be something more along the lines of "should
> > > > point to the same memory object or the user risks coherency issues
> > > > within their virtual address space".  
> > > 
> > > I used "must" because it is always incorrect if the object is not the same.  How about:
> > >   The new vaddr must point to the same memory object as the old vaddr, but this is not
> > >   verified.  Violation of this constraint may result in memory corruption within the
> > >   host process and/or guest.
> > >   
> > > >>  iova and size must match those in the original MAP_DMA call.
> > > >> + * Protection is not changed, and the READ & WRITE flags must be 0.  
> > > > 
> > > > This doesn't mention that the entry must be previously
> > > > suspended/invalidated (if we choose to keep those semantics).  Thanks,  
> > > 
> > > Will add, thanks.
> > > 
> > > - Steve   
> > > >>   */
> > > >>  struct vfio_iommu_type1_dma_map {
> > > >>  	__u32	argsz;
> > > >>  	__u32	flags;
> > > >>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
> > > >>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
> > > >> +#define VFIO_DMA_MAP_FLAG_RESUME (1 << 2)
> > > >>  	__u64	vaddr;				/* Process virtual address */
> > > >>  	__u64	iova;				/* IO virtual address */
> > > >>  	__u64	size;				/* Size of mapping (bytes) */
> > > >> @@ -1084,11 +1094,17 @@ struct vfio_bitmap {
> > > >>   * indicates that the page at that offset from iova is dirty. A Bitmap of the
> > > >>   * pages in the range of unmapped size is returned in the user-provided
> > > >>   * vfio_bitmap.data.
> > > >> + *
> > > >> + * If flags & VFIO_DMA_UNMAP_FLAG_SUSPEND, do not unmap, but suspend vfio
> > > >> + * translation of host virtual addresses in the iova range.  During suspension,
> > > >> + * kernel threads that attempt to translate will block.  DMA to already-mapped
> > > >> + * pages continues.
> > > >>   */
> > > >>  struct vfio_iommu_type1_dma_unmap {
> > > >>  	__u32	argsz;
> > > >>  	__u32	flags;
> > > >>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> > > >> +#define VFIO_DMA_UNMAP_FLAG_SUSPEND	     (1 << 1)
> > > >>  	__u64	iova;				/* IO virtual address */
> > > >>  	__u64	size;				/* Size of mapping (bytes) */
> > > >>  	__u8    data[];  
> > > >   
> > 
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 4/5] vfio: VA suspend interface
  2021-01-12 22:47       ` Alex Williamson
  2021-01-13  4:10         ` Alex Williamson
@ 2021-01-13 18:01         ` Steven Sistare
  1 sibling, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2021-01-13 18:01 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/12/2021 5:47 PM, Alex Williamson wrote:
> On Mon, 11 Jan 2021 16:15:02 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 1/8/2021 4:15 PM, Alex Williamson wrote:
>>> On Tue,  5 Jan 2021 07:36:52 -0800
>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>   
>>>> Add interfaces that allow the underlying memory object of an iova
>>>> range to be mapped to a new host virtual address in the host process:
>>>>
>>>>   - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
>>>>   - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
>>>>   - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION  
>>>
>>> Suspend and Resume can imply many things other than what's done here.
>>> Should these be something more akin to INVALIDATE_VADDR and
>>> REPLACE_VADDR?  
>>
>> Agreed.  I suspected we would discuss the names.  Some possibilities:
>>
>> INVALIDATE_VADDR  REPLACE_VADDR
>> INV_VADDR         SET_VADDR
>> CLEAR_VADDR       SET_VADDR
>> SUSPEND_VADDR     RESUME_VADDR
>>
>>>> The suspend interface blocks vfio translation of host virtual
>>>> addresses in a range, but DMA to already-mapped pages continues.
>>>> The resume interface records the new base VA and resumes translation.
>>>> See comments in uapi/linux/vfio.h for more details.
>>>>
>>>> This is a partial implementation.  Blocking is added in the next patch.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>  drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++++------
>>>>  include/uapi/linux/vfio.h       | 16 ++++++++++++++
>>>>  2 files changed, 57 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>> index 3dc501d..2c164a6 100644
>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>> @@ -92,6 +92,7 @@ struct vfio_dma {
>>>>  	int			prot;		/* IOMMU_READ/WRITE */
>>>>  	bool			iommu_mapped;
>>>>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
>>>> +	bool			suspended;  
>>>
>>> Is there a value we could use for vfio_dma.vaddr that would always be
>>> considered invalid, ex. ULONG_MAX?    
>>
>> Yes, that could replace the suspend flag.  That, plus changing the language from suspend
>> to invalidate, will probably yield equally understandable code.  I'll try it.
> 
> Thinking about this further, if we defined a VFIO_IOMMU_TYPE1_INV_VADDR
> as part of the uapi, could we implement this with only a single flag on
> the DMA_MAP ioctl?  For example the user would call DMA_MAP with a flag
> to set the vaddr, first to the invalid valid, then to a new value.  It's
> always seemed a bit awkward to use DMA_UNMAP to invalidate the vaddr
> when the mapping is not actually unmapped.  That might lean towards an
> UPDATE or REPLACE flag.

I like your VADDR_ONLY suggestion in your next email better.

>>> We'd need to decide if we want to
>>> allow users to create mappings (mdev-only) using an initial invalid
>>> vaddr.  
>>
>> Maybe.  Not sure yet.
> 
> If we used the above, it almost seems strange not to allow it, but at
> the same time we don't really want to have different rules for
> different devices types.  An initially valid vaddr doesn't seem
> unreasonable... though we don't test it until the vendor driver tries
> to pin or rw pages w/o IOMMU backing.
>  
>>>>  	struct task_struct	*task;
>>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>>>  	unsigned long		*bitmap;
>>>> @@ -1080,7 +1081,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>>  	int ret = 0, retries = 0;
>>>>  	unsigned long pgshift;
>>>>  	dma_addr_t iova;
>>>> -	unsigned long size;
>>>> +	unsigned long size, consumed;  
>>>
>>> This could be scoped into the branch below.  
>>
>> OK.
>>
>>>>  	mutex_lock(&iommu->lock);
>>>>  
>>>> @@ -1169,6 +1170,21 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>>  		if (dma->task->mm != current->mm)
>>>>  			break;
>>>>  
>>>> +		if (unmap->flags & VFIO_DMA_UNMAP_FLAG_SUSPEND) {
>>>> +			if (dma->suspended) {
>>>> +				ret = -EINVAL;
>>>> +				goto unlock;
>>>> +			}  
>>>
>>> This leaves us in a state where we marked some entries but not others.
>>> We should either unwind or... what's the actual harm in skipping these?  
>>
>> We could skip them with no ill effect.  However, it likely means the app is confused
>> and potentially broken, and it would be courteous to inform them so.  I found such bugs
>> in qemu as I was developing this feature.
>>
>> IMO unwinding does not help the app, and adds unnecessary code.  It can still leave some
>> ranges suspended and some not.  The safest recovery is for the app to exit, and tell the 
>> developer to fix the redundant suspend call.
> 
> That sounds like an entirely practical rationalization, but our
> standard practice is to maintain a consistent state.  If an ioctl fails
> is should effectively be as if the ioctl was never called, where
> possible.  Userspace can be broken, and potentially so broken that their
> best choice is to abort, but we should maintain consistent, predictable
> behavior.

Agreed, I take it back.  Or unwind it?
What's that Big Chill quote: ever gone a week without a rationalization?

I'll unwind and return EINVAL.

>>>> +			dma->suspended = true;
>>>> +			consumed = dma->iova + dma->size - iova;
>>>> +			if (consumed >= size)
>>>> +				break;
>>>> +			iova += consumed;
>>>> +			size -= consumed;
>>>> +			unmapped += dma->size;
>>>> +			continue;
>>>> +		}  
>>>
>>> This short-cuts the dirty bitmap flag, so we need to decide if it's
>>> legal to call them together or we need to prevent it... Oh, I see
>>> you've excluded them earlier below.
>>>   
>>>> +
>>>>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
>>>>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
>>>>  
>>>> @@ -1307,6 +1323,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
>>>>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>  			   struct vfio_iommu_type1_dma_map *map)
>>>>  {
>>>> +	bool resume = map->flags & VFIO_DMA_MAP_FLAG_RESUME;
>>>>  	dma_addr_t iova = map->iova;
>>>>  	unsigned long vaddr = map->vaddr;
>>>>  	size_t size = map->size;
>>>> @@ -1324,13 +1341,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
>>>>  		prot |= IOMMU_READ;
>>>>  
>>>> +	if ((prot && resume) || (!prot && !resume))
>>>> +		return -EINVAL;
>>>> +
>>>>  	mutex_lock(&iommu->lock);
>>>>  
>>>>  	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
>>>>  
>>>>  	WARN_ON((pgsize - 1) & PAGE_MASK);
>>>>  
>>>> -	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
>>>> +	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
>>>>  		ret = -EINVAL;
>>>>  		goto out_unlock;
>>>>  	}
>>>> @@ -1341,7 +1361,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>  		goto out_unlock;
>>>>  	}
>>>>  
>>>> -	if (vfio_find_dma(iommu, iova, size)) {
>>>> +	dma = vfio_find_dma(iommu, iova, size);
>>>> +	if (resume) {
>>>> +		if (!dma) {
>>>> +			ret = -ENOENT;
>>>> +		} else if (!dma->suspended || dma->iova != iova ||
>>>> +			   dma->size != size) {  
>>>
>>> Why is it necessary that the vfio_dma be suspended before being
>>> resumed?  Couldn't a user simply use this to change the vaddr?  Does
>>> that promote abusive use?  
>>
>> This would almost always be incorrect.  If the vaddr changes, then the old vaddr was already
>> invalidated, and there is a window where it is not OK for kernel code to use the old vaddr.
>> This could only be safe if the memory object is mapped at both the old vaddr and the new
>> vaddr concurrently, which is an unlikely use case.
> 
> Ok, it's not like the use can't make it instantaneously invalid and then
> replace it.
> 
>>>> +			ret = -EINVAL;
>>>> +		} else {
>>>> +			dma->vaddr = vaddr;  
>>>
>>> Seems like there's a huge opportunity for a user to create coherency
>>> issues here... it's their data though I guess.  
>>
>> Yes.  That's what the language in the uapi about mapping the same memory object is about.
>>
>>>> +			dma->suspended = false;
>>>> +		}
>>>> +		goto out_unlock;
>>>> +	} else if (dma) {
>>>>  		ret = -EEXIST;
>>>>  		goto out_unlock;
>>>>  	}
>>>> @@ -2532,6 +2564,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>>>>  	case VFIO_TYPE1_IOMMU:
>>>>  	case VFIO_TYPE1v2_IOMMU:
>>>>  	case VFIO_TYPE1_NESTING_IOMMU:
>>>> +	case VFIO_SUSPEND:
>>>>  		return 1;
>>>>  	case VFIO_DMA_CC_IOMMU:
>>>>  		if (!iommu)
>>>> @@ -2686,7 +2719,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
>>>>  {
>>>>  	struct vfio_iommu_type1_dma_map map;
>>>>  	unsigned long minsz;
>>>> -	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>>>> +	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
>>>> +			VFIO_DMA_MAP_FLAG_RESUME;
>>>>  
>>>>  	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>>>>  
>>>> @@ -2704,6 +2738,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>>>>  {
>>>>  	struct vfio_iommu_type1_dma_unmap unmap;
>>>>  	struct vfio_bitmap bitmap = { 0 };
>>>> +	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
>>>> +			VFIO_DMA_UNMAP_FLAG_SUSPEND;
>>>>  	unsigned long minsz;
>>>>  	int ret;
>>>>  
>>>> @@ -2712,8 +2748,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>>>>  	if (copy_from_user(&unmap, (void __user *)arg, minsz))
>>>>  		return -EFAULT;
>>>>  
>>>> -	if (unmap.argsz < minsz ||
>>>> -	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
>>>> +	if (unmap.argsz < minsz || unmap.flags & ~mask || unmap.flags == mask)  
>>>
>>> Maybe a short comment here to note that dirty-bimap and
>>> suspend/invalidate are mutually exclusive.  Probably should be
>>> mentioned in the uapi too.  
>>
>> Will do, for both.
>>
>>>>  		return -EINVAL;
>>>>  
>>>>  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 896e527..fcf7b56 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -46,6 +46,9 @@
>>>>   */
>>>>  #define VFIO_NOIOMMU_IOMMU		8
>>>>  
>>>> +/* Supports VFIO DMA suspend and resume */
>>>> +#define VFIO_SUSPEND			9
>>>> +
>>>>  /*
>>>>   * The IOCTL interface is designed for extensibility by embedding the
>>>>   * structure length (argsz) and flags into structures passed between
>>>> @@ -1046,12 +1049,19 @@ struct vfio_iommu_type1_info_cap_migration {
>>>>   *
>>>>   * Map process virtual addresses to IO virtual addresses using the
>>>>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
>>>> + *
>>>> + * If flags & VFIO_DMA_MAP_FLAG_RESUME, record the new base vaddr for iova, and
>>>> + * resume translation of host virtual addresses in the iova range.  The new
>>>> + * vaddr must point to the same memory object as the old vaddr, but this is not
>>>> + * verified.  
>>>
>>> It's hard to use "must" terminology here if we're not going to check.
>>> Maybe the phrasing should be something more along the lines of "should
>>> point to the same memory object or the user risks coherency issues
>>> within their virtual address space".  
>>
>> I used "must" because it is always incorrect if the object is not the same.  How about:
>>   The new vaddr must point to the same memory object as the old vaddr, but this is not
>>   verified.  Violation of this constraint may result in memory corruption within the
>>   host process and/or guest.
> 
> Since the "must" is not relative to the API but to the resulting
> behavior, perhaps something like:
> 
>   In order to maintain memory consistency within the user application,
>   the updated vaddr must address the same memory object as originally
>   mapped, failure to do so will result in user memory corruption and/or
>   device misbehavior.

Sounds good.

- Steve

>>>>  iova and size must match those in the original MAP_DMA call.
>>>> + * Protection is not changed, and the READ & WRITE flags must be 0.  
>>>
>>> This doesn't mention that the entry must be previously
>>> suspended/invalidated (if we choose to keep those semantics).  Thanks,  
>>
>> Will add, thanks.
>>
>> - Steve 
>>>>   */
>>>>  struct vfio_iommu_type1_dma_map {
>>>>  	__u32	argsz;
>>>>  	__u32	flags;
>>>>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
>>>>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
>>>> +#define VFIO_DMA_MAP_FLAG_RESUME (1 << 2)
>>>>  	__u64	vaddr;				/* Process virtual address */
>>>>  	__u64	iova;				/* IO virtual address */
>>>>  	__u64	size;				/* Size of mapping (bytes) */
>>>> @@ -1084,11 +1094,17 @@ struct vfio_bitmap {
>>>>   * indicates that the page at that offset from iova is dirty. A Bitmap of the
>>>>   * pages in the range of unmapped size is returned in the user-provided
>>>>   * vfio_bitmap.data.
>>>> + *
>>>> + * If flags & VFIO_DMA_UNMAP_FLAG_SUSPEND, do not unmap, but suspend vfio
>>>> + * translation of host virtual addresses in the iova range.  During suspension,
>>>> + * kernel threads that attempt to translate will block.  DMA to already-mapped
>>>> + * pages continues.
>>>>   */
>>>>  struct vfio_iommu_type1_dma_unmap {
>>>>  	__u32	argsz;
>>>>  	__u32	flags;
>>>>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
>>>> +#define VFIO_DMA_UNMAP_FLAG_SUSPEND	     (1 << 1)
>>>>  	__u64	iova;				/* IO virtual address */
>>>>  	__u64	size;				/* Size of mapping (bytes) */
>>>>  	__u8    data[];  
>>>   
>>
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 4/5] vfio: VA suspend interface
  2021-01-13  4:10         ` Alex Williamson
@ 2021-01-13 18:02           ` Steven Sistare
  2021-01-13 18:34             ` Alex Williamson
  0 siblings, 1 reply; 39+ messages in thread
From: Steven Sistare @ 2021-01-13 18:02 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/12/2021 11:10 PM, Alex Williamson wrote:
> On Tue, 12 Jan 2021 15:47:56 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
>> On Mon, 11 Jan 2021 16:15:02 -0500
>> Steven Sistare <steven.sistare@oracle.com> wrote:
>>
>>> On 1/8/2021 4:15 PM, Alex Williamson wrote:  
>>>> On Tue,  5 Jan 2021 07:36:52 -0800
>>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>>     
>>>>> Add interfaces that allow the underlying memory object of an iova
>>>>> range to be mapped to a new host virtual address in the host process:
>>>>>
>>>>>   - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
>>>>>   - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
>>>>>   - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION    
>>>>
>>>> Suspend and Resume can imply many things other than what's done here.
>>>> Should these be something more akin to INVALIDATE_VADDR and
>>>> REPLACE_VADDR?    
>>>
>>> Agreed.  I suspected we would discuss the names.  Some possibilities:
>>>
>>> INVALIDATE_VADDR  REPLACE_VADDR
>>> INV_VADDR         SET_VADDR
>>> CLEAR_VADDR       SET_VADDR
>>> SUSPEND_VADDR     RESUME_VADDR
>>>   
>>>>> The suspend interface blocks vfio translation of host virtual
>>>>> addresses in a range, but DMA to already-mapped pages continues.
>>>>> The resume interface records the new base VA and resumes translation.
>>>>> See comments in uapi/linux/vfio.h for more details.
>>>>>
>>>>> This is a partial implementation.  Blocking is added in the next patch.
>>>>>
>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>> ---
>>>>>  drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++++------
>>>>>  include/uapi/linux/vfio.h       | 16 ++++++++++++++
>>>>>  2 files changed, 57 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>> index 3dc501d..2c164a6 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -92,6 +92,7 @@ struct vfio_dma {
>>>>>  	int			prot;		/* IOMMU_READ/WRITE */
>>>>>  	bool			iommu_mapped;
>>>>>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
>>>>> +	bool			suspended;    
>>>>
>>>> Is there a value we could use for vfio_dma.vaddr that would always be
>>>> considered invalid, ex. ULONG_MAX?      
>>>
>>> Yes, that could replace the suspend flag.  That, plus changing the language from suspend
>>> to invalidate, will probably yield equally understandable code.  I'll try it.  
>>
>> Thinking about this further, if we defined a VFIO_IOMMU_TYPE1_INV_VADDR
>> as part of the uapi, could we implement this with only a single flag on
>> the DMA_MAP ioctl?  For example the user would call DMA_MAP with a flag
>> to set the vaddr, first to the invalid valid, then to a new value.  It's
>> always seemed a bit awkward to use DMA_UNMAP to invalidate the vaddr
>> when the mapping is not actually unmapped.  That might lean towards an
>> UPDATE or REPLACE flag.
> 
> I realized you really want to make use of the DMA_UNMAP ioctl in order
> to use ranges, maybe we can make the mental model more coherent with an
> unmap flag like VFIO_DMA_UNMAP_FLAG_VADDR_ONLY, ie. we're only asking
> to unmap the vaddr.  The DMA_MAP side might take a similar VADDR_ONLY
> flag to reset the vaddr.  That also retains your desired semantics that
> we can't "resume" a vaddr that wasn't previously "suspended", we can't
> map a vaddr that wasn't previously unmapped.

I like this.

> For the unmap-all problem, userspace already needs to work around this,
> see for instance QEMU:
> 
> 1b296c3def4b vfio: Don't issue full 2^64 unmap

Yes, I saw that.  And one cannot split the range in two at some arbitrary 
mmu-page-size boundary, least you split a mapping, which V2 does not allow.

> So I wonder really how critical it is and whether it really would be
> sufficient for userspace to track a high water mark for mappings.

If the high water mark is 2^64, then you also need a low water mark to 
enable a single call, else one call cannot span the range.  And low water
better not be 0.  And I would rather not add code to compute either one :)

> Otherwise, I think I'm leaning towards a DMA_UNMAP flag like
> VFIO_DMA_UNMAP_FLAG_UNMAP_ALL that would disregard the iova and size
> fields to apply to all mappings.  Designating special values for iova or
> size trigger extended behavior feels a bit hackish.  Thanks,

The flag sounds best to me.

- Steve

>>>> We'd need to decide if we want to
>>>> allow users to create mappings (mdev-only) using an initial invalid
>>>> vaddr.    
>>>
>>> Maybe.  Not sure yet.  
>>
>> If we used the above, it almost seems strange not to allow it, but at
>> the same time we don't really want to have different rules for
>> different devices types.  An initially valid vaddr doesn't seem
>> unreasonable... though we don't test it until the vendor driver tries
>> to pin or rw pages w/o IOMMU backing.
>>  
>>>>>  	struct task_struct	*task;
>>>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
>>>>>  	unsigned long		*bitmap;
>>>>> @@ -1080,7 +1081,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>>>  	int ret = 0, retries = 0;
>>>>>  	unsigned long pgshift;
>>>>>  	dma_addr_t iova;
>>>>> -	unsigned long size;
>>>>> +	unsigned long size, consumed;    
>>>>
>>>> This could be scoped into the branch below.    
>>>
>>> OK.
>>>   
>>>>>  	mutex_lock(&iommu->lock);
>>>>>  
>>>>> @@ -1169,6 +1170,21 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>>>>>  		if (dma->task->mm != current->mm)
>>>>>  			break;
>>>>>  
>>>>> +		if (unmap->flags & VFIO_DMA_UNMAP_FLAG_SUSPEND) {
>>>>> +			if (dma->suspended) {
>>>>> +				ret = -EINVAL;
>>>>> +				goto unlock;
>>>>> +			}    
>>>>
>>>> This leaves us in a state where we marked some entries but not others.
>>>> We should either unwind or... what's the actual harm in skipping these?    
>>>
>>> We could skip them with no ill effect.  However, it likely means the app is confused
>>> and potentially broken, and it would be courteous to inform them so.  I found such bugs
>>> in qemu as I was developing this feature.
>>>
>>> IMO unwinding does not help the app, and adds unnecessary code.  It can still leave some
>>> ranges suspended and some not.  The safest recovery is for the app to exit, and tell the 
>>> developer to fix the redundant suspend call.  
>>
>> That sounds like an entirely practical rationalization, but our
>> standard practice is to maintain a consistent state.  If an ioctl fails
>> is should effectively be as if the ioctl was never called, where
>> possible.  Userspace can be broken, and potentially so broken that their
>> best choice is to abort, but we should maintain consistent, predictable
>> behavior.
>>
>>>>> +			dma->suspended = true;
>>>>> +			consumed = dma->iova + dma->size - iova;
>>>>> +			if (consumed >= size)
>>>>> +				break;
>>>>> +			iova += consumed;
>>>>> +			size -= consumed;
>>>>> +			unmapped += dma->size;
>>>>> +			continue;
>>>>> +		}    
>>>>
>>>> This short-cuts the dirty bitmap flag, so we need to decide if it's
>>>> legal to call them together or we need to prevent it... Oh, I see
>>>> you've excluded them earlier below.
>>>>     
>>>>> +
>>>>>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
>>>>>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
>>>>>  
>>>>> @@ -1307,6 +1323,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
>>>>>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>>  			   struct vfio_iommu_type1_dma_map *map)
>>>>>  {
>>>>> +	bool resume = map->flags & VFIO_DMA_MAP_FLAG_RESUME;
>>>>>  	dma_addr_t iova = map->iova;
>>>>>  	unsigned long vaddr = map->vaddr;
>>>>>  	size_t size = map->size;
>>>>> @@ -1324,13 +1341,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
>>>>>  		prot |= IOMMU_READ;
>>>>>  
>>>>> +	if ((prot && resume) || (!prot && !resume))
>>>>> +		return -EINVAL;
>>>>> +
>>>>>  	mutex_lock(&iommu->lock);
>>>>>  
>>>>>  	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
>>>>>  
>>>>>  	WARN_ON((pgsize - 1) & PAGE_MASK);
>>>>>  
>>>>> -	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
>>>>> +	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
>>>>>  		ret = -EINVAL;
>>>>>  		goto out_unlock;
>>>>>  	}
>>>>> @@ -1341,7 +1361,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>>>>>  		goto out_unlock;
>>>>>  	}
>>>>>  
>>>>> -	if (vfio_find_dma(iommu, iova, size)) {
>>>>> +	dma = vfio_find_dma(iommu, iova, size);
>>>>> +	if (resume) {
>>>>> +		if (!dma) {
>>>>> +			ret = -ENOENT;
>>>>> +		} else if (!dma->suspended || dma->iova != iova ||
>>>>> +			   dma->size != size) {    
>>>>
>>>> Why is it necessary that the vfio_dma be suspended before being
>>>> resumed?  Couldn't a user simply use this to change the vaddr?  Does
>>>> that promote abusive use?    
>>>
>>> This would almost always be incorrect.  If the vaddr changes, then the old vaddr was already
>>> invalidated, and there is a window where it is not OK for kernel code to use the old vaddr.
>>> This could only be safe if the memory object is mapped at both the old vaddr and the new
>>> vaddr concurrently, which is an unlikely use case.  
>>
>> Ok, it's not like the use can't make it instantaneously invalid and then
>> replace it.
>>
>>>>> +			ret = -EINVAL;
>>>>> +		} else {
>>>>> +			dma->vaddr = vaddr;    
>>>>
>>>> Seems like there's a huge opportunity for a user to create coherency
>>>> issues here... it's their data though I guess.    
>>>
>>> Yes.  That's what the language in the uapi about mapping the same memory object is about.
>>>   
>>>>> +			dma->suspended = false;
>>>>> +		}
>>>>> +		goto out_unlock;
>>>>> +	} else if (dma) {
>>>>>  		ret = -EEXIST;
>>>>>  		goto out_unlock;
>>>>>  	}
>>>>> @@ -2532,6 +2564,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
>>>>>  	case VFIO_TYPE1_IOMMU:
>>>>>  	case VFIO_TYPE1v2_IOMMU:
>>>>>  	case VFIO_TYPE1_NESTING_IOMMU:
>>>>> +	case VFIO_SUSPEND:
>>>>>  		return 1;
>>>>>  	case VFIO_DMA_CC_IOMMU:
>>>>>  		if (!iommu)
>>>>> @@ -2686,7 +2719,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
>>>>>  {
>>>>>  	struct vfio_iommu_type1_dma_map map;
>>>>>  	unsigned long minsz;
>>>>> -	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
>>>>> +	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
>>>>> +			VFIO_DMA_MAP_FLAG_RESUME;
>>>>>  
>>>>>  	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
>>>>>  
>>>>> @@ -2704,6 +2738,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>>>>>  {
>>>>>  	struct vfio_iommu_type1_dma_unmap unmap;
>>>>>  	struct vfio_bitmap bitmap = { 0 };
>>>>> +	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
>>>>> +			VFIO_DMA_UNMAP_FLAG_SUSPEND;
>>>>>  	unsigned long minsz;
>>>>>  	int ret;
>>>>>  
>>>>> @@ -2712,8 +2748,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
>>>>>  	if (copy_from_user(&unmap, (void __user *)arg, minsz))
>>>>>  		return -EFAULT;
>>>>>  
>>>>> -	if (unmap.argsz < minsz ||
>>>>> -	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
>>>>> +	if (unmap.argsz < minsz || unmap.flags & ~mask || unmap.flags == mask)    
>>>>
>>>> Maybe a short comment here to note that dirty-bimap and
>>>> suspend/invalidate are mutually exclusive.  Probably should be
>>>> mentioned in the uapi too.    
>>>
>>> Will do, for both.
>>>   
>>>>>  		return -EINVAL;
>>>>>  
>>>>>  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
>>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>> index 896e527..fcf7b56 100644
>>>>> --- a/include/uapi/linux/vfio.h
>>>>> +++ b/include/uapi/linux/vfio.h
>>>>> @@ -46,6 +46,9 @@
>>>>>   */
>>>>>  #define VFIO_NOIOMMU_IOMMU		8
>>>>>  
>>>>> +/* Supports VFIO DMA suspend and resume */
>>>>> +#define VFIO_SUSPEND			9
>>>>> +
>>>>>  /*
>>>>>   * The IOCTL interface is designed for extensibility by embedding the
>>>>>   * structure length (argsz) and flags into structures passed between
>>>>> @@ -1046,12 +1049,19 @@ struct vfio_iommu_type1_info_cap_migration {
>>>>>   *
>>>>>   * Map process virtual addresses to IO virtual addresses using the
>>>>>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
>>>>> + *
>>>>> + * If flags & VFIO_DMA_MAP_FLAG_RESUME, record the new base vaddr for iova, and
>>>>> + * resume translation of host virtual addresses in the iova range.  The new
>>>>> + * vaddr must point to the same memory object as the old vaddr, but this is not
>>>>> + * verified.    
>>>>
>>>> It's hard to use "must" terminology here if we're not going to check.
>>>> Maybe the phrasing should be something more along the lines of "should
>>>> point to the same memory object or the user risks coherency issues
>>>> within their virtual address space".    
>>>
>>> I used "must" because it is always incorrect if the object is not the same.  How about:
>>>   The new vaddr must point to the same memory object as the old vaddr, but this is not
>>>   verified.  Violation of this constraint may result in memory corruption within the
>>>   host process and/or guest.  
>>
>> Since the "must" is not relative to the API but to the resulting
>> behavior, perhaps something like:
>>
>>   In order to maintain memory consistency within the user application,
>>   the updated vaddr must address the same memory object as originally
>>   mapped, failure to do so will result in user memory corruption and/or
>>   device misbehavior.
>>
>> Thanks,
>> Alex
>>
>>>>>  iova and size must match those in the original MAP_DMA call.
>>>>> + * Protection is not changed, and the READ & WRITE flags must be 0.    
>>>>
>>>> This doesn't mention that the entry must be previously
>>>> suspended/invalidated (if we choose to keep those semantics).  Thanks,    
>>>
>>> Will add, thanks.
>>>
>>> - Steve   
>>>>>   */
>>>>>  struct vfio_iommu_type1_dma_map {
>>>>>  	__u32	argsz;
>>>>>  	__u32	flags;
>>>>>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
>>>>>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
>>>>> +#define VFIO_DMA_MAP_FLAG_RESUME (1 << 2)
>>>>>  	__u64	vaddr;				/* Process virtual address */
>>>>>  	__u64	iova;				/* IO virtual address */
>>>>>  	__u64	size;				/* Size of mapping (bytes) */
>>>>> @@ -1084,11 +1094,17 @@ struct vfio_bitmap {
>>>>>   * indicates that the page at that offset from iova is dirty. A Bitmap of the
>>>>>   * pages in the range of unmapped size is returned in the user-provided
>>>>>   * vfio_bitmap.data.
>>>>> + *
>>>>> + * If flags & VFIO_DMA_UNMAP_FLAG_SUSPEND, do not unmap, but suspend vfio
>>>>> + * translation of host virtual addresses in the iova range.  During suspension,
>>>>> + * kernel threads that attempt to translate will block.  DMA to already-mapped
>>>>> + * pages continues.
>>>>>   */
>>>>>  struct vfio_iommu_type1_dma_unmap {
>>>>>  	__u32	argsz;
>>>>>  	__u32	flags;
>>>>>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
>>>>> +#define VFIO_DMA_UNMAP_FLAG_SUSPEND	     (1 << 1)
>>>>>  	__u64	iova;				/* IO virtual address */
>>>>>  	__u64	size;				/* Size of mapping (bytes) */
>>>>>  	__u8    data[];    
>>>>     
>>>   
>>
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 4/5] vfio: VA suspend interface
  2021-01-13 18:02           ` Steven Sistare
@ 2021-01-13 18:34             ` Alex Williamson
  0 siblings, 0 replies; 39+ messages in thread
From: Alex Williamson @ 2021-01-13 18:34 UTC (permalink / raw)
  To: Steven Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Wed, 13 Jan 2021 13:02:04 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 1/12/2021 11:10 PM, Alex Williamson wrote:
> > On Tue, 12 Jan 2021 15:47:56 -0700
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> >> On Mon, 11 Jan 2021 16:15:02 -0500
> >> Steven Sistare <steven.sistare@oracle.com> wrote:
> >>  
> >>> On 1/8/2021 4:15 PM, Alex Williamson wrote:    
> >>>> On Tue,  5 Jan 2021 07:36:52 -0800
> >>>> Steve Sistare <steven.sistare@oracle.com> wrote:
> >>>>       
> >>>>> Add interfaces that allow the underlying memory object of an iova
> >>>>> range to be mapped to a new host virtual address in the host process:
> >>>>>
> >>>>>   - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
> >>>>>   - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
> >>>>>   - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION      
> >>>>
> >>>> Suspend and Resume can imply many things other than what's done here.
> >>>> Should these be something more akin to INVALIDATE_VADDR and
> >>>> REPLACE_VADDR?      
> >>>
> >>> Agreed.  I suspected we would discuss the names.  Some possibilities:
> >>>
> >>> INVALIDATE_VADDR  REPLACE_VADDR
> >>> INV_VADDR         SET_VADDR
> >>> CLEAR_VADDR       SET_VADDR
> >>> SUSPEND_VADDR     RESUME_VADDR
> >>>     
> >>>>> The suspend interface blocks vfio translation of host virtual
> >>>>> addresses in a range, but DMA to already-mapped pages continues.
> >>>>> The resume interface records the new base VA and resumes translation.
> >>>>> See comments in uapi/linux/vfio.h for more details.
> >>>>>
> >>>>> This is a partial implementation.  Blocking is added in the next patch.
> >>>>>
> >>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>>> ---
> >>>>>  drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++++------
> >>>>>  include/uapi/linux/vfio.h       | 16 ++++++++++++++
> >>>>>  2 files changed, 57 insertions(+), 6 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>>>> index 3dc501d..2c164a6 100644
> >>>>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>>>> @@ -92,6 +92,7 @@ struct vfio_dma {
> >>>>>  	int			prot;		/* IOMMU_READ/WRITE */
> >>>>>  	bool			iommu_mapped;
> >>>>>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
> >>>>> +	bool			suspended;      
> >>>>
> >>>> Is there a value we could use for vfio_dma.vaddr that would always be
> >>>> considered invalid, ex. ULONG_MAX?        
> >>>
> >>> Yes, that could replace the suspend flag.  That, plus changing the language from suspend
> >>> to invalidate, will probably yield equally understandable code.  I'll try it.    
> >>
> >> Thinking about this further, if we defined a VFIO_IOMMU_TYPE1_INV_VADDR
> >> as part of the uapi, could we implement this with only a single flag on
> >> the DMA_MAP ioctl?  For example the user would call DMA_MAP with a flag
> >> to set the vaddr, first to the invalid valid, then to a new value.  It's
> >> always seemed a bit awkward to use DMA_UNMAP to invalidate the vaddr
> >> when the mapping is not actually unmapped.  That might lean towards an
> >> UPDATE or REPLACE flag.  
> > 
> > I realized you really want to make use of the DMA_UNMAP ioctl in order
> > to use ranges, maybe we can make the mental model more coherent with an
> > unmap flag like VFIO_DMA_UNMAP_FLAG_VADDR_ONLY, ie. we're only asking
> > to unmap the vaddr.  The DMA_MAP side might take a similar VADDR_ONLY
> > flag to reset the vaddr.  That also retains your desired semantics that
> > we can't "resume" a vaddr that wasn't previously "suspended", we can't
> > map a vaddr that wasn't previously unmapped.  
> 
> I like this.
> 
> > For the unmap-all problem, userspace already needs to work around this,
> > see for instance QEMU:
> > 
> > 1b296c3def4b vfio: Don't issue full 2^64 unmap  
> 
> Yes, I saw that.  And one cannot split the range in two at some arbitrary 
> mmu-page-size boundary, least you split a mapping, which V2 does not allow.

Yes, we noted this shortfall in its design, but we're also not
currently in danger of using anything in the upper half of the 64-bit
address space.  In fact vfio_listener_skipped_section() uses bit 63 to
filter out PCI BAR sizing operations that the QEMU PCI core erroneously
sends out to MemoryListeners.

> > So I wonder really how critical it is and whether it really would be
> > sufficient for userspace to track a high water mark for mappings.  
> 
> If the high water mark is 2^64, then you also need a low water mark to 
> enable a single call, else one call cannot span the range.  And low water
> better not be 0.  And I would rather not add code to compute either one :)

The high water marker would track the highest iova+size < 2^64.  If an
unmap occurs that covers that high water marker, the marker is reset to
the start of the unmap.  If an unmap of the full 64-bit address space
is requested, userspace performs 2 unmaps, one for the area below the
marker, one for the area above and resets the marker to zero.  The
mechanics seem pretty simple, we don't need to cover the entire used
address space in a single unmap, we only need to track a dividing line
to split the address space.  Anyway, it's been a nagging oversight in
the original design and I'm not opposed to solving it with a flag to
unburden userspace from such things.  Thanks,

Alex

> > Otherwise, I think I'm leaning towards a DMA_UNMAP flag like
> > VFIO_DMA_UNMAP_FLAG_UNMAP_ALL that would disregard the iova and size
> > fields to apply to all mappings.  Designating special values for iova or
> > size trigger extended behavior feels a bit hackish.  Thanks,  
> 
> The flag sounds best to me.
> 
> - Steve
> 
> >>>> We'd need to decide if we want to
> >>>> allow users to create mappings (mdev-only) using an initial invalid
> >>>> vaddr.      
> >>>
> >>> Maybe.  Not sure yet.    
> >>
> >> If we used the above, it almost seems strange not to allow it, but at
> >> the same time we don't really want to have different rules for
> >> different devices types.  An initially valid vaddr doesn't seem
> >> unreasonable... though we don't test it until the vendor driver tries
> >> to pin or rw pages w/o IOMMU backing.
> >>    
> >>>>>  	struct task_struct	*task;
> >>>>>  	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
> >>>>>  	unsigned long		*bitmap;
> >>>>> @@ -1080,7 +1081,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>>>>  	int ret = 0, retries = 0;
> >>>>>  	unsigned long pgshift;
> >>>>>  	dma_addr_t iova;
> >>>>> -	unsigned long size;
> >>>>> +	unsigned long size, consumed;      
> >>>>
> >>>> This could be scoped into the branch below.      
> >>>
> >>> OK.
> >>>     
> >>>>>  	mutex_lock(&iommu->lock);
> >>>>>  
> >>>>> @@ -1169,6 +1170,21 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>>>>  		if (dma->task->mm != current->mm)
> >>>>>  			break;
> >>>>>  
> >>>>> +		if (unmap->flags & VFIO_DMA_UNMAP_FLAG_SUSPEND) {
> >>>>> +			if (dma->suspended) {
> >>>>> +				ret = -EINVAL;
> >>>>> +				goto unlock;
> >>>>> +			}      
> >>>>
> >>>> This leaves us in a state where we marked some entries but not others.
> >>>> We should either unwind or... what's the actual harm in skipping these?      
> >>>
> >>> We could skip them with no ill effect.  However, it likely means the app is confused
> >>> and potentially broken, and it would be courteous to inform them so.  I found such bugs
> >>> in qemu as I was developing this feature.
> >>>
> >>> IMO unwinding does not help the app, and adds unnecessary code.  It can still leave some
> >>> ranges suspended and some not.  The safest recovery is for the app to exit, and tell the 
> >>> developer to fix the redundant suspend call.    
> >>
> >> That sounds like an entirely practical rationalization, but our
> >> standard practice is to maintain a consistent state.  If an ioctl fails
> >> is should effectively be as if the ioctl was never called, where
> >> possible.  Userspace can be broken, and potentially so broken that their
> >> best choice is to abort, but we should maintain consistent, predictable
> >> behavior.
> >>  
> >>>>> +			dma->suspended = true;
> >>>>> +			consumed = dma->iova + dma->size - iova;
> >>>>> +			if (consumed >= size)
> >>>>> +				break;
> >>>>> +			iova += consumed;
> >>>>> +			size -= consumed;
> >>>>> +			unmapped += dma->size;
> >>>>> +			continue;
> >>>>> +		}      
> >>>>
> >>>> This short-cuts the dirty bitmap flag, so we need to decide if it's
> >>>> legal to call them together or we need to prevent it... Oh, I see
> >>>> you've excluded them earlier below.
> >>>>       
> >>>>> +
> >>>>>  		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
> >>>>>  			struct vfio_iommu_type1_dma_unmap nb_unmap;
> >>>>>  
> >>>>> @@ -1307,6 +1323,7 @@ static bool vfio_iommu_iova_dma_valid(struct vfio_iommu *iommu,
> >>>>>  static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>>>>  			   struct vfio_iommu_type1_dma_map *map)
> >>>>>  {
> >>>>> +	bool resume = map->flags & VFIO_DMA_MAP_FLAG_RESUME;
> >>>>>  	dma_addr_t iova = map->iova;
> >>>>>  	unsigned long vaddr = map->vaddr;
> >>>>>  	size_t size = map->size;
> >>>>> @@ -1324,13 +1341,16 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>>>>  	if (map->flags & VFIO_DMA_MAP_FLAG_READ)
> >>>>>  		prot |= IOMMU_READ;
> >>>>>  
> >>>>> +	if ((prot && resume) || (!prot && !resume))
> >>>>> +		return -EINVAL;
> >>>>> +
> >>>>>  	mutex_lock(&iommu->lock);
> >>>>>  
> >>>>>  	pgsize = (size_t)1 << __ffs(iommu->pgsize_bitmap);
> >>>>>  
> >>>>>  	WARN_ON((pgsize - 1) & PAGE_MASK);
> >>>>>  
> >>>>> -	if (!prot || !size || (size | iova | vaddr) & (pgsize - 1)) {
> >>>>> +	if (!size || (size | iova | vaddr) & (pgsize - 1)) {
> >>>>>  		ret = -EINVAL;
> >>>>>  		goto out_unlock;
> >>>>>  	}
> >>>>> @@ -1341,7 +1361,19 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> >>>>>  		goto out_unlock;
> >>>>>  	}
> >>>>>  
> >>>>> -	if (vfio_find_dma(iommu, iova, size)) {
> >>>>> +	dma = vfio_find_dma(iommu, iova, size);
> >>>>> +	if (resume) {
> >>>>> +		if (!dma) {
> >>>>> +			ret = -ENOENT;
> >>>>> +		} else if (!dma->suspended || dma->iova != iova ||
> >>>>> +			   dma->size != size) {      
> >>>>
> >>>> Why is it necessary that the vfio_dma be suspended before being
> >>>> resumed?  Couldn't a user simply use this to change the vaddr?  Does
> >>>> that promote abusive use?      
> >>>
> >>> This would almost always be incorrect.  If the vaddr changes, then the old vaddr was already
> >>> invalidated, and there is a window where it is not OK for kernel code to use the old vaddr.
> >>> This could only be safe if the memory object is mapped at both the old vaddr and the new
> >>> vaddr concurrently, which is an unlikely use case.    
> >>
> >> Ok, it's not like the use can't make it instantaneously invalid and then
> >> replace it.
> >>  
> >>>>> +			ret = -EINVAL;
> >>>>> +		} else {
> >>>>> +			dma->vaddr = vaddr;      
> >>>>
> >>>> Seems like there's a huge opportunity for a user to create coherency
> >>>> issues here... it's their data though I guess.      
> >>>
> >>> Yes.  That's what the language in the uapi about mapping the same memory object is about.
> >>>     
> >>>>> +			dma->suspended = false;
> >>>>> +		}
> >>>>> +		goto out_unlock;
> >>>>> +	} else if (dma) {
> >>>>>  		ret = -EEXIST;
> >>>>>  		goto out_unlock;
> >>>>>  	}
> >>>>> @@ -2532,6 +2564,7 @@ static int vfio_iommu_type1_check_extension(struct vfio_iommu *iommu,
> >>>>>  	case VFIO_TYPE1_IOMMU:
> >>>>>  	case VFIO_TYPE1v2_IOMMU:
> >>>>>  	case VFIO_TYPE1_NESTING_IOMMU:
> >>>>> +	case VFIO_SUSPEND:
> >>>>>  		return 1;
> >>>>>  	case VFIO_DMA_CC_IOMMU:
> >>>>>  		if (!iommu)
> >>>>> @@ -2686,7 +2719,8 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu,
> >>>>>  {
> >>>>>  	struct vfio_iommu_type1_dma_map map;
> >>>>>  	unsigned long minsz;
> >>>>> -	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> >>>>> +	uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
> >>>>> +			VFIO_DMA_MAP_FLAG_RESUME;
> >>>>>  
> >>>>>  	minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
> >>>>>  
> >>>>> @@ -2704,6 +2738,8 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
> >>>>>  {
> >>>>>  	struct vfio_iommu_type1_dma_unmap unmap;
> >>>>>  	struct vfio_bitmap bitmap = { 0 };
> >>>>> +	uint32_t mask = VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP |
> >>>>> +			VFIO_DMA_UNMAP_FLAG_SUSPEND;
> >>>>>  	unsigned long minsz;
> >>>>>  	int ret;
> >>>>>  
> >>>>> @@ -2712,8 +2748,7 @@ static int vfio_iommu_type1_unmap_dma(struct vfio_iommu *iommu,
> >>>>>  	if (copy_from_user(&unmap, (void __user *)arg, minsz))
> >>>>>  		return -EFAULT;
> >>>>>  
> >>>>> -	if (unmap.argsz < minsz ||
> >>>>> -	    unmap.flags & ~VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)
> >>>>> +	if (unmap.argsz < minsz || unmap.flags & ~mask || unmap.flags == mask)      
> >>>>
> >>>> Maybe a short comment here to note that dirty-bimap and
> >>>> suspend/invalidate are mutually exclusive.  Probably should be
> >>>> mentioned in the uapi too.      
> >>>
> >>> Will do, for both.
> >>>     
> >>>>>  		return -EINVAL;
> >>>>>  
> >>>>>  	if (unmap.flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP) {
> >>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>>>> index 896e527..fcf7b56 100644
> >>>>> --- a/include/uapi/linux/vfio.h
> >>>>> +++ b/include/uapi/linux/vfio.h
> >>>>> @@ -46,6 +46,9 @@
> >>>>>   */
> >>>>>  #define VFIO_NOIOMMU_IOMMU		8
> >>>>>  
> >>>>> +/* Supports VFIO DMA suspend and resume */
> >>>>> +#define VFIO_SUSPEND			9
> >>>>> +
> >>>>>  /*
> >>>>>   * The IOCTL interface is designed for extensibility by embedding the
> >>>>>   * structure length (argsz) and flags into structures passed between
> >>>>> @@ -1046,12 +1049,19 @@ struct vfio_iommu_type1_info_cap_migration {
> >>>>>   *
> >>>>>   * Map process virtual addresses to IO virtual addresses using the
> >>>>>   * provided struct vfio_dma_map. Caller sets argsz. READ &/ WRITE required.
> >>>>> + *
> >>>>> + * If flags & VFIO_DMA_MAP_FLAG_RESUME, record the new base vaddr for iova, and
> >>>>> + * resume translation of host virtual addresses in the iova range.  The new
> >>>>> + * vaddr must point to the same memory object as the old vaddr, but this is not
> >>>>> + * verified.      
> >>>>
> >>>> It's hard to use "must" terminology here if we're not going to check.
> >>>> Maybe the phrasing should be something more along the lines of "should
> >>>> point to the same memory object or the user risks coherency issues
> >>>> within their virtual address space".      
> >>>
> >>> I used "must" because it is always incorrect if the object is not the same.  How about:
> >>>   The new vaddr must point to the same memory object as the old vaddr, but this is not
> >>>   verified.  Violation of this constraint may result in memory corruption within the
> >>>   host process and/or guest.    
> >>
> >> Since the "must" is not relative to the API but to the resulting
> >> behavior, perhaps something like:
> >>
> >>   In order to maintain memory consistency within the user application,
> >>   the updated vaddr must address the same memory object as originally
> >>   mapped, failure to do so will result in user memory corruption and/or
> >>   device misbehavior.
> >>
> >> Thanks,
> >> Alex
> >>  
> >>>>>  iova and size must match those in the original MAP_DMA call.
> >>>>> + * Protection is not changed, and the READ & WRITE flags must be 0.      
> >>>>
> >>>> This doesn't mention that the entry must be previously
> >>>> suspended/invalidated (if we choose to keep those semantics).  Thanks,      
> >>>
> >>> Will add, thanks.
> >>>
> >>> - Steve     
> >>>>>   */
> >>>>>  struct vfio_iommu_type1_dma_map {
> >>>>>  	__u32	argsz;
> >>>>>  	__u32	flags;
> >>>>>  #define VFIO_DMA_MAP_FLAG_READ (1 << 0)		/* readable from device */
> >>>>>  #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)	/* writable from device */
> >>>>> +#define VFIO_DMA_MAP_FLAG_RESUME (1 << 2)
> >>>>>  	__u64	vaddr;				/* Process virtual address */
> >>>>>  	__u64	iova;				/* IO virtual address */
> >>>>>  	__u64	size;				/* Size of mapping (bytes) */
> >>>>> @@ -1084,11 +1094,17 @@ struct vfio_bitmap {
> >>>>>   * indicates that the page at that offset from iova is dirty. A Bitmap of the
> >>>>>   * pages in the range of unmapped size is returned in the user-provided
> >>>>>   * vfio_bitmap.data.
> >>>>> + *
> >>>>> + * If flags & VFIO_DMA_UNMAP_FLAG_SUSPEND, do not unmap, but suspend vfio
> >>>>> + * translation of host virtual addresses in the iova range.  During suspension,
> >>>>> + * kernel threads that attempt to translate will block.  DMA to already-mapped
> >>>>> + * pages continues.
> >>>>>   */
> >>>>>  struct vfio_iommu_type1_dma_unmap {
> >>>>>  	__u32	argsz;
> >>>>>  	__u32	flags;
> >>>>>  #define VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP (1 << 0)
> >>>>> +#define VFIO_DMA_UNMAP_FLAG_SUSPEND	     (1 << 1)
> >>>>>  	__u64	iova;				/* IO virtual address */
> >>>>>  	__u64	size;				/* Size of mapping (bytes) */
> >>>>>  	__u8    data[];      
> >>>>       
> >>>     
> >>  
> >   
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 3/5] vfio: detect closed container
  2021-01-11 21:12     ` Steven Sistare
@ 2021-01-13 19:26       ` Alex Williamson
  2021-01-13 19:44         ` Steven Sistare
  2021-01-13 19:44         ` Alex Williamson
  0 siblings, 2 replies; 39+ messages in thread
From: Alex Williamson @ 2021-01-13 19:26 UTC (permalink / raw)
  To: Steven Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Mon, 11 Jan 2021 16:12:18 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 1/8/2021 2:39 PM, Alex Williamson wrote:
> > On Tue,  5 Jan 2021 07:36:51 -0800
> > Steve Sistare <steven.sistare@oracle.com> wrote:
> >   
> >> Add a function that detects if an iommu_group has a valid container.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  drivers/vfio/vfio.c  | 12 ++++++++++++
> >>  include/linux/vfio.h |  1 +
> >>  2 files changed, 13 insertions(+)
> >>
> >> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >> index 262ab0e..f89ab80 100644
> >> --- a/drivers/vfio/vfio.c
> >> +++ b/drivers/vfio/vfio.c
> >> @@ -61,6 +61,7 @@ struct vfio_container {
> >>  	struct vfio_iommu_driver	*iommu_driver;
> >>  	void				*iommu_data;
> >>  	bool				noiommu;
> >> +	bool				closed;
> >>  };
> >>  
> >>  struct vfio_unbound_dev {
> >> @@ -1223,6 +1224,7 @@ static int vfio_fops_release(struct inode *inode, struct file *filep)
> >>  
> >>  	filep->private_data = NULL;
> >>  
> >> +	container->closed = true;
> >>  	vfio_container_put(container);
> >>  
> >>  	return 0;
> >> @@ -2216,6 +2218,16 @@ void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
> >>  }
> >>  EXPORT_SYMBOL_GPL(vfio_group_set_kvm);
> >>  
> >> +bool vfio_iommu_group_contained(struct iommu_group *iommu_group)
> >> +{
> >> +	struct vfio_group *group = vfio_group_get_from_iommu(iommu_group);
> >> +	bool ret = group && group->container && !group->container->closed;
> >> +
> >> +	vfio_group_put(group);
> >> +	return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(vfio_iommu_group_contained);  
> > 
> > This seems like a pointless interface, the result is immediately stale.
> > Anything that relies on the container staying open needs to add itself
> > as a user.  We already have some interfaces for that, ex.
> > vfio_group_get_external_user().  Thanks,  
> 
> The significant part here is container->closed, which is only set if userland closes the
> container fd, which is a one-way trip -- the fd for this instance can never be opened 
> again.  The container object may still have other references, and not be destroyed yet.
> In patch 5, kernel code that waits for the RESUME ioctl calls this accessor to test if 
> the ioctl will never arrive.

Ok, that makes sense, the "contained" naming notsomuch.  You mention on
5/5 that you considered defining a new backend interface to call from
the core, but considered it overkill.  I'm not sure what you're
imagining that would be overkill.  We can pretty simply add an optional
callback to struct vfio_iommu_drivers_ops that would allow the iommu
driver to be notified when the container fd is released.  That seems
quite simple and avoids the inverted calling structure presented here.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 2/5] vfio: option to unmap all
  2021-01-11 21:09     ` Steven Sistare
@ 2021-01-13 19:41       ` Alex Williamson
  0 siblings, 0 replies; 39+ messages in thread
From: Alex Williamson @ 2021-01-13 19:41 UTC (permalink / raw)
  To: Steven Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Mon, 11 Jan 2021 16:09:18 -0500
Steven Sistare <steven.sistare@oracle.com> wrote:

> On 1/8/2021 2:35 PM, Alex Williamson wrote:
> > Hi Steve,
> > 
> > On Tue,  5 Jan 2021 07:36:50 -0800
> > Steve Sistare <steven.sistare@oracle.com> wrote:
> >   
> >> For VFIO_IOMMU_UNMAP_DMA, delete all mappings if iova=0 and size=0.  
> > 
> > Only the latter is invalid, iova=0 is not special, so does it make
> > sense to use this combination to invoke something special?  It seems
> > like it opens the door that any size less than the minimum mapping
> > granularity means something special.
> > 
> > Why not use a flag to trigger an unmap-all?  
> 
> Hi Alex, that would be fine.
> 
> > Does userspace have any means to know this is supported other than to
> > test it before creating any mappings?  
> 
> Not currently.  We could overload VFIO_SUSPEND, or define a new extension code.

Either an extension or a capability on the IOMMU_INFO return data.
If I interpret our trend on which to use, an extension seems
appropriate here as were only indicating support for a feature with no
additional data to return.

> > What's the intended interaction with retrieving the dirty bitmap during
> > an unmap-all?  
> 
> Undefined and broken if there are gaps between segments :(  Good catch, thanks.  
> I will disallow the combination of unmap-all and get-dirty-bitmap.
> 
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  drivers/vfio/vfio_iommu_type1.c | 11 ++++++++---
> >>  include/uapi/linux/vfio.h       |  3 ++-
> >>  2 files changed, 10 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index 02228d0..3dc501d 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -1079,6 +1079,8 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>  	size_t unmapped = 0, pgsize;
> >>  	int ret = 0, retries = 0;
> >>  	unsigned long pgshift;
> >> +	dma_addr_t iova;
> >> +	unsigned long size;
> >>  
> >>  	mutex_lock(&iommu->lock);
> >>  
> >> @@ -1090,7 +1092,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >>  		goto unlock;
> >>  	}
> >>  
> >> -	if (!unmap->size || unmap->size & (pgsize - 1)) {
> >> +	if ((!unmap->size && unmap->iova) || unmap->size & (pgsize - 1)) {
> >>  		ret = -EINVAL;
> >>  		goto unlock;
> >>  	}
> >> @@ -1154,8 +1156,11 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,  
> > 
> > It looks like the code just above this would have an issue if there are
> > dma mappings at iova=0.  
> 
> Are you referring to this code?
> 
>         if (iommu->v2) {
>                 dma = vfio_find_dma(iommu, unmap->iova, 1);
>                 if (dma && dma->iova != unmap->iova) {
>                         ret = -EINVAL;
> 
> Both unmap->iova and dma->iova would be 0, so I don't see the problem.

Yeah, I think I was mistaken.  Thanks,

Alex

> >>  		}
> >>  	}
> >>  
> >> -	while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
> >> -		if (!iommu->v2 && unmap->iova > dma->iova)
> >> +	iova = unmap->iova;
> >> +	size = unmap->size ? unmap->size : SIZE_MAX;  
> > 
> > AFAICT the only difference of this versus the user calling the unmap
> > with iova=0 size=SIZE_MAX is that SIZE_MAX will throw an -EINVAL due to
> > page size alignment.  If we assume there are no IOMMUs with 1 byte page
> > size, the special combination could instead be {0, SIZE_MAX}.    
> 
> Fine, but we would still need to document it specifically so the user knows that 
> the unaligned SIZE_MAX does not return EINVAL.
> 
> > Or the
> > caller could just track a high water mark for their mappings and use
> > the interface that exists.  Thanks,  
> 
> I am trying to avoid the need to modify existing code, for legacy qemu live update.
> Either a new flag or {0, SIZE_MAX} is suitable.  Which do you prefer?
> 
> - Steve
>  
> >> +
> >> +	while ((dma = vfio_find_dma(iommu, iova, size))) {
> >> +		if (!iommu->v2 && iova > dma->iova)
> >>  			break;
> >>  		/*
> >>  		 * Task with same address space who mapped this iova range is
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 9204705..896e527 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -1073,7 +1073,8 @@ struct vfio_bitmap {
> >>   * Caller sets argsz.  The actual unmapped size is returned in the size
> >>   * field.  No guarantee is made to the user that arbitrary unmaps of iova
> >>   * or size different from those used in the original mapping call will
> >> - * succeed.
> >> + * succeed.  If iova=0 and size=0, all addresses are unmapped.
> >> + *
> >>   * VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP should be set to get the dirty bitmap
> >>   * before unmapping IO virtual addresses. When this flag is set, the user must
> >>   * provide a struct vfio_bitmap in data[]. User must provide zero-allocated  
> >   
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 3/5] vfio: detect closed container
  2021-01-13 19:26       ` Alex Williamson
@ 2021-01-13 19:44         ` Steven Sistare
  2021-01-13 19:44         ` Alex Williamson
  1 sibling, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2021-01-13 19:44 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/13/2021 2:26 PM, Alex Williamson wrote:
> On Mon, 11 Jan 2021 16:12:18 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
>> On 1/8/2021 2:39 PM, Alex Williamson wrote:
>>> On Tue,  5 Jan 2021 07:36:51 -0800
>>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>>   
>>>> Add a function that detects if an iommu_group has a valid container.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>  drivers/vfio/vfio.c  | 12 ++++++++++++
>>>>  include/linux/vfio.h |  1 +
>>>>  2 files changed, 13 insertions(+)
>>>>
>>>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>>>> index 262ab0e..f89ab80 100644
>>>> --- a/drivers/vfio/vfio.c
>>>> +++ b/drivers/vfio/vfio.c
>>>> @@ -61,6 +61,7 @@ struct vfio_container {
>>>>  	struct vfio_iommu_driver	*iommu_driver;
>>>>  	void				*iommu_data;
>>>>  	bool				noiommu;
>>>> +	bool				closed;
>>>>  };
>>>>  
>>>>  struct vfio_unbound_dev {
>>>> @@ -1223,6 +1224,7 @@ static int vfio_fops_release(struct inode *inode, struct file *filep)
>>>>  
>>>>  	filep->private_data = NULL;
>>>>  
>>>> +	container->closed = true;
>>>>  	vfio_container_put(container);
>>>>  
>>>>  	return 0;
>>>> @@ -2216,6 +2218,16 @@ void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(vfio_group_set_kvm);
>>>>  
>>>> +bool vfio_iommu_group_contained(struct iommu_group *iommu_group)
>>>> +{
>>>> +	struct vfio_group *group = vfio_group_get_from_iommu(iommu_group);
>>>> +	bool ret = group && group->container && !group->container->closed;
>>>> +
>>>> +	vfio_group_put(group);
>>>> +	return ret;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(vfio_iommu_group_contained);  
>>>
>>> This seems like a pointless interface, the result is immediately stale.
>>> Anything that relies on the container staying open needs to add itself
>>> as a user.  We already have some interfaces for that, ex.
>>> vfio_group_get_external_user().  Thanks,  
>>
>> The significant part here is container->closed, which is only set if userland closes the
>> container fd, which is a one-way trip -- the fd for this instance can never be opened 
>> again.  The container object may still have other references, and not be destroyed yet.
>> In patch 5, kernel code that waits for the RESUME ioctl calls this accessor to test if 
>> the ioctl will never arrive.
> 
> Ok, that makes sense, the "contained" naming notsomuch.  You mention on
> 5/5 that you considered defining a new backend interface to call from
> the core, but considered it overkill.  I'm not sure what you're
> imagining that would be overkill.  We can pretty simply add an optional
> callback to struct vfio_iommu_drivers_ops that would allow the iommu
> driver to be notified when the container fd is released.  That seems
> quite simple and avoids the inverted calling structure presented here.
> Thanks,

OK.

Thanks very much for all your comments.  Off to write PATCH V2 ...

- Steve

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 3/5] vfio: detect closed container
  2021-01-13 19:26       ` Alex Williamson
  2021-01-13 19:44         ` Steven Sistare
@ 2021-01-13 19:44         ` Alex Williamson
  1 sibling, 0 replies; 39+ messages in thread
From: Alex Williamson @ 2021-01-13 19:44 UTC (permalink / raw)
  To: Steven Sistare; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On Wed, 13 Jan 2021 12:26:09 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Mon, 11 Jan 2021 16:12:18 -0500
> Steven Sistare <steven.sistare@oracle.com> wrote:
> 
> > On 1/8/2021 2:39 PM, Alex Williamson wrote:  
> > > On Tue,  5 Jan 2021 07:36:51 -0800
> > > Steve Sistare <steven.sistare@oracle.com> wrote:
> > >     
> > >> Add a function that detects if an iommu_group has a valid container.
> > >>
> > >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> > >> ---
> > >>  drivers/vfio/vfio.c  | 12 ++++++++++++
> > >>  include/linux/vfio.h |  1 +
> > >>  2 files changed, 13 insertions(+)
> > >>
> > >> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > >> index 262ab0e..f89ab80 100644
> > >> --- a/drivers/vfio/vfio.c
> > >> +++ b/drivers/vfio/vfio.c
> > >> @@ -61,6 +61,7 @@ struct vfio_container {
> > >>  	struct vfio_iommu_driver	*iommu_driver;
> > >>  	void				*iommu_data;
> > >>  	bool				noiommu;
> > >> +	bool				closed;
> > >>  };
> > >>  
> > >>  struct vfio_unbound_dev {
> > >> @@ -1223,6 +1224,7 @@ static int vfio_fops_release(struct inode *inode, struct file *filep)
> > >>  
> > >>  	filep->private_data = NULL;
> > >>  
> > >> +	container->closed = true;
> > >>  	vfio_container_put(container);
> > >>  
> > >>  	return 0;
> > >> @@ -2216,6 +2218,16 @@ void vfio_group_set_kvm(struct vfio_group *group, struct kvm *kvm)
> > >>  }
> > >>  EXPORT_SYMBOL_GPL(vfio_group_set_kvm);
> > >>  
> > >> +bool vfio_iommu_group_contained(struct iommu_group *iommu_group)
> > >> +{
> > >> +	struct vfio_group *group = vfio_group_get_from_iommu(iommu_group);
> > >> +	bool ret = group && group->container && !group->container->closed;
> > >> +
> > >> +	vfio_group_put(group);
> > >> +	return ret;
> > >> +}
> > >> +EXPORT_SYMBOL_GPL(vfio_iommu_group_contained);    
> > > 
> > > This seems like a pointless interface, the result is immediately stale.
> > > Anything that relies on the container staying open needs to add itself
> > > as a user.  We already have some interfaces for that, ex.
> > > vfio_group_get_external_user().  Thanks,    
> > 
> > The significant part here is container->closed, which is only set if userland closes the
> > container fd, which is a one-way trip -- the fd for this instance can never be opened 
> > again.  The container object may still have other references, and not be destroyed yet.
> > In patch 5, kernel code that waits for the RESUME ioctl calls this accessor to test if 
> > the ioctl will never arrive.  
> 
> Ok, that makes sense, the "contained" naming notsomuch.  You mention on
> 5/5 that you considered defining a new backend interface to call from
> the core, but considered it overkill.  I'm not sure what you're
> imagining that would be overkill.  We can pretty simply add an optional
> callback to struct vfio_iommu_drivers_ops that would allow the iommu
> driver to be notified when the container fd is released.  That seems
> quite simple and avoids the inverted calling structure presented here.

A callback into the iommu backend would also allow you to use a
waitqueue and wake_up for the blocking mechanism while vaddr is
invalid.  Seems like an improvement over the polling model.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH V1 4/5] vfio: VA suspend interface
  2021-01-11 21:15     ` Steven Sistare
  2021-01-12 15:47       ` Jason Zeng
  2021-01-12 22:47       ` Alex Williamson
@ 2021-01-19 20:11       ` Steven Sistare
  2 siblings, 0 replies; 39+ messages in thread
From: Steven Sistare @ 2021-01-19 20:11 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, Cornelia Huck, Kirti Wankhede

On 1/11/2021 4:15 PM, Steven Sistare wrote:
> On 1/8/2021 4:15 PM, Alex Williamson wrote:
>> On Tue,  5 Jan 2021 07:36:52 -0800
>> Steve Sistare <steven.sistare@oracle.com> wrote:
>>
>>> Add interfaces that allow the underlying memory object of an iova
>>> range to be mapped to a new host virtual address in the host process:
>>>
>>>   - VFIO_DMA_UNMAP_FLAG_SUSPEND for VFIO_IOMMU_UNMAP_DMA
>>>   - VFIO_DMA_MAP_FLAG_RESUME flag for VFIO_IOMMU_MAP_DMA
>>>   - VFIO_SUSPEND extension for VFIO_CHECK_EXTENSION
>>
>> Suspend and Resume can imply many things other than what's done here.
>> Should these be something more akin to INVALIDATE_VADDR and
>> REPLACE_VADDR?
> 
> Agreed.  I suspected we would discuss the names.  Some possibilities:
> 
> INVALIDATE_VADDR  REPLACE_VADDR
> INV_VADDR         SET_VADDR
> CLEAR_VADDR       SET_VADDR
> SUSPEND_VADDR     RESUME_VADDR
> 
>>> The suspend interface blocks vfio translation of host virtual
>>> addresses in a range, but DMA to already-mapped pages continues.
>>> The resume interface records the new base VA and resumes translation.
>>> See comments in uapi/linux/vfio.h for more details.
>>>
>>> This is a partial implementation.  Blocking is added in the next patch.
>>>
>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>> ---
>>>  drivers/vfio/vfio_iommu_type1.c | 47 +++++++++++++++++++++++++++++++++++------
>>>  include/uapi/linux/vfio.h       | 16 ++++++++++++++
>>>  2 files changed, 57 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 3dc501d..2c164a6 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -92,6 +92,7 @@ struct vfio_dma {
>>>  	int			prot;		/* IOMMU_READ/WRITE */
>>>  	bool			iommu_mapped;
>>>  	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
>>> +	bool			suspended;
>>
>> Is there a value we could use for vfio_dma.vaddr that would always be
>> considered invalid, ex. ULONG_MAX?  
> 
> Yes, that could replace the suspend flag.  That, plus changing the language from suspend
> to invalidate, will probably yield equally understandable code.  I'll try it.

Hi Alex, I was not able to implement this suggestion in V2, because the old vaddr must be 
be preserved throughout the loop in vfio_dma_do_unmap so that unwind can recover it.  I 
renamed the suspended flag to vaddr_valid.

- Steve


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2021-01-19 20:13 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-05 15:36 [PATCH V1 0/5] vfio virtual address update Steve Sistare
2021-01-05 15:36 ` [PATCH V1 1/5] vfio: maintain dma_list order Steve Sistare
2021-01-05 18:48   ` kernel test robot
2021-01-05 18:48     ` kernel test robot
2021-01-06  0:02   ` Alex Williamson
2021-01-06 14:50     ` Steven Sistare
2021-01-05 15:36 ` [PATCH V1 2/5] vfio: option to unmap all Steve Sistare
2021-01-08 19:35   ` Alex Williamson
2021-01-11 21:09     ` Steven Sistare
2021-01-13 19:41       ` Alex Williamson
2021-01-05 15:36 ` [PATCH V1 3/5] vfio: detect closed container Steve Sistare
2021-01-08 19:39   ` Alex Williamson
2021-01-11 21:12     ` Steven Sistare
2021-01-13 19:26       ` Alex Williamson
2021-01-13 19:44         ` Steven Sistare
2021-01-13 19:44         ` Alex Williamson
2021-01-05 15:36 ` [PATCH V1 4/5] vfio: VA suspend interface Steve Sistare
2021-01-08 21:15   ` Alex Williamson
2021-01-11 21:15     ` Steven Sistare
2021-01-12 15:47       ` Jason Zeng
2021-01-12 22:09         ` Alex Williamson
2021-01-13  4:37           ` Jason Zeng
2021-01-12 22:47       ` Alex Williamson
2021-01-13  4:10         ` Alex Williamson
2021-01-13 18:02           ` Steven Sistare
2021-01-13 18:34             ` Alex Williamson
2021-01-13 18:01         ` Steven Sistare
2021-01-19 20:11       ` Steven Sistare
2021-01-05 15:36 ` [PATCH V1 5/5] vfio: block during VA suspend Steve Sistare
2021-01-05 18:08   ` kernel test robot
2021-01-05 18:08     ` kernel test robot
2021-01-05 20:03   ` kernel test robot
2021-01-05 20:03     ` kernel test robot
2021-01-07 15:17     ` Steven Sistare
2021-01-05 20:03   ` [RFC PATCH] vfio: vfio_vaddr_valid() can be static kernel test robot
2021-01-05 20:03     ` kernel test robot
2021-01-08 21:32   ` [PATCH V1 5/5] vfio: block during VA suspend Alex Williamson
2021-01-11 21:16     ` Steven Sistare
2021-01-12 21:52 ` [PATCH V1 0/5] vfio virtual address update Steven Sistare

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.