kvm.vger.kernel.org archive mirror
* [PATCH 0/3] vfio: Device memory DMA mapping improvements
@ 2021-02-12 19:27 Alex Williamson
  2021-02-12 19:27 ` [PATCH 1/3] vfio: Introduce vma ops registration and notifier Alex Williamson
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Alex Williamson @ 2021-02-12 19:27 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

This series intends to improve some long standing issues with mapping
device memory through the vfio IOMMU interface (ie. P2P DMA mappings).
Unlike mapping DMA to RAM, we can't pin device memory, nor is it
always accessible.  We attempt to tackle this (predominantly the
first issue in this iteration) by creating a registration and
notification interface through vfio-core, between the IOMMU backend
and the bus driver.  This allows us to do things like automatically
remove a DMA mapping to a device if it's closed by the user.  We also
keep references to the device container such that it remains isolated
while this mapping exists.

Unlike my previous attempt[1], this version works across containers.
For example if a user has device1 with IOMMU context in container1
and device2 in container2, a mapping of device2 memory into container1
IOMMU context would be removed when device2 is released.

What I don't tackle here is when device memory is disabled, such as
for a PCI device when the command register memory bit is cleared or
while the device is in reset.  Ideally it seems like it might be
nice to have IOMMU API interfaces that could remove r/w permissions
from the IOTLB entry w/o removing it entirely, but I'm also unsure
of the ultimate value in modifying the IOTLB entries at this point.

In the PCI example again, I'd expect a DMA to disabled or unavailable
device memory to get an Unsupported Request response.  If we play
with the IOTLB mapping, we might change this to an IOMMU fault for
either page permissions or page not present, depending on how we
choose to invalidate that entry.  However, it seems that a system that
escalates an UR error to fatal, through things like firmware first
handling, is just as likely to also make the IOMMU fault fatal.  Are
there cases where we expect otherwise, and if not is there value to
tracking device memory enable state to that extent in the IOMMU?

Jason, I'm also curious if this scratches your itch relative to your
suggestion to solve this with dma-bufs, and if that's still your
preference, I'd love an outline of how to accomplish the same with that
method.

Thanks,
Alex

[1]https://lore.kernel.org/kvm/158947414729.12590.4345248265094886807.stgit@gimli.home/

---

Alex Williamson (3):
      vfio: Introduce vma ops registration and notifier
      vfio/pci: Implement vm_ops registration
      vfio/type1: Implement vma registration and restriction


 drivers/vfio/pci/Kconfig            |    1 
 drivers/vfio/pci/vfio_pci.c         |   87 ++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |    1 
 drivers/vfio/vfio.c                 |  120 ++++++++++++++++++++++
 drivers/vfio/vfio_iommu_type1.c     |  192 ++++++++++++++++++++++++++++-------
 include/linux/vfio.h                |   20 ++++
 6 files changed, 384 insertions(+), 37 deletions(-)



* [PATCH 1/3] vfio: Introduce vma ops registration and notifier
  2021-02-12 19:27 [PATCH 0/3] vfio: Device memory DMA mapping improvements Alex Williamson
@ 2021-02-12 19:27 ` Alex Williamson
  2021-02-12 21:20   ` Jason Gunthorpe
  2021-02-12 19:27 ` [PATCH 2/3] vfio/pci: Implement vm_ops registration Alex Williamson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Alex Williamson @ 2021-02-12 19:27 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

Create an interface through vfio-core where a vfio bus driver (ex.
vfio-pci) can register the vm_operations_struct it uses to map device
memory, along with a set of registration callbacks.  This allows
vfio-core to expose interfaces for IOMMU backends to match a
vm_area_struct to a bus driver and register a notifier for relevant
changes to the device mapping.  For now we define only a notifier
action for closing the device.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio.c  |  120 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/vfio.h |   20 ++++++++
 2 files changed, 140 insertions(+)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 38779e6fd80c..568f5e37a95f 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -47,6 +47,8 @@ static struct vfio {
 	struct cdev			group_cdev;
 	dev_t				group_devt;
 	wait_queue_head_t		release_q;
+	struct list_head		vm_ops_list;
+	struct mutex			vm_ops_lock;
 } vfio;
 
 struct vfio_iommu_driver {
@@ -2354,6 +2356,121 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
 }
 EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
 
+struct vfio_vma_ops {
+	const struct vm_operations_struct	*vm_ops;
+	vfio_register_vma_nb_t			*reg_fn;
+	vfio_unregister_vma_nb_t		*unreg_fn;
+	struct list_head			next;
+};
+
+int vfio_register_vma_ops(const struct vm_operations_struct *vm_ops,
+			  vfio_register_vma_nb_t *reg_fn,
+			  vfio_unregister_vma_nb_t *unreg_fn)
+{
+	struct vfio_vma_ops *ops;
+	int ret = 0;
+
+	mutex_lock(&vfio.vm_ops_lock);
+	list_for_each_entry(ops, &vfio.vm_ops_list, next) {
+		if (ops->vm_ops == vm_ops) {
+			ret = -EEXIST;
+			goto out;
+		}
+	}
+
+	ops = kmalloc(sizeof(*ops), GFP_KERNEL);
+	if (!ops) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ops->vm_ops = vm_ops;
+	ops->reg_fn = reg_fn;
+	ops->unreg_fn = unreg_fn;
+
+	list_add(&ops->next, &vfio.vm_ops_list);
+out:
+	mutex_unlock(&vfio.vm_ops_lock);
+	return ret;
+
+}
+EXPORT_SYMBOL_GPL(vfio_register_vma_ops);
+
+void vfio_unregister_vma_ops(const struct vm_operations_struct *vm_ops)
+{
+	struct vfio_vma_ops *ops;
+
+	mutex_lock(&vfio.vm_ops_lock);
+	list_for_each_entry(ops, &vfio.vm_ops_list, next) {
+		if (ops->vm_ops == vm_ops) {
+			list_del(&ops->next);
+			kfree(ops);
+			break;
+		}
+	}
+	mutex_unlock(&vfio.vm_ops_lock);
+}
+EXPORT_SYMBOL_GPL(vfio_unregister_vma_ops);
+
+struct vfio_vma_obj {
+	const struct vm_operations_struct *vm_ops;
+	void *opaque;
+};
+
+void *vfio_register_vma_nb(struct vm_area_struct *vma,
+			   struct notifier_block *nb)
+{
+	struct vfio_vma_ops *ops;
+	struct vfio_vma_obj *obj = ERR_PTR(-ENODEV);
+
+	mutex_lock(&vfio.vm_ops_lock);
+	list_for_each_entry(ops, &vfio.vm_ops_list, next) {
+		if (ops->vm_ops == vma->vm_ops) {
+			void *opaque;
+
+			obj = kmalloc(sizeof(*obj), GFP_KERNEL);
+			if (!obj) {
+				obj = ERR_PTR(-ENOMEM);
+				break;
+			}
+
+			obj->vm_ops = ops->vm_ops;
+
+			opaque = ops->reg_fn(vma, nb);
+			if (IS_ERR(opaque)) {
+				kfree(obj);
+				obj = opaque;
+			} else {
+				obj->opaque = opaque;
+			}
+
+			break;
+		}
+	}
+	mutex_unlock(&vfio.vm_ops_lock);
+
+	return obj;
+}
+EXPORT_SYMBOL_GPL(vfio_register_vma_nb);
+
+void vfio_unregister_vma_nb(void *opaque)
+{
+	struct vfio_vma_obj *obj = opaque;
+	struct vfio_vma_ops *ops;
+
+	mutex_lock(&vfio.vm_ops_lock);
+	list_for_each_entry(ops, &vfio.vm_ops_list, next) {
+		if (ops->vm_ops == obj->vm_ops) {
+			ops->unreg_fn(obj->opaque);
+			break;
+		}
+	}
+	mutex_unlock(&vfio.vm_ops_lock);
+
+	kfree(obj);
+}
+EXPORT_SYMBOL_GPL(vfio_unregister_vma_nb);
+
 /**
  * Module/class support
  */
@@ -2377,8 +2494,10 @@ static int __init vfio_init(void)
 	idr_init(&vfio.group_idr);
 	mutex_init(&vfio.group_lock);
 	mutex_init(&vfio.iommu_drivers_lock);
+	mutex_init(&vfio.vm_ops_lock);
 	INIT_LIST_HEAD(&vfio.group_list);
 	INIT_LIST_HEAD(&vfio.iommu_drivers_list);
+	INIT_LIST_HEAD(&vfio.vm_ops_list);
 	init_waitqueue_head(&vfio.release_q);
 
 	ret = misc_register(&vfio_dev);
@@ -2425,6 +2544,7 @@ static int __init vfio_init(void)
 static void __exit vfio_cleanup(void)
 {
 	WARN_ON(!list_empty(&vfio.group_list));
+	WARN_ON(!list_empty(&vfio.vm_ops_list));
 
 #ifdef CONFIG_VFIO_NOIOMMU
 	vfio_unregister_iommu_driver(&vfio_noiommu_ops);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index b7e18bde5aa8..1b5c6179d869 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -137,6 +137,26 @@ extern int vfio_dma_rw(struct vfio_group *group, dma_addr_t user_iova,
 
 extern struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group);
 
+typedef void *(vfio_register_vma_nb_t)(struct vm_area_struct *vma,
+				       struct notifier_block *nb);
+typedef void (vfio_unregister_vma_nb_t)(void *opaque);
+
+extern int vfio_register_vma_ops(const struct vm_operations_struct *vm_ops,
+				 vfio_register_vma_nb_t *reg_fn,
+				 vfio_unregister_vma_nb_t *unreg_fn);
+
+extern void vfio_unregister_vma_ops(const struct vm_operations_struct *vm_ops);
+
+enum vfio_vma_notify_type {
+	VFIO_VMA_NOTIFY_CLOSE = 0,
+};
+
+extern void *vfio_register_vma_nb(struct vm_area_struct *vma,
+				  struct notifier_block *nb);
+
+extern void vfio_unregister_vma_nb(void *opaque);
+
+
 /* each type has independent events */
 enum vfio_notify_type {
 	VFIO_IOMMU_NOTIFY = 0,



* [PATCH 2/3] vfio/pci: Implement vm_ops registration
  2021-02-12 19:27 [PATCH 0/3] vfio: Device memory DMA mapping improvements Alex Williamson
  2021-02-12 19:27 ` [PATCH 1/3] vfio: Introduce vma ops registration and notifier Alex Williamson
@ 2021-02-12 19:27 ` Alex Williamson
  2021-02-12 19:28 ` [PATCH 3/3] vfio/type1: Implement vma registration and restriction Alex Williamson
  2021-02-12 20:57 ` [PATCH 0/3] vfio: Device memory DMA mapping improvements Jason Gunthorpe
  3 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2021-02-12 19:27 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

The vfio-pci vfio bus driver implements a vm_operations_struct for
managing mmaps to device BARs, therefore given a vma with matching
vm_ops we can create a reference using the existing vfio external user
interfaces and register the provided notifier to receive callbacks
relative to the device.  The close notifier fires when the device is
released rather than when the vma is closed, to avoid possibly
breaking userspace (ie. mmap -> dma map -> munmap is currently
allowed and maintains the DMA mapping to the device).

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/pci/Kconfig            |    1 
 drivers/vfio/pci/vfio_pci.c         |   87 +++++++++++++++++++++++++++++++++++
 drivers/vfio/pci/vfio_pci_private.h |    1 
 3 files changed, 89 insertions(+)

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 40a223381ab6..4e3059d6206c 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -4,6 +4,7 @@ config VFIO_PCI
 	depends on VFIO && PCI && EVENTFD
 	select VFIO_VIRQFD
 	select IRQ_BYPASS_MANAGER
+	select SRCU
 	help
 	  Support for the PCI VFIO bus driver.  This is required to make
 	  use of PCI drivers using the VFIO framework.
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 706de3ef94bb..dcbdaece80f8 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -560,6 +560,8 @@ static void vfio_pci_release(void *device_data)
 	mutex_lock(&vdev->reflck->lock);
 
 	if (!(--vdev->refcnt)) {
+		srcu_notifier_call_chain(&vdev->vma_notifier,
+					 VFIO_VMA_NOTIFY_CLOSE, NULL);
 		vfio_pci_vf_token_user_add(vdev, -1);
 		vfio_spapr_pci_eeh_release(vdev->pdev);
 		vfio_pci_disable(vdev);
@@ -1969,6 +1971,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	mutex_init(&vdev->vma_lock);
 	INIT_LIST_HEAD(&vdev->vma_list);
 	init_rwsem(&vdev->memory_lock);
+	srcu_init_notifier_head(&vdev->vma_notifier);
 
 	ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
 	if (ret)
@@ -2362,6 +2365,7 @@ static void vfio_pci_try_bus_reset(struct vfio_pci_device *vdev)
 
 static void __exit vfio_pci_cleanup(void)
 {
+	vfio_unregister_vma_ops(&vfio_pci_mmap_ops);
 	pci_unregister_driver(&vfio_pci_driver);
 	vfio_pci_uninit_perm_bits();
 }
@@ -2407,6 +2411,81 @@ static void __init vfio_pci_fill_ids(void)
 	}
 }
 
+struct vfio_pci_vma_obj {
+	struct vfio_pci_device *vdev;
+	struct vfio_group *group;
+	struct vfio_device *device;
+	struct notifier_block *nb;
+};
+
+static void *vfio_pci_register_vma_notifier(struct vm_area_struct *vma,
+					    struct notifier_block *nb)
+{
+	struct vfio_pci_device *vdev = vma->vm_private_data;
+	struct vfio_pci_vma_obj *obj;
+	struct vfio_group *group;
+	struct vfio_device *device;
+	int ret;
+
+	if (!vdev || vma->vm_ops != &vfio_pci_mmap_ops)
+		return ERR_PTR(-EINVAL);
+
+	obj = kmalloc(sizeof(*obj), GFP_KERNEL);
+	if (!obj)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * Get a group and container reference, this prevents the container
+	 * from being torn down while this vma is mapped, ie. device stays
+	 * isolated.
+	 *
+	 * NB. The container must be torn down on device close without
+	 * explicit unmaps, therefore we must notify on close.
+	 */
+	group = vfio_group_get_external_user_from_dev(&vdev->pdev->dev);
+	if (IS_ERR(group)) {
+		kfree(obj);
+		return group;
+	}
+
+	/* Also need device reference to prevent unbind */
+	device = vfio_device_get_from_dev(&vdev->pdev->dev);
+	if (IS_ERR(device)) {
+		vfio_group_put_external_user(group);
+		kfree(obj);
+		return device;
+	}
+
+	/*
+	 * Use the srcu notifier chain variant to avoid AB-BA locking issues
+	 * with the caller, ex. iommu->lock vs nh->rwsem
+	 */
+	ret = srcu_notifier_chain_register(&vdev->vma_notifier, nb);
+	if (ret) {
+		vfio_device_put(device);
+		vfio_group_put_external_user(group);
+		kfree(obj);
+		return ERR_PTR(ret);
+	}
+
+	obj->vdev = vdev;
+	obj->group = group;
+	obj->device = device;
+	obj->nb = nb;
+
+	return obj;
+}
+
+static void vfio_pci_unregister_vma_notifier(void *opaque)
+{
+	struct vfio_pci_vma_obj *obj = opaque;
+
+	srcu_notifier_chain_unregister(&obj->vdev->vma_notifier, obj->nb);
+	vfio_device_put(obj->device);
+	vfio_group_put_external_user(obj->group);
+	kfree(obj);
+}
+
 static int __init vfio_pci_init(void)
 {
 	int ret;
@@ -2421,6 +2500,12 @@ static int __init vfio_pci_init(void)
 	if (ret)
 		goto out_driver;
 
+	ret = vfio_register_vma_ops(&vfio_pci_mmap_ops,
+				    vfio_pci_register_vma_notifier,
+				    vfio_pci_unregister_vma_notifier);
+	if (ret)
+		goto out_vma;
+
 	vfio_pci_fill_ids();
 
 	if (disable_denylist)
@@ -2428,6 +2513,8 @@ static int __init vfio_pci_init(void)
 
 	return 0;
 
+out_vma:
+	pci_unregister_driver(&vfio_pci_driver);
 out_driver:
 	vfio_pci_uninit_perm_bits();
 	return ret;
diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 5c90e560c5c7..12c61d099d1a 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -142,6 +142,7 @@ struct vfio_pci_device {
 	struct mutex		vma_lock;
 	struct list_head	vma_list;
 	struct rw_semaphore	memory_lock;
+	struct srcu_notifier_head	vma_notifier;
 };
 
 #define is_intx(vdev) (vdev->irq_type == VFIO_PCI_INTX_IRQ_INDEX)



* [PATCH 3/3] vfio/type1: Implement vma registration and restriction
  2021-02-12 19:27 [PATCH 0/3] vfio: Device memory DMA mapping improvements Alex Williamson
  2021-02-12 19:27 ` [PATCH 1/3] vfio: Introduce vma ops registration and notifier Alex Williamson
  2021-02-12 19:27 ` [PATCH 2/3] vfio/pci: Implement vm_ops registration Alex Williamson
@ 2021-02-12 19:28 ` Alex Williamson
  2021-02-12 22:47   ` kernel test robot
  2021-02-12 20:57 ` [PATCH 0/3] vfio: Device memory DMA mapping improvements Jason Gunthorpe
  3 siblings, 1 reply; 11+ messages in thread
From: Alex Williamson @ 2021-02-12 19:28 UTC (permalink / raw)
  To: alex.williamson; +Cc: cohuck, kvm, linux-kernel, jgg, peterx

Use the new vma registration interface to configure a notifier for DMA
mappings to device memory.  On close notification, remove the mapping.
This allows us to implement a new default policy to restrict PFNMAP
mappings to only those vmas whose vm_ops is registered with vfio-core
to provide this support.  A new module option is provided to opt out
should this conflict with existing use cases.
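For reference, opting out of the new policy would use the standard
module parameter mechanisms, e.g. (assuming the type1 backend module
name vfio_iommu_type1 and the strict_mmio_maps parameter added below):

```
# /etc/modprobe.d/vfio.conf -- disable the strict PFNMAP restriction
options vfio_iommu_type1 strict_mmio_maps=0
```

Since the parameter is declared with 0644 permissions, it should also
be writable at runtime via
/sys/module/vfio_iommu_type1/parameters/strict_mmio_maps.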

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c |  192 +++++++++++++++++++++++++++++++--------
 1 file changed, 155 insertions(+), 37 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 90715413c3d9..137aab2a00fd 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -62,6 +62,11 @@ module_param_named(dma_entry_limit, dma_entry_limit, uint, 0644);
 MODULE_PARM_DESC(dma_entry_limit,
 		 "Maximum number of user DMA mappings per container (65535).");
 
+static bool strict_mmio_maps = true;
+module_param_named(strict_mmio_maps, strict_mmio_maps, bool, 0644);
+MODULE_PARM_DESC(strict_mmio_maps,
+		 "Restrict to safe DMA mappings of device memory (true).");
+
 struct vfio_iommu {
 	struct list_head	domain_list;
 	struct list_head	iova_list;
@@ -89,6 +94,13 @@ struct vfio_domain {
 	bool			fgsp;		/* Fine-grained super pages */
 };
 
+struct pfnmap_obj {
+	struct notifier_block	nb;
+	struct work_struct	work;
+	struct vfio_iommu	*iommu;
+	void			*opaque;
+};
+
 struct vfio_dma {
 	struct rb_node		node;
 	dma_addr_t		iova;		/* Device address */
@@ -101,6 +113,7 @@ struct vfio_dma {
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
+	struct pfnmap_obj	*pfnmap;
 };
 
 struct vfio_group {
@@ -495,15 +508,108 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 	return ret;
 }
 
-static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
-			 int prot, unsigned long *pfn)
+static void unregister_vma_bg(struct work_struct *work)
+{
+	struct pfnmap_obj *pfnmap = container_of(work, struct pfnmap_obj, work);
+
+	vfio_unregister_vma_nb(pfnmap->opaque);
+	kfree(pfnmap);
+}
+
+struct vfio_dma *pfnmap_find_dma(struct pfnmap_obj *pfnmap)
+{
+	struct rb_node *n;
+
+	for (n = rb_first(&pfnmap->iommu->dma_list); n; n = rb_next(n)) {
+		struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+
+		if (dma->pfnmap == pfnmap)
+			return dma;
+	}
+
+	return NULL;
+}
+
+/* Return 1 if iommu->lock dropped and notified, 0 if done */
+static int unmap_dma_pfn_list(struct vfio_iommu *iommu, struct vfio_dma *dma,
+			      struct vfio_dma **dma_last, int *retries)
+{
+	if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
+		struct vfio_iommu_type1_dma_unmap nb_unmap;
+
+		if (*dma_last == dma) {
+			BUG_ON(++(*retries) > 10);
+		} else {
+			*dma_last = dma;
+			*retries = 0;
+		}
+
+		nb_unmap.iova = dma->iova;
+		nb_unmap.size = dma->size;
+
+		/*
+		 * Notify anyone (mdev vendor drivers) to invalidate and
+		 * unmap iovas within the range we're about to unmap.
+		 * Vendor drivers MUST unpin pages in response to an
+		 * invalidation.
+		 */
+		mutex_unlock(&iommu->lock);
+		blocking_notifier_call_chain(&iommu->notifier,
+					    VFIO_IOMMU_NOTIFY_DMA_UNMAP,
+					    &nb_unmap);
+		return 1;
+	}
+
+	return 0;
+}
+
+static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma);
+
+static int vfio_vma_nb_cb(struct notifier_block *nb,
+			  unsigned long action, void *unused)
+{
+	struct pfnmap_obj *pfnmap = container_of(nb, struct pfnmap_obj, nb);
+
+	switch (action) {
+	case VFIO_VMA_NOTIFY_CLOSE:
+	{
+		struct vfio_dma *dma, *dma_last = NULL;
+		int retries = 0;
+
+again:
+		mutex_lock(&pfnmap->iommu->lock);
+		dma = pfnmap_find_dma(pfnmap);
+		if (dma) {
+			if (unmap_dma_pfn_list(pfnmap->iommu, dma,
+					       &dma_last, &retries))
+				goto again;
+
+			dma->pfnmap = NULL;
+			vfio_remove_dma(pfnmap->iommu, dma);
+		}
+		mutex_unlock(&pfnmap->iommu->lock);
+
+		/* Cannot unregister notifier from callback chain */
+		INIT_WORK(&pfnmap->work, unregister_vma_bg);
+		schedule_work(&pfnmap->work);
+
+		break;
+	}
+	}
+
+	return NOTIFY_OK;
+}
+
+static int vaddr_get_pfn(struct vfio_iommu *iommu, struct vfio_dma *dma,
+			 struct mm_struct *mm, unsigned long vaddr,
+			 unsigned long *pfn)
 {
 	struct page *page[1];
 	struct vm_area_struct *vma;
 	unsigned int flags = 0;
 	int ret;
 
-	if (prot & IOMMU_WRITE)
+	if (dma->prot & IOMMU_WRITE)
 		flags |= FOLL_WRITE;
 
 	mmap_read_lock(mm);
@@ -521,12 +627,40 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 	vma = find_vma_intersection(mm, vaddr, vaddr + 1);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
-		ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
+		ret = follow_fault_pfn(vma, mm, vaddr, pfn,
+				       dma->prot & IOMMU_WRITE);
 		if (ret == -EAGAIN)
 			goto retry;
 
-		if (!ret && !is_invalid_reserved_pfn(*pfn))
+		if (!ret && !is_invalid_reserved_pfn(*pfn)) {
 			ret = -EFAULT;
+			goto done;
+		}
+
+		if (!dma->pfnmap) {
+			struct pfnmap_obj *pfnmap;
+			void *opaque;
+
+			pfnmap = kzalloc(sizeof(*pfnmap), GFP_KERNEL);
+			if (!pfnmap) {
+				ret = -ENOMEM;
+				goto done;
+			}
+
+			pfnmap->iommu = iommu;
+			pfnmap->nb.notifier_call = vfio_vma_nb_cb;
+			opaque = vfio_register_vma_nb(vma, &pfnmap->nb);
+			if (IS_ERR(opaque)) {
+				kfree(pfnmap);
+				if (strict_mmio_maps) {
+					ret = PTR_ERR(opaque);
+					goto done;
+				}
+			} else {
+				pfnmap->opaque = opaque;
+				dma->pfnmap = pfnmap;
+			}
+		}
 	}
 done:
 	mmap_read_unlock(mm);
@@ -593,7 +727,8 @@ static int vfio_wait_all_valid(struct vfio_iommu *iommu)
  * the iommu can only map chunks of consecutive pfns anyway, so get the
  * first page and all consecutive pages with the same locking.
  */
-static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
+static long vfio_pin_pages_remote(struct vfio_iommu *iommu,
+				  struct vfio_dma *dma, unsigned long vaddr,
 				  long npage, unsigned long *pfn_base,
 				  unsigned long limit)
 {
@@ -606,7 +741,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	if (!current->mm)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, pfn_base);
+	ret = vaddr_get_pfn(iommu, dma, current->mm, vaddr, pfn_base);
 	if (ret)
 		return ret;
 
@@ -633,7 +768,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	/* Lock all the consecutive pages from pfn_base */
 	for (vaddr += PAGE_SIZE, iova += PAGE_SIZE; pinned < npage;
 	     pinned++, vaddr += PAGE_SIZE, iova += PAGE_SIZE) {
-		ret = vaddr_get_pfn(current->mm, vaddr, dma->prot, &pfn);
+		ret = vaddr_get_pfn(iommu, dma, current->mm, vaddr, &pfn);
 		if (ret)
 			break;
 
@@ -693,7 +828,8 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 	return unlocked;
 }
 
-static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
+static int vfio_pin_page_external(struct vfio_iommu *iommu,
+				  struct vfio_dma *dma, unsigned long vaddr,
 				  unsigned long *pfn_base, bool do_accounting)
 {
 	struct mm_struct *mm;
@@ -703,7 +839,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 	if (!mm)
 		return -ENODEV;
 
-	ret = vaddr_get_pfn(mm, vaddr, dma->prot, pfn_base);
+	ret = vaddr_get_pfn(iommu, dma, mm, vaddr, pfn_base);
 	if (!ret && do_accounting && !is_invalid_reserved_pfn(*pfn_base)) {
 		ret = vfio_lock_acct(dma, 1, true);
 		if (ret) {
@@ -811,8 +947,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
 		}
 
 		remote_vaddr = dma->vaddr + (iova - dma->iova);
-		ret = vfio_pin_page_external(dma, remote_vaddr, &phys_pfn[i],
-					     do_accounting);
+		ret = vfio_pin_page_external(iommu, dma, remote_vaddr,
+					     &phys_pfn[i], do_accounting);
 		if (ret)
 			goto pin_unwind;
 
@@ -1071,6 +1207,10 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *dma)
 {
 	WARN_ON(!RB_EMPTY_ROOT(&dma->pfn_list));
+	if (dma->pfnmap) {
+		vfio_unregister_vma_nb(dma->pfnmap->opaque);
+		kfree(dma->pfnmap);
+	}
 	vfio_unmap_unpin(iommu, dma, true);
 	vfio_unlink_dma(iommu, dma);
 	put_task_struct(dma->task);
@@ -1318,29 +1458,7 @@ static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
 			continue;
 		}
 
-		if (!RB_EMPTY_ROOT(&dma->pfn_list)) {
-			struct vfio_iommu_type1_dma_unmap nb_unmap;
-
-			if (dma_last == dma) {
-				BUG_ON(++retries > 10);
-			} else {
-				dma_last = dma;
-				retries = 0;
-			}
-
-			nb_unmap.iova = dma->iova;
-			nb_unmap.size = dma->size;
-
-			/*
-			 * Notify anyone (mdev vendor drivers) to invalidate and
-			 * unmap iovas within the range we're about to unmap.
-			 * Vendor drivers MUST unpin pages in response to an
-			 * invalidation.
-			 */
-			mutex_unlock(&iommu->lock);
-			blocking_notifier_call_chain(&iommu->notifier,
-						    VFIO_IOMMU_NOTIFY_DMA_UNMAP,
-						    &nb_unmap);
+		if (unmap_dma_pfn_list(iommu, dma, &dma_last, &retries)) {
 			mutex_lock(&iommu->lock);
 			goto again;
 		}
@@ -1404,7 +1522,7 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma,
 
 	while (size) {
 		/* Pin a contiguous chunk of memory */
-		npage = vfio_pin_pages_remote(dma, vaddr + dma->size,
+		npage = vfio_pin_pages_remote(iommu, dma, vaddr + dma->size,
 					      size >> PAGE_SHIFT, &pfn, limit);
 		if (npage <= 0) {
 			WARN_ON(!npage);
@@ -1660,7 +1778,7 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
 				size_t n = dma->iova + dma->size - iova;
 				long npage;
 
-				npage = vfio_pin_pages_remote(dma, vaddr,
+				npage = vfio_pin_pages_remote(iommu, dma, vaddr,
 							      n >> PAGE_SHIFT,
 							      &pfn, limit);
 				if (npage <= 0) {



* Re: [PATCH 0/3] vfio: Device memory DMA mapping improvements
  2021-02-12 19:27 [PATCH 0/3] vfio: Device memory DMA mapping improvements Alex Williamson
                   ` (2 preceding siblings ...)
  2021-02-12 19:28 ` [PATCH 3/3] vfio/type1: Implement vma registration and restriction Alex Williamson
@ 2021-02-12 20:57 ` Jason Gunthorpe
  3 siblings, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2021-02-12 20:57 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Fri, Feb 12, 2021 at 12:27:19PM -0700, Alex Williamson wrote:
> This series intends to improve some long standing issues with mapping
> device memory through the vfio IOMMU interface (ie. P2P DMA mappings).
> Unlike mapping DMA to RAM, we can't pin device memory, nor is it
> always accessible.  We attempt to tackle this (predominantly the
> first issue in this iteration) by creating a registration and
> notification interface through vfio-core, between the IOMMU backend
> and the bus driver.  This allows us to do things like automatically
> remove a DMA mapping to a device if it's closed by the user.  We also
> keep references to the device container such that it remains isolated
> while this mapping exists.
> 
> Unlike my previous attempt[1], this version works across containers.
> For example if a user has device1 with IOMMU context in container1
> and device2 in container2, a mapping of device2 memory into container1
> IOMMU context would be removed when device2 is released.
> 
> What I don't tackle here is when device memory is disabled, such as
> for a PCI device when the command register memory bit is cleared or
> while the device is in reset.  Ideally it seems like it might be
> nice to have IOMMU API interfaces that could remove r/w permissions
> from the IOTLB entry w/o removing it entirely, but I'm also unsure
> of the ultimate value in modifying the IOTLB entries at this point.
> 
> In the PCI example again, I'd expect a DMA to disabled or unavailable
> device memory to get an Unsupported Request response.  If we play
> with the IOTLB mapping, we might change this to an IOMMU fault for
> either page permissions or page not present, depending on how we
> choose to invalidate that entry.  However, it seems that a system that
> escalates an UR error to fatal, through things like firmware first
> handling, is just as likely to also make the IOMMU fault fatal.  Are
> there cases where we expect otherwise, and if not is there value to
> tracking device memory enable state to that extent in the IOMMU?
> 
> Jason, I'm also curious if this scratches your itch relative to your
> suggestion to solve this with dma-bufs, and if that's still your
> preference, I'd love an outline of how to accomplish the same with that
> method.

I will look at this more closely later, but given this is solving a
significant security problem and the patches now exist, I'm not
inclined to push too hard to do something different if this works OK.

That said, it is not great to see VFIO create its own little
dmabuf-like thing inside itself; in particular, if this were in core
code we could add a new vm_operations_struct member like:

   struct dma_buf *(*getdma_buf)(struct vm_area_struct *area);

And completely avoid a lot of the searching and fiddling with
ops. Maybe we can make this look closer to that ideal.

Jason


* Re: [PATCH 1/3] vfio: Introduce vma ops registration and notifier
  2021-02-12 19:27 ` [PATCH 1/3] vfio: Introduce vma ops registration and notifier Alex Williamson
@ 2021-02-12 21:20   ` Jason Gunthorpe
  2021-02-18  1:12     ` Jason Gunthorpe
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2021-02-12 21:20 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Fri, Feb 12, 2021 at 12:27:39PM -0700, Alex Williamson wrote:
> Create an interface through vfio-core where a vfio bus driver (ex.
> vfio-pci) can register the vm_operations_struct it uses to map device
> memory, along with a set of registration callbacks.  This allows
> vfio-core to expose interfaces for IOMMU backends to match a
> vm_area_struct to a bus driver and register a notifier for relevant
> changes to the device mapping.  For now we define only a notifier
> action for closing the device.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>  drivers/vfio/vfio.c  |  120 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/vfio.h |   20 ++++++++
>  2 files changed, 140 insertions(+)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 38779e6fd80c..568f5e37a95f 100644
> +++ b/drivers/vfio/vfio.c
> @@ -47,6 +47,8 @@ static struct vfio {
>  	struct cdev			group_cdev;
>  	dev_t				group_devt;
>  	wait_queue_head_t		release_q;
> +	struct list_head		vm_ops_list;
> +	struct mutex			vm_ops_lock;
>  } vfio;
>  
>  struct vfio_iommu_driver {
> @@ -2354,6 +2356,121 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
>  }
>  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
>  
> +struct vfio_vma_ops {
> +	const struct vm_operations_struct	*vm_ops;
> +	vfio_register_vma_nb_t			*reg_fn;
> +	vfio_unregister_vma_nb_t		*unreg_fn;
> +	struct list_head			next;
> +};
> +
> +int vfio_register_vma_ops(const struct vm_operations_struct *vm_ops,
> +			  vfio_register_vma_nb_t *reg_fn,
> +			  vfio_unregister_vma_nb_t *unreg_fn)

This just feels a little bit too complicated.

I've recently learned from Daniel that we can use the address_space
machinery to drive the zap_vma_ptes() via unmap_mapping_range(). This
technique replaces all the open, close and vma_list logic in vfio_pci

If we don't need open anymore, we could do something like this:

 static const struct vm_operations_struct vfio_pci_mmap_ops = {
        .open = vfio_pfn_open, // implemented in vfio.c
        .close = vfio_pfn_close,
        .fault = vfio_pci_mmap_fault,
 };

Then we could code the function needed:

struct vfio_pfn_range_handle {
       struct kref kref;
       struct vfio_device *vfio;
       struct notifier_block invalidation_cb;
       unsigned long flags;
};

struct vfio_pfn_range_handle *get_pfn_range(struct vm_area_struct *vma)
{
       struct vfio_pfn_range_handle *handle;

       if (vma->vm_ops->open != vfio_pfn_open)
              return NULL;

       handle = vma->vm_private_data;
       if (test_bit(DMA_STOPPED, &handle->flags))
              return NULL;
       kref_get(&handle->kref);
       return handle;
}

Where the common open/close only inc/dec the kref, and all 'vfio'
VMAs always have a pointer to the same vfio_pfn_range_handle in their
vm_private_data.
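
For concreteness, a minimal sketch of those common helpers (all names
hypothetical; assumes the handle pins the vfio_device and something
like vfio_device_put() exists to drop that pin):

static void vfio_pfn_range_release(struct kref *kref)
{
       struct vfio_pfn_range_handle *handle =
               container_of(kref, struct vfio_pfn_range_handle, kref);

       vfio_device_put(handle->vfio);  /* drop the device pin */
       kfree(handle);
}

/* .open: the VMA was duplicated (fork, split), take another reference */
static void vfio_pfn_open(struct vm_area_struct *vma)
{
       struct vfio_pfn_range_handle *handle = vma->vm_private_data;

       kref_get(&handle->kref);
}

static void vfio_pfn_close(struct vm_area_struct *vma)
{
       struct vfio_pfn_range_handle *handle = vma->vm_private_data;

       kref_put(&handle->kref, vfio_pfn_range_release);
}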

The vm_pgoff is already pointing at the physical pfn, so every part of
the system can get the information it needs fairly trivially.
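
i.e. something like (the standard PFNMAP arithmetic, nothing
vfio-specific):

/* vm_pgoff holds the base pfn of the mapping, so any user address in
 * the VMA translates to its pfn with plain arithmetic */
static unsigned long vma_to_pfn(struct vm_area_struct *vma,
                                unsigned long addr)
{
       return vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
}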

Some stop access function is pretty simple looking

void stop_access(struct vfio_pfn_range_handle *handle)
{
      set_bit(DMA_STOPPED, &handle->flags);
      unmap_mapping_range(handle->vfio->[..]->inode, 0, max, false);
      srcu_notifier_call_chain(handle->invalidation_cb, VFIO_VMA_NOTIFY_CLOSE, NULL);
}

(well, have to sort out the locking some more, but that is the
 general idea)

I think that would remove a lot of the code added here and act a lot
closer to how a future dmabuf could act.

Also, the nvlink vm_ops will need to be updated as well.

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] vfio/type1: Implement vma registration and restriction
  2021-02-12 19:28 ` [PATCH 3/3] vfio/type1: Implement vma registration and restriction Alex Williamson
@ 2021-02-12 22:47   ` kernel test robot
  0 siblings, 0 replies; 11+ messages in thread
From: kernel test robot @ 2021-02-12 22:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kbuild-all, clang-built-linux, cohuck, kvm, linux-kernel, jgg, peterx

[-- Attachment #1: Type: text/plain, Size: 2633 bytes --]

Hi Alex,

I love your patch! Perhaps something to improve:

[auto build test WARNING on vfio/next]
[also build test WARNING on next-20210211]
[cannot apply to v5.11-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Alex-Williamson/vfio-Device-memory-DMA-mapping-improvements/20210213-033317
base:   https://github.com/awilliam/linux-vfio.git next
config: arm-randconfig-r032-20210212 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project c9439ca36342fb6013187d0a69aef92736951476)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/0day-ci/linux/commit/f983107aef5b1386ae601ee8573da872f1484633
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Alex-Williamson/vfio-Device-memory-DMA-mapping-improvements/20210213-033317
        git checkout f983107aef5b1386ae601ee8573da872f1484633
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> drivers/vfio/vfio_iommu_type1.c:509:18: warning: no previous prototype for function 'pfnmap_find_dma' [-Wmissing-prototypes]
   struct vfio_dma *pfnmap_find_dma(struct pfnmap_obj *pfnmap)
                    ^
   drivers/vfio/vfio_iommu_type1.c:509:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   struct vfio_dma *pfnmap_find_dma(struct pfnmap_obj *pfnmap)
   ^
   static 
   1 warning generated.


vim +/pfnmap_find_dma +509 drivers/vfio/vfio_iommu_type1.c

   508	
 > 509	struct vfio_dma *pfnmap_find_dma(struct pfnmap_obj *pfnmap)
   510	{
   511		struct rb_node *n;
   512	
   513		for (n = rb_first(&pfnmap->iommu->dma_list); n; n = rb_next(n)) {
   514			struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
   515	
   516			if (dma->pfnmap == pfnmap)
   517				return dma;
   518		}
   519	
   520		return NULL;
   521	}
   522	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34203 bytes --]


* Re: [PATCH 1/3] vfio: Introduce vma ops registration and notifier
  2021-02-12 21:20   ` Jason Gunthorpe
@ 2021-02-18  1:12     ` Jason Gunthorpe
  2021-02-18 21:56       ` Alex Williamson
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2021-02-18  1:12 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Fri, Feb 12, 2021 at 05:20:57PM -0400, Jason Gunthorpe wrote:
> On Fri, Feb 12, 2021 at 12:27:39PM -0700, Alex Williamson wrote:
> > Create an interface through vfio-core where a vfio bus driver (ex.
> > vfio-pci) can register the vm_operations_struct it uses to map device
> > memory, along with a set of registration callbacks.  This allows
> > vfio-core to expose interfaces for IOMMU backends to match a
> > vm_area_struct to a bus driver and register a notifier for relevant
> > changes to the device mapping.  For now we define only a notifier
> > action for closing the device.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >  drivers/vfio/vfio.c  |  120 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/vfio.h |   20 ++++++++
> >  2 files changed, 140 insertions(+)
> > 
> > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > index 38779e6fd80c..568f5e37a95f 100644
> > +++ b/drivers/vfio/vfio.c
> > @@ -47,6 +47,8 @@ static struct vfio {
> >  	struct cdev			group_cdev;
> >  	dev_t				group_devt;
> >  	wait_queue_head_t		release_q;
> > +	struct list_head		vm_ops_list;
> > +	struct mutex			vm_ops_lock;
> >  } vfio;
> >  
> >  struct vfio_iommu_driver {
> > @@ -2354,6 +2356,121 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
> >  }
> >  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
> >  
> > +struct vfio_vma_ops {
> > +	const struct vm_operations_struct	*vm_ops;
> > +	vfio_register_vma_nb_t			*reg_fn;
> > +	vfio_unregister_vma_nb_t		*unreg_fn;
> > +	struct list_head			next;
> > +};
> > +
> > +int vfio_register_vma_ops(const struct vm_operations_struct *vm_ops,
> > +			  vfio_register_vma_nb_t *reg_fn,
> > +			  vfio_unregister_vma_nb_t *unreg_fn)
> 
> This just feels a little bit too complicated
> 
> I've recently learned from Daniel that we can use the address_space
> machinery to drive the zap_vma_ptes() via unmap_mapping_range(). This
> technique replaces all the open, close and vma_list logic in vfio_pci

Here is my effort to make rdma use this, it removes a lot of ugly code:

https://github.com/jgunthorpe/linux/commits/rdma_addr_space

Still needs some more detailed testing.

This gives an option to detect vfio VMAs by checking

   if (vma->vm_file &&
       file_inode(vma->vm_file) &&
       file_inode(vma->vm_file)->i_sb->s_type == vfio_fs_type)

And all vfio VMAs can have some consistent vm_private_data, or at
worst a consistent extended vm operations struct.

Jason


* Re: [PATCH 1/3] vfio: Introduce vma ops registration and notifier
  2021-02-18  1:12     ` Jason Gunthorpe
@ 2021-02-18 21:56       ` Alex Williamson
  2021-02-18 23:04         ` Jason Gunthorpe
  0 siblings, 1 reply; 11+ messages in thread
From: Alex Williamson @ 2021-02-18 21:56 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: cohuck, kvm, linux-kernel, peterx

On Wed, 17 Feb 2021 21:12:09 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Feb 12, 2021 at 05:20:57PM -0400, Jason Gunthorpe wrote:
> > On Fri, Feb 12, 2021 at 12:27:39PM -0700, Alex Williamson wrote:  
> > > Create an interface through vfio-core where a vfio bus driver (ex.
> > > vfio-pci) can register the vm_operations_struct it uses to map device
> > > memory, along with a set of registration callbacks.  This allows
> > > vfio-core to expose interfaces for IOMMU backends to match a
> > > vm_area_struct to a bus driver and register a notifier for relevant
> > > changes to the device mapping.  For now we define only a notifier
> > > action for closing the device.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > >  drivers/vfio/vfio.c  |  120 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  include/linux/vfio.h |   20 ++++++++
> > >  2 files changed, 140 insertions(+)
> > > 
> > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> > > index 38779e6fd80c..568f5e37a95f 100644
> > > +++ b/drivers/vfio/vfio.c
> > > @@ -47,6 +47,8 @@ static struct vfio {
> > >  	struct cdev			group_cdev;
> > >  	dev_t				group_devt;
> > >  	wait_queue_head_t		release_q;
> > > +	struct list_head		vm_ops_list;
> > > +	struct mutex			vm_ops_lock;
> > >  } vfio;
> > >  
> > >  struct vfio_iommu_driver {
> > > @@ -2354,6 +2356,121 @@ struct iommu_domain *vfio_group_iommu_domain(struct vfio_group *group)
> > >  }
> > >  EXPORT_SYMBOL_GPL(vfio_group_iommu_domain);
> > >  
> > > +struct vfio_vma_ops {
> > > +	const struct vm_operations_struct	*vm_ops;
> > > +	vfio_register_vma_nb_t			*reg_fn;
> > > +	vfio_unregister_vma_nb_t		*unreg_fn;
> > > +	struct list_head			next;
> > > +};
> > > +
> > > +int vfio_register_vma_ops(const struct vm_operations_struct *vm_ops,
> > > +			  vfio_register_vma_nb_t *reg_fn,
> > > +			  vfio_unregister_vma_nb_t *unreg_fn)  
> > 
> > This just feels a little bit too complicated
> > 
> > I've recently learned from Daniel that we can use the address_space
> > machinery to drive the zap_vma_ptes() via unmap_mapping_range(). This
> > technique replaces all the open, close and vma_list logic in vfio_pci  
> 
> Here is my effort to make rdma use this, it removes a lot of ugly code:
> 
> https://github.com/jgunthorpe/linux/commits/rdma_addr_space
> 
> Still needs some more detailed testing.
> 
> This gives an option to detect vfio VMAs by checking
> 
>    if (vma->vm_file &&
>        file_inode(vma->vm_file) &&
>        file_inode(vma->vm_file)->i_sb->s_type == vfio_fs_type)
> 
> And all vfio VMAs can have some consistent vm_private_data, or at
> worst a consistent extended vm operations struct.

Looks pretty slick.  I won't claim it's fully gelled in my head yet,
but AIUI you're creating these inodes on your new pseudo fs and
associating it via the actual user fd via the f_mapping pointer, which
allows multiple fds to associate and address space back to this inode
when you want to call unmap_mapping_range().  That clarifies from the
previous email how we'd store the inode on the vfio_device without
introducing yet another tracking list for device fds.  I'll try to
piece together something similar for vfio, especially if we can avoid
that nasty lock switcheroo we copied from
uverbs_user_mmap_disassociate().  Thanks,

Alex



* Re: [PATCH 1/3] vfio: Introduce vma ops registration and notifier
  2021-02-18 21:56       ` Alex Williamson
@ 2021-02-18 23:04         ` Jason Gunthorpe
  2021-02-19 22:02           ` Alex Williamson
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2021-02-18 23:04 UTC (permalink / raw)
  To: Alex Williamson; +Cc: cohuck, kvm, linux-kernel, peterx

On Thu, Feb 18, 2021 at 02:56:06PM -0700, Alex Williamson wrote:

> Looks pretty slick.  I won't claim it's fully gelled in my head yet,
> but AIUI you're creating these inodes on your new pseudo fs and
> associating it with the actual user fd via the f_mapping pointer, which
> allows multiple fds to associate an address space back to this inode
> when you want to call unmap_mapping_range().  

Yes, from what I can tell all the fs/inode stuff is just mandatory
overhead to get a unique address_space pointer, as that is the only
thing this is actually using.
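
The minimal form of that overhead, if I read it right, is just the
usual pseudo-fs boilerplate (sketch only; the magic value and names
here are made up):

static int vfio_fs_init_fs_context(struct fs_context *fc)
{
       return init_pseudo(fc, VFIO_PFN_MAGIC) ? 0 : -ENOMEM;
}

static struct file_system_type vfio_fs_type = {
       .name            = "vfio",
       .init_fs_context = vfio_fs_init_fs_context,
       .kill_sb         = kill_anon_super,
};

/* one kern_mount(&vfio_fs_type) at module init, then per device: */
static struct inode *vfio_alloc_mapping_inode(struct vfsmount *mnt)
{
       /* alloc_anon_inode() yields a unique inode->i_mapping, which
        * is the address_space unmap_mapping_range() keys off of */
       return alloc_anon_inode(mnt->mnt_sb);
}

with each device fd's file->f_mapping pointed at that inode's i_mapping
at open time, so every resulting VMA inherits the shared address_space.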

I have to check the mmap flow more carefully, I recall pointing to an
existing race here with Daniel, but the general idea should hold.

> That clarifies from the previous email how we'd store the inode on
> the vfio_device without introducing yet another tracking list for
> device fds.

Right, you can tell from the vma what inode it is for, and the inode
can tell you if it is a VFIO VMA or not, so no tracking lists needed
at all.
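
As a sketch (assuming vfio-core registers a vfio_fs_type pseudo fs as
above; no such symbol exists today):

/* true means vm_private_data follows the common vfio layout */
static bool vfio_vma_is_vfio(struct vm_area_struct *vma)
{
       return vma->vm_file && file_inode(vma->vm_file) &&
              file_inode(vma->vm_file)->i_sb->s_type == &vfio_fs_type;
}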

Jason


* Re: [PATCH 1/3] vfio: Introduce vma ops registration and notifier
  2021-02-18 23:04         ` Jason Gunthorpe
@ 2021-02-19 22:02           ` Alex Williamson
  0 siblings, 0 replies; 11+ messages in thread
From: Alex Williamson @ 2021-02-19 22:02 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: cohuck, kvm, linux-kernel, peterx

On Thu, 18 Feb 2021 19:04:04 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Thu, Feb 18, 2021 at 02:56:06PM -0700, Alex Williamson wrote:
> 
> > Looks pretty slick.  I won't claim it's fully gelled in my head yet,
> > but AIUI you're creating these inodes on your new pseudo fs and
> > associating it with the actual user fd via the f_mapping pointer, which
> > allows multiple fds to associate an address space back to this inode
> > when you want to call unmap_mapping_range().    
> 
> Yes, from what I can tell all the fs/inode stuff is just mandatory
> overhead to get a unique address_space pointer, as that is the only
> thing this is actually using.
> 
> I have to check the mmap flow more carefully, I recall pointing to an
> existing race here with Daniel, but the general idea should hold.
> 
> > That clarifies from the previous email how we'd store the inode on
> > the vfio_device without introducing yet another tracking list for
> > device fds.  
> 
> Right, you can tell from the vma what inode it is for, and the inode
> can tell you if it is a VFIO VMA or not, so no tracking lists needed
> at all.

Seems to be a nice cleanup for vfio as well, more testing and analysis
required, but here are a few (4) wip commits that re-implement the
current vma zapping scheme following your example: 

https://github.com/awilliam/linux-vfio/commits/vfio-unmap-mapping-range

Thanks,
Alex



end of thread, other threads:[~2021-02-19 22:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-12 19:27 [PATCH 0/3] vfio: Device memory DMA mapping improvements Alex Williamson
2021-02-12 19:27 ` [PATCH 1/3] vfio: Introduce vma ops registration and notifier Alex Williamson
2021-02-12 21:20   ` Jason Gunthorpe
2021-02-18  1:12     ` Jason Gunthorpe
2021-02-18 21:56       ` Alex Williamson
2021-02-18 23:04         ` Jason Gunthorpe
2021-02-19 22:02           ` Alex Williamson
2021-02-12 19:27 ` [PATCH 2/3] vfio/pci: Implement vm_ops registration Alex Williamson
2021-02-12 19:28 ` [PATCH 3/3] vfio/type1: Implement vma registration and restriction Alex Williamson
2021-02-12 22:47   ` kernel test robot
2021-02-12 20:57 ` [PATCH 0/3] vfio: Device memory DMA mapping improvements Jason Gunthorpe
